
An Energy-efficient Task Scheduler for Multi-core Platforms with per-core DVFS Based on Task Characteristics

Ching-Chi Lin, You-Cheng Syu, Chao-Jui Chang, Jan-Jan Wu, Pangfeng Liu, Po-Wen Cheng and Wei-Te Hsu

Abstract—Energy-efficient task scheduling is a fundamental issue in many application domains, such as energy conservation for mobile devices and the operation of green computing data centers. Modern processors support dynamic voltage and frequency scaling (DVFS) on a per-core basis, i.e., the CPU can adjust the voltage or frequency of each core. As a result, the cores in a processor may have different computing power and energy consumption. To conserve energy in multi-core platforms, we propose task scheduling algorithms that leverage per-core DVFS and achieve a balance between performance and energy consumption. We consider two task execution modes: the batch mode, which runs jobs in batches; and the online mode, in which jobs with different time constraints, arrival times, and computation workloads co-exist in the system.

For tasks executed in the batch mode, we propose an algorithm that finds the optimal scheduling policy; and for the online mode, we present a heuristic algorithm that determines the execution order and processing speed of tasks in an online fashion. The heuristic ensures that the total cost is minimal for every time interval during a task’s execution.

Keywords-Energy-efficient; Task scheduling; Task Characteristics; Multi-core; DVFS

I. INTRODUCTION

Energy-efficient task scheduling is a fundamental issue in many application domains, e.g., energy conservation for mobile devices and the operation of green computing data centers. A number of studies have proposed energy-efficient techniques [1], [2]. One well-known technique is called dynamic voltage and frequency scaling (DVFS), which achieves energy savings by scaling down a core's frequency and thereby reducing the core's dynamic power.

Modern processors support DVFS on a per-core basis, i.e., the CPU can adjust the voltage or frequency of each core. As a result, the cores in a processor may have different computing power and energy consumption. However, for the same core, increased computing power means higher energy consumption.

The challenge is to find a good balance between performance and energy consumption.

Many existing works focus on specific application domains, such as real-time systems [3], [4], [5], multimedia applications [6], and mobile devices [7]. In this paper, we consider a broader class of tasks. Specifically, we classify task execution scenarios into two modes: the batch mode and the online mode. The workload of the former comprises batches of jobs, while that of the latter consists of jobs with different arrival times and time constraints. We divide the jobs in the online mode into two categories: interactive and non-interactive tasks. An interactive task is initiated by a user and must be completed as soon as possible, while a non-interactive task may not have a strict deadline or response time constraint. In the online mode, tasks can arrive at any time.

An example of the online mode is an online judging system where users submit their answers or codes to the server. After performing some computations, the server returns the scores or indicates the correctness of the submitted programs. In this scenario, user requests are interactive tasks that require short response times. Note that the response time refers to the acknowledgement of receipt of the user’s data, not the time taken to return the scores. By contrast, the computation of user data in non-interactive tasks is not subject to time constraints.

To conserve energy in multi-core platforms, we propose task scheduling algorithms that leverage per-core DVFS and achieve a balance between performance and energy consumption. In our previous work [8], we made the following contributions.

We present a task scheduling strategy that solves three important issues simultaneously: the assignment of tasks to CPU cores, the execution order of tasks, and the CPU processing rate for the execution of each task. To the best of our knowledge, no previous work has tried to solve the three issues simultaneously.

To formulate the task scheduling problems, we propose a task model, a CPU processing rate model, an energy consumption model, and a cost function. The results of simulations and experiments performed on a multi-core x86 machine demonstrate the accuracy of the models.

For the execution of tasks in the batch mode, we propose an algorithm called Workload Based Greedy (WBG), which finds the optimal scheduling policy for both single-core and multi-core processors. Our experiment results show that, in terms of energy consumption, WBG achieves 46% and 27% improvement over two existing algorithms respectively, with only a 4% slowdown and 13% speedup in the execution time.

To execute tasks in the online mode, we propose a heuristic algorithm called Least Marginal Cost (LMC), which assigns interactive and non-interactive tasks to cores. It also determines the processing speeds that minimize the total cost of every time interval during a task's execution in an online fashion.

In this paper, we improve on our previous work with the following extensions.

For the batch mode, we define the "dominating position set", which determines the frequency of each batch task according to its position in the task queue. We also derive an algorithm that computes the dominating position ranges efficiently; it runs in Θ(|P|), where P is the set of discrete processing rates available on a core.

For the online mode, we propose a method to compute the total cost efficiently under dynamic task insertion and deletion. The time complexity of the cost computation is Θ(1). With this algorithm, we reduce the overhead of our Least Marginal Cost heuristic.

The remainder of this paper is organized as follows. In the next section, we formally define the models for executing tasks, determining the core processing rates, and reducing energy consumption. In Section III, we discuss the proposed Workload Based Greedy algorithm, which derives the optimal scheduling policy for the batch mode; we also provide a theoretical analysis to prove the optimality of the algorithm. In Section IV, we discuss the proposed heuristic algorithm, Least Marginal Cost, for the online mode. The experiment results are presented in Section V; Section VI reviews related work; and Section VII contains our concluding remarks.

II. MODELS

A. Task Model

A task comprises a sequence of instructions to be executed by a processor. We define a task jk in a task set J as a tuple jk = (Lk, Ak, Dk), where Lk is the number of CPU cycles required to complete jk, Ak is the arrival time of jk, and Dk is the deadline of jk. If jk has a specific deadline, Dk > Ak ≥ 0; otherwise, we set Dk to infinity, which indicates that jk is not subject to a time constraint.

For the batch mode, we make two assumptions about the tasks to be executed. First, tasks are non-preemptive, which means that a running task cannot be interrupted by other tasks. In practice, this non-preemptive property reduces the overheads of task switching and migration. Second, we assume that tasks are independent and can be scheduled in an arbitrary order. As the tasks in a batch are not interdependent and can be executed simultaneously, we assume that the arrival time, Ak, of every task is 0. The implication is that the scheduler has the timing information of all the tasks to be processed and can run the tasks in any order based on the current scheduling policy.

For the online mode, we divide tasks into two categories according to their time constraints. Interactive tasks are those with early and firm deadlines, so the response time is crucial for such tasks. Non-interactive tasks are those with a late deadline or no deadline. We also make the following assumptions about online-mode tasks. (1) The number of cycles needed to complete a task is known, because it can be estimated by profiling. (2) The tasks are independent of each other. (3) A task can preempt other tasks that have a lower priority. In the online mode, a task's priority depends on its category: interactive tasks have higher priority than non-interactive tasks.

B. Processing Rate

Modern processors support dynamic voltage and frequency scaling (DVFS) on a per-core basis; therefore, each core in a processor may have its own processing rate or frequency.

Let P = {p1, p2, p3, . . .} be a non-empty set of discrete processing rates that a core can utilize, determined by the hardware, with 0 < p1 < p2 < p3 < . . . < p|P|. We use pjk from the set P to denote the processing rate of a task jk. Different processors provide different processing rates. For example, the Intel i7-950 processor offers discrete speeds of 1.6, 1.73, . . . , up to 3.06 GHz, while the ARM Exynos-4412 CPU offers 0.2, 0.3, . . . , up to 1.7 GHz.

We also make some assumptions about the processing rate for the batch mode and online mode. In the batch mode, the processing rate of a CPU core does not change during a task’s execution. A core only switches to a new frequency when it starts a new task. By contrast, in the online mode, a core can change its processing rate any time based on the scheduling decisions.

C. Energy Consumption

For a task jk, let ek denote the energy consumption in joules, tk the execution time in seconds, and pjk the processing rate used to execute jk. Recall that Lk is the number of cycles needed to complete task jk. We define E(p) and T(p) as the energy and the time required to execute one cycle at processing rate p on a CPU core, with the property that 0 < E(p1) < E(p2) < E(p3) < . . . and 0 < . . . < T(p3) < T(p2) < T(p1). Because we assume that a core's processing rate is fixed while running a task in the batch mode, we can formulate the energy consumption ek and execution time tk of task jk as shown in Equations 1 and 2.

ek = Lk E(pjk)   (1)

tk = Lk T(pjk)   (2)
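As a concrete illustration, the following minimal sketch (our own, not code from the paper) evaluates Equations 1 and 2; the per-cycle values are those reported later in Table II.

    # Sketch of Equations 1 and 2. E and T map a processing rate to the
    # per-cycle energy and time; the values below are taken from Table II.
    E = {1.6: 3.375, 2.0: 4.22, 2.4: 5.0, 2.8: 6.0, 3.0: 7.1}
    T = {1.6: 0.625, 2.0: 0.5, 2.4: 0.42, 2.8: 0.36, 3.0: 0.33}

    def energy_and_time(L_k, p):
        """Energy e_k and execution time t_k of a task of L_k cycles
        run at the fixed processing rate p (batch mode)."""
        return L_k * E[p], L_k * T[p]   # Equations 1 and 2

    e_k, t_k = energy_and_time(1e9, 2.4)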

III. TASK SCHEDULING IN THE BATCH MODE

In this section, we discuss our energy-efficient task scheduling approach for batch-mode execution. We consider running tasks with and without deadlines in both single-core and multi-core environments; as a result, we have four combinations of tasks and environments. Below, we formally define the energy-efficient task scheduling problem in each of the four scenarios, and present our theoretical findings for the defined problems.

A. Tasks with Deadlines

We define the problem of scheduling tasks with deadlines on a single core as follows. Given a set of tasks jk, the numbers of execution cycles needed Lk, the tasks' deadlines Dk, the possible processing rates P of the core, the energy consumption and time consumption functions E and T, and the total energy budget E, the goal is to determine the execution order of the tasks and their processing rates such that every task is completed before its deadline and the overall energy consumption is less than E.


We call the above problem Deadline-SingleCore, and reduce the Partition problem to it to show that Deadline-SingleCore is NP-complete. Let A = {a1, . . . , an} be a set of positive integers. The Partition problem asks whether A can be partitioned into two subsets such that the sums of the numbers in the two subsets are equal. Given a problem instance A = {a1, . . . , an}, we construct a Deadline-SingleCore instance such that one problem can be solved if and only if the other has a solution.

Theorem 1: Deadline-SingleCore is NP-complete.

Proof: We construct a problem instance of Deadline-SingleCore as follows. There are n tasks j1, . . . , jn, and the number of cycles needed for task ji is Li = ai, as in the given Partition instance. We use S = Σ_{i=1}^{n} ai to denote the total number of cycles required to complete the n tasks. There are only two processing rates: low speed pl and high speed ph; the latter is twice as fast as the former. We also assume that T(pl) = 2, T(ph) = 1, E(ph) = 4 and E(pl) = 1, based on the assumption that the dynamic part of the energy consumption is proportional to the square of the frequency. This assumption follows the classical models in the literature [9]. The time constraint is 1.5S and the energy constraint is 2.5S. The deadline of every task is 1.5S.

Now we have 1.5S time and 2.5S energy to run the n tasks. Because T(ph) and T(pl) equal 1 and 2 respectively, we need to select a set of tasks whose cycles sum to at least S/2 to run at high speed ph, so that all the tasks can be completed within 1.5S time. In addition, because E(ph) and E(pl) are 4 and 1 respectively, we need to select a set of tasks whose cycles sum to at least S/2 to run at low speed pl, so that the energy consumption of all the tasks does not exceed 2.5S.

We conclude that the total number of cycles required for the tasks that run at high speed ph is the same as that for the tasks that run at low speed pl. Hence, the Partition problem can be solved if and only if a solution is found for our Deadline-SingleCore problem. The theorem follows.

Theorem 1 states that the problem of deciding the processing rates for tasks with deadlines under time and energy constraints on a single core is NP-complete. The single-core result can be extended to multiple cores as follows. We define the Deadline-MultiCore problem in a similar way to the Deadline-SingleCore problem, and prove that it is NP-complete. We consider only two cores, both running at the same fixed speed p with time function T(p); the deadline constraint is set to S/2, and we do not consider the energy constraint. The problem is then exactly the Partition problem; therefore, Deadline-MultiCore is also NP-complete.

Theorem 2: Deadline-MultiCore is NP-complete.

B. Tasks without Deadlines on a Single Core Platform

Next, we consider the problem of scheduling tasks without deadlines. Given a set of tasks and the number of cycles needed to process them, the goal is to find the execution order and the processing rate for each task so that the overall "cost" is minimized.

If we only consider energy consumption, we could run every task at the lowest processing rate to minimize the energy used, but doing so would degrade performance. On the other hand, running every task at the highest processing rate to minimize the total execution time would waste energy. Thus, the cost function must consider both the energy consumption and the execution time. Our cost function converts the two quantities into monetary values. We define the cost of a task jk as the sum of its energy cost and temporal cost. The energy cost Ck,e of task jk is the amount of money paid for the energy used to execute the task. Recall that E(pjk) is the amount of energy in joules required to execute one cycle with processing rate pjk on a core. Therefore, the total energy in joules needed to run task jk with processing rate pjk is Lk E(pjk), and the amount of money paid for the energy is Re Lk E(pjk), as shown in Equation 3, where Re is a positive constant: the cost of one joule of energy. Re can be regarded as the amount paid for one joule of energy in an electricity bill.

Ck,e = Re Lk E(pjk)   (3)

Similarly, we define the temporal cost Ck,t of task jk as the amount of money paid to compensate a user for waiting for his/her job to be completed. Recall that T(pjk) is the time in seconds required to execute one cycle with processing rate pjk on a core, so the total time needed to run a task jk with processing rate pjk is Lk T(pjk). Without loss of generality, we assume that the execution order of the tasks is j1, . . . , jn; therefore, the turnaround time of task jk comprises the time spent waiting for j1, . . . , jk−1 to be completed plus the execution time of jk itself. As a result, the turnaround time of jk is Σ_{i=1}^{k} Li T(pji), and the temporal cost of task jk is Rt Σ_{i=1}^{k} Li T(pji), as shown in Equation 4, where Rt is also a positive constant: the amount paid for every second a user has to wait for the execution of his/her task. Rt can be regarded as an opportunity cost, i.e., the amount of money we could earn if we could move the resources elsewhere. Alternatively, Rt can be thought of as the amount a user is willing to pay for a computing service, such as Amazon EC2.

Ck,t = Rt Σ_{i=1}^{k} Li T(pji)   (4)

To obtain Ck, the cost of task jk, we combine the energy cost Ck,e and the temporal cost Ck,t. Using the weighted sum of the energy and the flow time as the cost objective function follows previous works [10], [11]. The total cost of all tasks, denoted as C, is calculated by Equation 8.

Ck = Ck,e + Ck,t   (5)
   = Re Lk E(pjk) + Rt Σ_{i=1}^{k} Li T(pji)   (6)

C = Σ_{k=1}^{n} Ck   (7)
  = Σ_{k=1}^{n} (Re Lk E(pjk) + Rt Σ_{i=1}^{k} Li T(pji))   (8)

Equation 8 is difficult to analyze due to the interaction between a task and all the tasks ahead of it in the execution sequence. We consider this problem from another perspective.

Instead of computing the waiting time caused by other tasks (the second term in Equation 6), we compute the amount of delay that a task causes for other tasks. Consider a task jk. If jk runs at processing rate pjk, the energy cost will be Re Lk E(pjk), and the time cost will be (n − k + 1) Rt Lk T(pjk) for itself and the tasks after it. That is, the temporal cost jk induces on itself is Rt Lk T(pjk), the cost it induces on task jk+1 is Rt Lk T(pjk), and so on. We can rewrite Equation 8 as follows:

C = Σ_{k=1}^{n} (Re Lk E(pjk) + (n − k + 1) Rt Lk T(pjk))   (9)
  = Σ_{k=1}^{n} (Re E(pjk) + (n − k + 1) Rt T(pjk)) Lk   (10)
  = Σ_{k=1}^{n} C(k, pjk) Lk   (11)

Note that we define C(k, p) as

C(k, p) = Re E(p) + (n − k + 1) Rt T(p)   (12)

Now we can rewrite Equation 8 as follows:

C = Σ_{k=1}^{n} C(k, pjk) Lk   (13)

Equation 13 shows that when we want to minimize C(k, pjk) in order to minimize C, it is not necessary to consider Lk, the number of cycles needed to execute task jk. We only need to find the processing rate pjk that minimizes C(k, pjk) for each k. In other words, the minimum value of C(k, pjk) only involves k, the position of the task in the execution sequence, and is independent of the task assigned to that position.

Lemma 1: The decision about the processing rate pjk required to minimize C(k, pjk) depends only on k, the position of the task in the execution sequence.

Lemma 1 implies that we can calculate the minimum C(k, pjk) for each k beforehand if P, E, T, Re, and Rt are known. As a result, we can find the optimal pjk that minimizes each C(k, pjk), and then determine the best processing rates without any knowledge of the workload.
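As an illustration of Lemma 1, the table of best rates can be precomputed in one pass over the positions; a sketch (our own code, reusing per-cycle lookup tables E and T as above):

    def best_rates(P, E, T, Re, Rt, n):
        """For each position k = 1..n, tabulate the rate p minimizing
        C(k, p) = Re*E(p) + (n - k + 1)*Rt*T(p) (Equation 12).
        The table depends only on k, never on the task placed there."""
        return {k: min(P, key=lambda p: Re * E[p] + (n - k + 1) * Rt * T[p])
                for k in range(1, n + 1)}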

Lemma 2: Let C(k) = min_{p∈P} C(k, p). Then C(k) is a decreasing function of k, i.e., C(k + 1) < C(k).

Proof: Let the optimal processing rate of C(k) be p. If we use p as the processing rate at position k + 1, then C(k + 1, p) − C(k, p) = −Rt T(p) < 0. Therefore, C(k + 1) ≤ C(k + 1, p) < C(k, p) = C(k).

Lemma 3: For any four real numbers a, b, x, y with a ≥ b and x ≥ y, we have ax + by ≥ ay + bx.

Proof:

(a − b)(x − y) ≥ 0   (14)
=⇒ ax − ay − bx + by ≥ 0   (15)
=⇒ ax + by ≥ ay + bx   (16)

Theorem 3: There exists an optimal solution with the minimum cost in which the tasks are in non-decreasing order of the number of cycles.

Proof: Recall that C(k) = min_{p∈P} C(k, p). From Lemma 1, we know that when the task execution order is j1, . . . , jn, the minimum total cost is

C = Σ_{k=1}^{n} C(k) Lk   (17)

From Lemma 2, we know that C(k) is a decreasing function of k; and from Lemma 3, we know that the cost will not increase if a task with a small number of cycles is swapped with a task in front of it that has more cycles. By repeating this swap on an optimal solution until no such pair remains, the tasks end up in non-decreasing order of the number of cycles required to complete them, and the schedule is still optimal.

To eliminate n (the number of tasks) for generality, we define LBk, jBk, CB(k, p), and CB(k) as follows ("B" stands for backward):

LBk = L_{n−k+1}   (18)
jBk = j_{n−k+1}   (19)
CB(k, p) = C(n − k + 1, p) = Re E(p) + k Rt T(p)   (20)
CB(k) = min_{p∈P} CB(k, p)   (21)

Define fi(k) = CB(k, pi) = (Re E(pi)) + (Rt T(pi)) k. By definition, pi is the best choice for jBk if and only if fi(k) = min_s fs(k). By Inequality 25, for any two different processing rates pa and pb with pa < pb, pb is no worse than pa if and only if k ≥ Re(E(pb) − E(pa)) / (Rt(T(pa) − T(pb))):

fa(k) ≥ fb(k)   (22)
⇐⇒ Re E(pa) + Rt T(pa) k ≥ Re E(pb) + Rt T(pb) k   (23)
⇐⇒ Rt T(pa) k − Rt T(pb) k ≥ Re E(pb) − Re E(pa)   (24)
⇐⇒ k ≥ Re(E(pb) − E(pa)) / (Rt(T(pa) − T(pb)))   (25)

Define Dp, the "dominating position set" of p, as the set of k such that p is the best choice for jBk (choosing the higher processing rate in case of a tie). Thus Dp1, Dp2, . . . , Dp|P| form a partition of the natural numbers N. Finding Dp for each p ∈ P is equivalent to finding the lower envelope of f1, f2, f3, . . . , f|P|. Because the fi are linear functions of the form y = b + ax, this is in turn equivalent to finding the lower (convex) hull in the dual space (transform the line y = b + ax to the point (a, b)), and the elements of each dominating position set are consecutive. (We can therefore call each "dominating position set" a "dominating position range" instead.) In conclusion, we can find the dominating position ranges efficiently via Algorithm 1, which runs in Θ(|P|).

The task ordering procedure is detailed in Algorithm 2, which runs in O(|J| log |J|). Here P̂ = {p | p ∈ P ∧ Dp ≠ ∅} = {p̂1, p̂2, . . . , p̂|P̂|}, with p̂1 < p̂2 < . . . < p̂|P̂|.


Algorithm 1 Finding Dominating Position Ranges
Input: P, E, T, Re, Rt
Output: Dp (p ∈ P), P̂

1: function cross(t0, t1, t2)
2:   return (t1.x − t0.x)(t2.y − t0.y) − (t2.x − t0.x)(t1.y − t0.y)
3: end function
4: for p ∈ P do
5:   Dp ← ∅
6: end for
7: P̂ ← ∅
8: initialize a stack S; top ← 0
9: for i ← 1 to |P| do
10:   t ← (p = pi, x = Rt T(pi), y = Re E(pi))
11:   while top ≥ 2 and cross(S[top − 1], S[top], t) ≥ 0 do
12:     top ← top − 1
13:   end while
14:   top ← top + 1
15:   S[top] ← t
16: end for
17: lb ← 1
18: for i ← 1 to top − 1 do
19:   nlb ← ⌈(S[i + 1].y − S[i].y) / (S[i].x − S[i + 1].x)⌉
20:   if lb < nlb then
21:     DS[i].p ← [lb, nlb)
22:     P̂ ← P̂ ∪ {S[i].p}
23:   end if
24:   lb ← nlb
25: end for
26: DS[top].p ← [lb, ∞)
27: P̂ ← P̂ ∪ {S[top].p}
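For reference, a direct Python transcription of Algorithm 1 (a sketch; the tuple layout and the clamping of ranges at position 1 are our own choices):

    import math

    def dominating_ranges(P, E, T, Re, Rt):
        """Sketch of Algorithm 1. Each rate p maps to the line
        f(k) = Re*E[p] + Rt*T[p]*k, i.e. the dual point (x, y) =
        (Rt*T[p], Re*E[p]); the lower envelope is the lower convex hull.
        Returns {p: (lb, ub)} with ub = None meaning infinity."""
        def cross(t0, t1, t2):
            return ((t1[1] - t0[1]) * (t2[2] - t0[2])
                    - (t2[1] - t0[1]) * (t1[2] - t0[2]))

        stack = []
        for p in sorted(P):                # increasing rate, decreasing slope x
            t = (p, Rt * T[p], Re * E[p])
            while len(stack) >= 2 and cross(stack[-2], stack[-1], t) >= 0:
                stack.pop()                # t makes the previous point useless
            stack.append(t)

        D, lb = {}, 1
        for s, t in zip(stack, stack[1:]):
            # First position where the faster rate t is no worse (Inequality 25).
            nlb = math.ceil((t[2] - s[2]) / (s[1] - t[1]))
            if lb < nlb:
                D[s[0]] = (lb, nlb)        # s dominates positions [lb, nlb)
            lb = max(lb, nlb)              # clamp: ranges never start below 1
        D[stack[-1][0]] = (lb, None)       # fastest surviving rate: [lb, inf)
        return D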

Algorithm 2 Longest Task Last
Input: J, P̂, E, T, Re, Rt
Output: The execution order O (a list of pairs of tasks and corresponding processing rates that minimize the total cost)

1: Find the dominating position ranges via Algorithm 1
2: Sort the tasks in J so that LBk is non-increasing in k
3: initialize an empty list O
4: for p ∈ P̂ in ascending order do
5:   if Dp ∩ [1, |J|] = ∅ then break
6:   for k ∈ Dp ∩ [1, |J|] in ascending order do
7:     O ← {(jBk, p)} + O // concatenate
8:   end for
9: end for
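Building on the sketch above, Algorithm 2 might be transcribed as follows (our own scaffolding; backward position k = 1 is the last slot executed, so the heaviest task lands there by Theorem 3):

    def longest_task_last(L, D):
        """L: cycle counts of the tasks; D: {rate: (lb, ub or None)} from
        dominating_ranges. Returns [(task_index, rate)] in execution order."""
        n = len(L)
        heaviest_first = sorted(range(n), key=lambda i: L[i], reverse=True)
        rate_at = {}                       # backward position -> best rate
        for p, (lb, ub) in D.items():
            for k in range(lb, n + 1 if ub is None else ub):
                if k > n:
                    break
                rate_at[k] = p
        order = [None] * n                 # backward k <-> forward n - k + 1
        for k in range(1, n + 1):
            order[n - k] = (heaviest_first[k - 1], rate_at[k])
        return order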

C. Scheduling Tasks without Deadlines on Multi-core Platforms

If all the cores in a multi-core system are of the same type, the system is called a homogeneous multi-core system. In this sub-section, we consider scheduling tasks in both homogeneous and heterogeneous multi-core systems.

The cores in a homogeneous multi-core system have the same energy consumption and time consumption functions E and T; hence, they have the same C function, as defined in Equation 12. For ease of discussion, we rewrite Equation 12 using the backward index k′ = n − k + 1, as shown in Equation 26. The rationale is that k starts counting from the beginning of the sequence, while k′ starts from the end; k − 1 is the number of tasks in front of the current task, and k′ − 1 is the number of tasks behind it.

C′(k′, pjk′) = C(n − k′ + 1, pj(n−k′+1))   (26)

First, we describe scheduling tasks in a homogeneous multi-core system with R cores. From Lemma 2, we know that C(k) is a decreasing function of k; therefore, C′(k′) is a non-decreasing function of k′. As a result, we use k′ = 1 to schedule the R heaviest tasks, i.e., those that require the highest numbers of cycles, on the R cores first. The last task on each of the R cores will be one of these tasks, which implies that they are multiplied by the smallest value C′(1), so they contribute the least to the total cost. Then, we use k′ = 2 to schedule the next R heaviest tasks on the R cores in the same manner. This round-robin technique assigns tasks so that those with larger numbers of cycles receive smaller k′, and therefore smaller C′(k′). Using a similar argument to that in Theorem 3, we derive the following theorem.

Theorem 4: A round-robin scheduling technique that assigns heavier tasks to smaller k′ yields the minimum cost in a homogeneous multi-core system.

Next, we consider a heterogeneous multi-core system in which the cores may have different energy consumption and time consumption functions E and T. We extend C′(k′) for a single-core system to C′j(k′) for a heterogeneous multi-core system, where j is the index of a core and 1 ≤ j ≤ R. Tasks are assigned to cores in a greedy fashion. First, we compute C′j(1) for 1 ≤ j ≤ R; that is, the lowest C′ value we would incur on each core if we placed a task there. We use j∗ to denote the core index where C′ is minimized. From the discussion of Theorem 4, we know that the heaviest task should be placed on core j∗. We then find the minimum among C′j(1) for j ≠ j∗ and C′j∗(2), and assign the second heaviest task to that core. This process is repeated until all the tasks have been assigned to cores. Note that C′j(k′) is independent of the task workload and can be computed in advance for all cores. In practice, we can build a minimum heap to store the C′j(k′) values under consideration. In each round, we take the minimum from the heap and add the next C′j(k′) of the core that the task was assigned to. Using a similar argument to that in Theorem 4, we can show that this greedy method yields the minimum total cost. We have the following theorem.

Theorem 5: Using a greedy scheduling algorithm to assign heavier tasks to the cores with smaller C′(k′) yields the minimum cost in a heterogeneous multi-core system.

The pseudo code of the greedy algorithm is detailed in Algorithm 3.

IV. TASK SCHEDULING IN THE ONLINE MODE

In this section, we use an example to describe task scheduling in the online mode. Then, we formally define the problem of energy-efficient task scheduling in the online mode and present our heuristic algorithm.


Algorithm 3 Workload Based Greedy
Input: tasks j1, . . . , jn; the numbers of cycles L1, . . . , Ln; the set of processing rates P; the energy consumption and time consumption functions E and T for all R cores; Re, the cost of a joule of energy; and Rt, the amount paid for every second a user has to wait.
Output: An execution sequence Sj for each of the R cores, and the processing rate that minimizes the total cost of each task.

1: Sort the tasks by Li in decreasing order. Let the new order of tasks be j1, . . . , jn.
2: Initialize a heap H with C′j(1), for j between 1 and R.
3: for each task ji do
4:   Delete the minimum element C′j∗(k∗) from heap H.
5:   Assign ji to the k∗-th position from the end of the execution sequence of core j∗, and set its processing rate to the one that minimizes and defines C′j∗(k∗).
6:   Add C′j∗(k∗ + 1) to H.
7: end for
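A heap-based sketch of Algorithm 3 (our own data layout; each core's (E_j, T_j, P_j) entry holds its per-cycle energy table, time table, and rate set):

    import heapq

    def workload_based_greedy(L, cores, Re, Rt):
        """Sketch of Algorithm 3. L: cycle counts. cores: list of
        (E_j, T_j, P_j) per core. Each core's sequence is filled from the
        back: backward slot k = 1 is the last task run on that core."""
        def slot_cost(j, k):
            Ej, Tj, Pj = cores[j]
            # Equation 20: cheapest rate for the k-th slot from the end.
            return min((Re * Ej[p] + k * Rt * Tj[p], p) for p in Pj)

        heap = []
        for j in range(len(cores)):
            cost, p = slot_cost(j, 1)
            heapq.heappush(heap, (cost, j, 1, p))

        backward = [[] for _ in cores]
        for i in sorted(range(len(L)), key=lambda i: L[i], reverse=True):
            cost, j, k, p = heapq.heappop(heap)   # cheapest open slot overall
            backward[j].append((i, p))            # task i is k-th from the end
            ncost, np = slot_cost(j, k + 1)
            heapq.heappush(heap, (ncost, j, k + 1, np))
        # Reverse each per-core list to obtain the actual execution order.
        return [list(reversed(seq)) for seq in backward]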


As mentioned in Section I, an online judging system is a good example of scheduling in the online mode. In such systems, users submit their answers or codes to the server; after some computations, the server returns the scores or indicates the correctness of the submitted programs. User requests are interactive tasks that require short response times. Recall that the response time refers to the acknowledgment of receipt of the user's data, not the time taken to return the scores. By contrast, the computations on users' data are non-interactive tasks that do not have strict deadlines. Because each non-interactive task may be submitted by a different user, performance should be measured by the completion time of each task rather than by the makespan of executing all tasks.

We formally define the problem as follows. There are two types of tasks in the online mode: interactive and non-interactive. Interactive tasks are submitted by users and have strict deadlines, so they must be completed as soon as possible. By contrast, non-interactive tasks may not have strict deadlines or response time constraints. Tasks can arrive at any time. The goal of scheduling is to assign tasks to cores and determine the processing speeds that minimize the total cost for every time interval during the execution of tasks.

We make the following assumptions about the system. (1) The system can be a homogeneous or a heterogeneous multi-core system. (2) There is an execution queue for each core; the scheduler assigns a task to a core based on our policy. (3) The execution order of tasks in the same queue can be changed. (4) A task can only be preempted by a task with higher priority. (5) Interactive tasks have higher priority than non-interactive tasks. The cost function in the online mode considers both the execution time and the energy consumption.

Note that the Workload Based Greedy algorithm could be used to redistribute all tasks to cores whenever a new task arrives; according to Theorem 5, rearranging the tasks yields the minimum cost. However, because the time and energy overhead of migrating tasks could impact performance, we need a lightweight strategy without task migration. To this end, we designed a heuristic algorithm, called Least Marginal Cost, to schedule both interactive and non-interactive tasks. The algorithm assigns each newly arrived task to a core and determines the optimal processing frequency for the task. When a new task arrives, the scheduler finds an appropriate core for it: the most appropriate core is the one that yields the lowest marginal cost if the newly arrived task is executed on that core. We apply different strategies according to the type of task.

If the newly arrived task is interactive, it must be completed as soon as possible. Thus, the scheduler chooses a core that is executing a task with lower priority, preempts that task, and executes the new task instead. After finishing the new interactive task, the scheduler resumes the preempted task. The execution order of the other tasks in the queue is unchanged. The marginal cost CjM incurred on core j is calculated as follows:

CjM = Re Li Ej(pm) + Rt Li Tj(pm) + Rt Li Tj(pm) Nj   (27)

In Equation 27, CjM denotes the marginal cost incurred if we run an interactive task on core j; Re and Rt are the amounts paid per joule of energy and per second, respectively; Li is the number of cycles needed to complete the interactive task; Ej(pm) and Tj(pm) are, respectively, the energy consumed and the time taken per cycle at the maximum frequency pm of core j; and Nj is the number of (non-interactive) tasks waiting in the queue of core j.

The Least Marginal Cost algorithm compares the marginal costs and assigns the interactive task to core j∗, where Cj∗M = min_j CjM. Note that if the cores are homogeneous, we simply choose the core with the smallest Nj.
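The core-selection step for an interactive task might look as follows (a sketch; the dictionary fields E, T, pmax, and N are our own scaffolding for Equation 27):

    def place_interactive(Li, cores, Re, Rt):
        """Pick the core minimizing Equation 27 for a task of Li cycles."""
        def marginal(core):
            pm = core["pmax"]
            # C_j^M = Li * (Re*E_j(pm) + Rt*T_j(pm)*(1 + N_j)): the task's
            # own energy and time, plus the delay imposed on N_j queued tasks.
            return Li * (Re * core["E"][pm]
                         + Rt * core["T"][pm] * (1 + core["N"]))
        return min(range(len(cores)), key=lambda j: marginal(cores[j]))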

If the new task is non-interactive, it is added to the execution queue instead of preempting the current task. The marginal cost of adding a non-interactive task to a core depends on the position of the new task in the queue. According to Theorem 3, the minimum cost is achieved when the tasks are in non-decreasing order of the number of cycles. Because the tasks in a queue are kept sorted in non-decreasing order, we can perform a binary search to find the position k for the new task; the tasks remain in non-decreasing order after the new task is added.

Least Marginal Cost chooses a core for the new non-interactive task in a greedy fashion. First, the scheduler finds kj for each core j, where kj is the position at which the task would be inserted. It then estimates the marginal cost CjM of inserting the new task at position kj. The core j∗ with the lowest marginal cost is chosen, and the new task is inserted at the kj∗-th position on core j∗. The processing frequency of each task on core j∗ is then adjusted according to C(k, pk), where k is the position of the task in the execution sequence.
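The binary-search step can rely on the standard bisect module, since each queue is kept sorted in non-decreasing cycle order; a sketch:

    import bisect

    def insertion_position(queue_cycles, L_new):
        """queue_cycles: cycle counts of the queued non-interactive tasks,
        kept in non-decreasing order (Theorem 3). Returns the index at
        which the new task must be inserted so the queue stays sorted."""
        return bisect.bisect_left(queue_cycles, L_new)

    queue = [10, 40, 40, 90]
    k = insertion_position(queue, 35)   # k == 1
    queue.insert(k, 35)                 # queue stays non-decreasing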

A. Dynamic Task Insertion and Deletion

In Section III-B, we showed how to schedule tasks without deadlines on a single-core platform when all the tasks are known at the outset. In this section, we show how to schedule tasks and compute the total cost efficiently under dynamic task insertion and deletion.

First, we define three functions:

ξ([a, b]) = Σ_{k=a}^{b} LBk   (28)
∆([a, b]) = Σ_{k=a}^{b} (k − a + 1) LBk   (29)
γ([a, b]) = Σ_{k=a}^{b} k LBk = ∆([a, b]) + (a − 1) ξ([a, b])   (30)

Then, we can rewrite the total cost C:

C = Σ_{k=1}^{n} (Re LBk E(pjBk) + k Rt LBk T(pjBk))   (31)
  = Σ_{p∈P̂} (Re E(p) ξ(Dp) + Rt T(p) γ(Dp))   (32)
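With the per-range quantities at hand, Equation 32 evaluates with one term per surviving rate; a sketch (the ξ(Dp) and γ(Dp) values are assumed to be maintained elsewhere):

    def total_cost(ranges, Re, Rt):
        """Equation 32. `ranges` holds one (E_p, T_p, xi, gamma) tuple per
        dominating position range D_p, where xi = xi(D_p), gamma = gamma(D_p)."""
        return sum(Re * Ep * xi + Rt * Tp * gamma
                   for Ep, Tp, xi, gamma in ranges)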

So if we can build a sorted data structure (sorted by LBk in descending order) that supports efficient insertion, deletion, and ξ and ∆ queries, then we can perform dynamic scheduling, i.e., dynamic task insertion and deletion, efficiently.

For any two adjacent ranges [L, M] and [M + 1, R] with ξ and ∆ known, we can compute ξ([L, R]) and ∆([L, R]) as follows; there is no need to know each individual LBk in [L, R] (associativity):

ξ([L, R]) = ξ([L, M]) + ξ([M + 1, R])   (33)
∆([L, R]) = ∆([L, M]) + ∆([M + 1, R]) + (M + 1 − L) ξ([M + 1, R])   (34)
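This combine rule is exactly what each node of the augmented search tree stores; a sketch of the merge step (the (count, ξ, ∆) tuple layout is ours):

    def merge(left, right):
        """Combine the summaries of adjacent ranges [L, M] and [M+1, R]
        (Equations 33-34). A summary is (count, xi, delta), where xi sums
        the LB_k and delta sums (k - a + 1) * LB_k over the range."""
        cnt_l, xi_l, d_l = left
        cnt_r, xi_r, d_r = right
        # Every element of the right block sits cnt_l positions further
        # from the left boundary, hence the extra cnt_l * xi_r in delta.
        return (cnt_l + cnt_r, xi_l + xi_r, d_l + d_r + cnt_l * xi_r)

    leaf = lambda LB: (1, LB, LB)   # a single task: offset 1 in its range
    print(merge(merge(leaf(5), leaf(3)), leaf(2)))   # -> (3, 10, 17)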

A single 1D range tree is a suitable data structure here. It supports insertion, deletion, and range queries for associative quantities in O(log N), where N is the number of tasks it contains. (The 1D range tree is the simplest case of a range tree; it is basically a balanced binary search tree in which each node keeps (1) the number of nodes, (2) ξ, and (3) ∆ of its subtree, which meets our requirements.) With this data structure, we can perform insertion and deletion in O(log N) and compute the total cost in O(|P̂| log N) (via Equation 32).

This result can be improved further. For each dominating position range, we keep (1) two pointers to the lower-boundary and upper-boundary nodes, (2) ξ, and (3) ∆. Once an insertion or deletion occurs, we can maintain this information in O(|P̂| + log N), because the set of tasks in each dominating position range differs by at most two elements before and after an insert/delete operation. The time complexity of insertion and deletion thus increases to O(|P̂| + log N), while the cost computation decreases to Θ(1), because C is maintained incrementally during insertion and deletion. Note that Θ(1) predecessor and successor operations are required to achieve this time complexity; we can support them by keeping predecessor and successor pointers in each node, as in a doubly linked list. In addition, the dynamic ∆ range query is no longer required, while the ξ range query (a simple range sum) is still needed.

Algorithm 4 Initialize for Single-Core Dynamic Scheduling
Input: P̂, E, T, Re, Rt
Output: D, Z, α, β, a, b, x, d, C

1: Find the dominating position ranges D via Algorithm 1
2: Initialize an empty 1D range tree Z (sorted in descending order)
3: C ← 0
4: for i ← 1 to |P̂| do
5:   αi ← NULL; βi ← NULL
6:   ai ← lowerbound(Dp̂i); bi ← ai − 1
7:   xi ← 0; di ← 0
8: end for

Algorithm 5 Insert a Task
Input: L: the number of cycles of the incoming task

1: ptr ← insert(Z, L)
2: kB ← rank(ptr)
3: i ← the index i such that kB ∈ Dp̂i
4: if kB = ai then αi ← ptr
5: if kB > bi then βi ← ptr
6: bi ← bi + 1
7: xi ← xi + L
8: di ← di + (kB − ai + 1) × ∗ptr + range_sum(Z, [kB + 1, bi])
9: while bi > upperbound(Dp̂i) do
10:   ptr ← βi
11:   di ← di − (bi − ai + 1) × ∗ptr
12:   xi ← xi − ∗ptr
13:   bi ← bi − 1
14:   βi ← predecessor(βi)
15:   i ← i + 1
16:   αi ← ptr
17:   if ai > bi then βi ← ptr
18:   bi ← bi + 1
19:   xi ← xi + ∗ptr
20:   di ← di + xi
21: end while
22: C ← Σ_{i=1}^{|P̂|} (Re E(p̂i) xi + Rt T(p̂i)(di + (ai − 1) xi))

V. EVALUATION

Our experimental platform is a quad-core x86 machine that supports individual core frequency tuning. The CPU model is Intel(R) Core(TM) i7 CPU 950 @ 3.07GHz. Each core has 12 frequency choices. The power consumption is measured by a power meter, model DW-6091. The energy consumption is the integral of the power reading over the execution period.

Because other components in the system consume energy, we first measure the power consumption of an idle machine and deduct the idle power reading from our experiment results.

The frequencies of the cores are computed according to our algorithm before each experiment. To prevent interference from the Linux OS frequency governor, we disable the automatic core frequency scaling of Linux. The automatic DVFS mechanism can be disabled by setting the content of "/sys/devices/system/cpu/cpuX/cpufreq/scaling_governor" to userspace, where "X" is the CPU number. Since the


Algorithm 6 Delete a Task
Input: ptr

1: kB ← rank(ptr)
2: let i be the maximum integer s.t. ai ≤ bi
3: while ai > kB do
4:   tptr ← αi
5:   di ← di − xi
6:   xi ← xi − ∗tptr
7:   bi ← bi − 1
8:   if ai ≤ bi then
9:     αi ← successor(αi)
10:   else
11:     αi ← NULL
12:     βi ← NULL
13:   end if
14:   i ← i − 1
15:   βi ← tptr
16:   bi ← bi + 1
17:   xi ← xi + ∗tptr
18:   di ← di + (bi − ai + 1) × ∗tptr
19: end while
20: di ← di − (kB − ai + 1) × ∗ptr − range_sum(Z, [kB + 1, bi])
21: xi ← xi − ∗ptr
22: bi ← bi − 1
23: if ai > bi then
24:   αi ← NULL
25:   βi ← NULL
26: else if αi = ptr then
27:   αi ← successor(αi)
28: else if βi = ptr then
29:   βi ← predecessor(βi)
30: end if
31: delete(ptr)
32: C ← Σ_{i=1}^{|P̂|} (Re E(p̂i) xi + Rt T(p̂i)(di + (ai − 1) xi))

automatic DVFS mechanism is then disabled, we can set the frequency of an individual core by changing the content of "/sys/devices/system/cpu/cpuX/cpufreq/scaling_setspeed". However, the frequency choices are limited to those listed in "/sys/devices/system/cpu/cpuX/cpufreq/scaling_available_frequencies". After setting the frequency of core "X", we can verify the change from "/sys/devices/system/cpu/cpuX/cpufreq/scaling_cur_freq".
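This procedure could be scripted as follows (a sketch; the paths are the standard Linux cpufreq sysfs files named above, and root privileges plus a frequency value listed in scaling_available_frequencies are assumed):

    def pin_core_frequency(cpu: int, khz: int) -> None:
        """Switch core `cpu` to the userspace governor and pin its frequency."""
        base = f"/sys/devices/system/cpu/cpu{cpu}/cpufreq"
        with open(f"{base}/scaling_governor", "w") as f:
            f.write("userspace")                  # disable automatic DVFS
        with open(f"{base}/scaling_setspeed", "w") as f:
            f.write(str(khz))                     # request the target frequency
        with open(f"{base}/scaling_cur_freq") as f:
            assert f.read().strip() == str(khz)   # verify the change took effect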

The power consumption of a core is related to its frequency and voltage. In our experiments, we assume that each frequency has a corresponding voltage: by adjusting the frequency, a core automatically changes its operating voltage, thus resulting in different power consumption.

A. Experiment Results for the Batch Mode

In this sub-section, we first verify the accuracy of the energy and cost models. Then, we evaluate the effectiveness of the Workload Based Greedy algorithm by comparing it with two existing algorithms.

TABLE I
AVERAGE EXECUTION TIMES OF THE WORKLOADS (SECONDS)

Benchmark     train input   ref input
perlbench     43.516        749.624
bzip          98.683        1297.587
gcc           1.63          552.611
mcf           17.568        397.782
gobmk         189.218       993.54
hmmer         109.44        1106.88
sjeng         224.398       1074.126
libquantum    5.146         1092.185
h264ref       218.285       1549.734
omnetpp       108.661       439.393
astar         191.073       880.951
xalancbmk     142.344       453.463

TABLE II
PARAMETERS IN BATCH MODE

pk       1.6     2.0     2.4     2.8     3.0
E(pk)    3.375   4.22    5.0     6.0     7.1
T(pk)    0.625   0.5     0.42    0.36    0.33

1) Experiment Settings: In the batch mode, we use SPEC2006int, which contains 12 benchmarks, each with a train input and a ref input; as a result, we have 24 different workloads. The computation cycles needed by each workload are estimated as follows. First, we measure the execution time by running each workload ten times on a core at the lowest frequency (1.6 GHz) and compute the average execution time. Then, we estimate the cycles needed by multiplying the average execution time by the core frequency, i.e., the number of cycles per second. Table I shows the average execution times of the workloads.
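The estimation itself is a single multiplication; for example, using the perlbench/train entry of Table I (a sketch):

    f_lowest_hz = 1.6e9       # lowest core frequency, 1.6 GHz
    avg_time_s = 43.516       # perlbench with the train input (Table I)
    L_perlbench = avg_time_s * f_lowest_hz   # ~6.96e10 cycles needed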

Table II shows the parameters used in the batch mode. Recall that E(pk) is the amount of energy in joules required to execute one cycle on a core with processing rate pk, and T(pk) is the time in seconds needed to execute one cycle at processing rate pk. To obtain the values of E(pk), we measure the power consumption of a core at 100% load for each pk, and divide the result by pk. Note that T(pk) is equal to 1/pk. We set Re and Rt accordingly: Re is the amount paid for a joule of energy, and Rt is the amount paid to a user for every second he/she has to wait for a task's execution.

2) Model Verification: To evaluate the accuracy of the energy model and the cost model, we conduct simulations and experiments. We use two frequencies, 1.6 GHz and 3.0 GHz; Re is set at 0.1 cent per joule and Rt at 0.4 cents per second. We take the 24 workloads in Table I as input tasks; each workload is executed once. The Workload Based Greedy algorithm is used to generate an optimal scheduling plan, including the execution order and frequencies, for the input workloads.

The simulator takes the average execution time of the tasks and the parameters in Table II as inputs, simulates the execution of the tasks based on the scheduling plan, and generates the energy cost, time cost, and total cost.

In the experiment, we execute the 24 workloads on the quad-core x86 machine according to the scheduling plan generated by the Workload Based Greedy algorithm, and collect the time and energy data with a power meter. Then, we convert the collected data into the cost and compare it with the simulation result.

Figure 1 compares the simulation and experimental results. The actual cost of executing the workloads on the x86 machine is about 8% higher than the simulation result. There are two possible reasons. First, even when workloads run simultaneously on different cores, they can still affect each other, e.g., by competing for the last-level cache or memory; such resource contention may cause longer execution times, and thus higher energy consumption and overall cost. Second, different tasks require different resources, such as cache or memory operations, so doubling the processing speed of a task does not guarantee exactly half the execution time. However, we deem the 8% discrepancy acceptable, because accounting for all the system-level overheads would make the scheduling problem too complicated to analyze.

Fig. 1. Comparison of the simulation and experimental results (normalized time, energy, and total cost for Sim and Exp).

3) Comparison with Other Scheduling Methods: We conducted experiments to compare the cost of our scheduling scheme with that of Opportunistic Load Balancing (OLB) [12] and Power Saving. OLB schedules a task on the core with the earliest ready-to-execute time; its main objective is to keep the cores fully utilized and finish the tasks in the shortest possible time. Power Saving, which restricts the frequency of a core to conserve energy, is widely used in the power-saving mode of mobile devices. For dynamic frequency scaling, we set the Linux frequency governor to "On-demand" for both OLB and Power Saving. If a core's loading is higher than 85%, the frequency governor increases the core's frequency to the highest available selection; if the loading is lower than the threshold, the governor reduces the processing frequency by one level. The loading of a core is measured every second.

First, we generate the scheduling plans with Workload Based Greedy, Opportunistic Load Balancing, and Power Saving. Then, we execute the plans on the experimental platform and measure their costs. In this experiment, we limit the available frequencies in Power Saving to the lower half of the CPU frequency range, i.e., 1.6, 2.0, and 2.4 GHz. As in the previous section, we set Re at 0.1 cent per joule and Rt at 0.4 cents per second.

Figure 2 shows the cost of the three scheduling plans. Workload Based Greedy consumes 46% less energy than Opportunistic Load Balancing with only a 4% slowdown in the execution time; the total cost reduction is about 27%. Compared with Power Saving, Workload Based Greedy consumes 27% less energy and improves the execution time by 13%. The main reason is that Workload Based Greedy schedules the shortest tasks first at a high processing rate, which adds as little waiting time as possible for the other tasks; it also applies slower processing rates to larger tasks, which reduces the energy consumption.

Fig. 2. Cost comparison of different scheduling methods (normalized time, energy, and total cost for WBG, OLB, and PS).

B. Experiment Results for the Online Mode

We conduct a trace-based simulation to verify the effectiveness of the Least Marginal Cost algorithm. The experimental environment is the same as that in Section V-A. The workload is a piece of the trace from Judgegirl [13], an online judging system at National Taiwan University, in which students submit their codes to solve different problems. The trace covers half an hour during a final exam comprising five problems. We treat the score-querying requests from students as interactive tasks, and the codes they submit as non-interactive tasks. There are 768 non-interactive tasks and 50525 interactive tasks in the trace.

In our trace-based simulation, the number of CPU cycles required by each task can be retrieved from the trace. In practice, we can estimate the resource requirement by profiling or from historical data. The interactive tasks in an online judging system are mostly problem choosing and score querying; we can profile the CPU cycles required to complete these kinds of tasks while building the system. On the other hand, the resource requirement of a non-interactive task strongly depends on the code submitted by the user. However, we can still predict the resource requirement of a newly arrived non-interactive task by averaging over the previously completed submissions.

We build an event-driven simulator that takes the workload trace as input. An event can be either a task arrival or a task completion. For a task arrival, the simulator decides the core and frequency for the new task according to the scheduling algorithm; for a task completion, it calculates the time and energy cost of the completed task. The total cost is the sum of the costs of all tasks in the trace.

We compare the cost of the following scheduling strategies: Opportunistic Load Balancing, On-demand, and Least Marginal Cost. Opportunistic Load Balancing (OLB) [12] schedules a task on the core with the earliest ready-to-execute time; its objective is to keep the cores fully utilized and finish the tasks in the shortest possible time, so it keeps the processing frequency of each core at the highest level. On-demand [14] is a strategy in Linux that decides the processing frequency according to the current core loading: once a core's loading reaches a predefined threshold, On-demand scales that core to its highest processing frequency; if the loading is lower than the threshold, On-demand reduces the processing frequency by one level. Since On-demand does not assign tasks to cores, we assign the arriving tasks to cores in a round-robin fashion. In OLB and On-demand, interactive tasks have higher priority than non-interactive tasks, and tasks with the same priority on a core are executed in a FIFO fashion. Least Marginal Cost is the strategy we propose. Re and Rt are set to 0.4 cents per joule and 0.1 cent per second, respectively.

Figure 3 compares the costs of the scheduling methods. Least Marginal Cost consumes 11% less energy and spends 31% less time than Opportunistic Load Balancing, for a 17% lower total cost. Similarly, Least Marginal Cost consumes 11% less energy and spends 46% less time than On-demand, for a 24% lower total cost. The results indicate that the Least Marginal Cost heuristic saves energy and reduces task waiting time compared with the existing algorithms.

Fig. 3. Cost comparison of different scheduling methods (normalized time, energy, and total cost for LMC, OLB, and OD).

VI. RELATED WORK

Dynamic Voltage and Frequency Scaling (DVFS) is a key technique for reducing a CPU's power consumption, and there have been several studies of its use, especially for applications in real-time system domains. The objective is to ensure that such applications execute in real-time systems without violating their deadline requirements, while minimizing the energy consumption. Yao et al. [4] proposed an offline optimal algorithm as well as an online algorithm with a provable competitive ratio for aperiodic real-time applications. Pillai et al. [15] presented a class of novel real-time DVS (RT-DVS) algorithms, including Cycle-conserving RT-DVS and Look-Ahead RT-DVS. Aydin et al. [5] developed an efficient solution for periodic real-time tasks with (potentially) different power consumption characteristics. However, the above works only consider single-core real-time systems, so they do not deal with the assignment of tasks to cores.

A number of recent studies of DVFS scheduling in real-time systems focus on multi-core processors. Kim et al. [3] proposed a dynamic scheduling method that incorporates a DVFS scheme for a hard real-time heterogeneous multi-core environment; the objective is to complete as many tasks as possible while using the energy efficiently. Yang et al. [16] designed a 2.371-approximation algorithm that reduces the amount of energy used to process a set of real-time tasks with common arrival times and deadlines on a chip multi-processor. Lee [17] introduced an energy-efficient heuristic that schedules real-time tasks on a lightly loaded multi-core platform. The above works focus on real-time systems in which tasks behave in a periodic or aperiodic manner.

By contrast, the tasks considered in this paper are general computation tasks with or without deadlines.

In addition to real-time systems, some studies have investigated using DVFS for other purposes. To minimize energy consumption, Bansal et al. [18] performed a comprehensive analysis and proposed an online algorithm to determine the processing speed of tasks with deadlines on a processor that supports arbitrary speeds. Pruhs et al. [19] investigated a problem setting where a fixed energy volume E is given and the goal is to minimize the total flow time of the tasks; they considered the offline scenario where all the tasks are known in advance, and showed that optimal schedules can be computed in polynomial time. Albers et al. [10] developed an approach that uses a weighted combination of the energy and flow time costs as the objective function and exploits dynamic programming to minimize it in an offline fashion. This is similar to our approach, except that Albers et al. consider unit-size tasks, whereas our tasks can be of arbitrary size. Lam et al. [11] also studied scheduling to minimize the flow time and energy usage in a dynamic speed scaling model; they devised new speed scaling functions that depend on the number of active jobs, and proposed an online scheduling algorithm for batched jobs based on the new functions. The above works focus on single processors with discrete speeds, whereas we consider both single-core and multi-core architectures with per-core DVFS.

Other energy-efficient algorithms have been proposed for multi-core platforms. Bunde [20] investigated flow time minimization in multi-processor environments with a fixed amount of energy. Aupy et al. [21] performed a comprehensive analysis of executing a task graph on a set of processors; the goal is to minimize the energy consumption while enforcing a prescribed bound on the execution time, and they considered different task graphs and energy models. In contrast to our approach, their method assumes that the mapping of tasks to cores is given.

VII. CONCLUSION

In this paper, we propose effective energy-efficient scheduling algorithms for multi-core systems with DVFS features. We consider two task execution modes: the batch mode for batches of jobs, and the online mode for a more general execution scenario in which interactive tasks with different time constraints or deadlines and non-interactive tasks may co-exist in the system. For each execution mode, we propose scheduling algorithms and demonstrate that they are effective both analytically and experimentally.
