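National Science Council, Executive Yuan: Research Project Interim Progress Report
Research on Forward-Looking Technologies for Tera-scale Chip Systems -- Subproject 1: Energy-Efficient Memory Architecture for Platform-based SoC (2/3)
Interim Progress Report (Condensed Version)
Project type: Integrated project
Project number: NSC 95-2221-E-002-360-
Project period: August 1, 2006 to July 31, 2007
Institution: Department of Computer Science and Information Engineering, National Taiwan University
Principal investigator: Chia-Lin Yang (楊佳玲)
Attachments: report on attending an international conference, and the published paper
Handling: interim report, not available for public inquiry
Date: May 30, 2007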
National Science Council, Executive Yuan: Report on Subsidized Research Project
Energy-Efficient Memory Hierarchy for Platform-based SoC
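Project type: [ ] Individual project  [x] Integrated project
Project number: NSC95-2221-E-002-360-
Project period: August 1, 2006 to July 31, 2007
Principal investigator: Chia-Lin Yang (楊佳玲), Associate Professor, Department of Computer Science and Information Engineering, National Taiwan University
Project participants: 陳依蓉, 林仲祥, 林業峻, 陳彥名, 李翰林
Institution: Department of Computer Science and Information Engineering, National Taiwan University
May 27, 2007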
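Energy-Efficient Memory Hierarchy for Platform-based SoC
Project number: NSC95-2221-E-002-360-
Project period: August 1, 2006 to July 31, 2007
Principal investigator: Chia-Lin Yang, Associate Professor, Department of Computer Science and Information Engineering, National Taiwan University

I. Chinese Abstract
As process technology advances, the energy consumed by leakage in systems-on-chip is becoming increasingly important. Caches take up a considerable share of a processor's resources, so many mechanisms have been proposed to reduce cache leakage. However, these mechanisms all cause unpredictable performance degradation and are therefore unsuitable for hard real-time applications, which must strictly obey their timing constraints. In this project, we build on existing circuit designs for reducing cache leakage and propose the first leakage control mechanism suitable for hard real-time systems. This timing-aware cache leakage reduction mechanism uses the slack time of each task to decide whether to put the cache blocks belonging to that task into a low-leakage mode, and it guarantees that each task completes within its timing constraint. Experimental data show that the proposed leakage control mechanism achieves almost the same leakage reduction as a leakage control mechanism that ignores timing constraints.
Keywords: leakage, cache, hard real-time system

English Abstract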
Leakage energy consumption is an increasingly important issue as the technology continues to shrink. Since on-chip caches constitute a major portion of the processor's transistor budget, several leakage reduction schemes have been proposed to reduce cache leakage. However, these schemes introduce performance unpredictability and are therefore not suitable for hard real-time applications, which require the timing constraint to be met in all cases. In this project, we propose the first approach to apply existing low-leakage circuit techniques to hard real-time applications. The proposed timing-aware cache leakage control mechanism exploits task slack time to turn cache lines into the low-leakage state provided that the timing constraint is met. The experimental results show that the proposed control policy achieves leakage reduction comparable to that of a leakage control policy that aggressively turns cache lines into the low-leakage mode without considering the timing constraint.
Keywords: leakage, cache, hard real-time system

II. Background and Objectives
Power consumption is becoming a critical design issue for embedded systems due to the popularity of portable devices such as cellular phones and personal digital assistants. As the technology continues to shrink, leakage power is becoming a dominant factor in overall CPU energy [13]. Leakage energy can be reduced by exploiting task idle time to shut down the CPU completely [4, 5, 9, 10] or to shut down individual microarchitectural components, for example, caches [7, 19] and branch predictors [11]. Previous work on applying shutdown techniques to hard real-time systems focuses only on turning off the CPU completely [4, 5, 9, 10]. We are not aware of any research work that applies microarchitectural leakage reduction techniques to hard real-time systems. This work is the first attempt to bridge this gap.
Since on-chip caches constitute a major portion of the processor's transistor budget and account for a significant share of leakage, we target on-chip cache leakage in this project. In fact, leakage is projected to account for 70% of the cache power budget in 70nm technology [13]. Therefore, reducing cache leakage power is important for reducing a processor's total leakage. Two types of circuit techniques have been proposed to reduce cache leakage: gated-Vdd [19] and drowsy caches [7]. The gated-Vdd technique turns off a cache line completely to save the maximum leakage power, but the loss of state means that an incorrect turn-off decision incurs a significant performance penalty. The drowsy cache technique uses a small supply voltage to retain the data in a memory cell in the low-leakage state [7, 14]. Therefore, the drowsy cache technique reduces leakage less than the gated-Vdd technique, but it incurs a much smaller penalty when a memory cell in the low-leakage state is accessed. The delay to switch a memory cell from the low-leakage state to the active state is called the wake-up overhead.
In this project, we propose the first timing-aware cache leakage control mechanism for hard real-time systems. To achieve energy savings with a hard real-time guarantee, we exploit both static and dynamic slack to tolerate the delay caused by accessing low-leakage cache lines. Unlike previous work that chooses between the drowsy cache and gated-Vdd, our scheme allows joint use of both techniques. We exploit task-level information to manage the cache lines of idle and active tasks differently. For cache lines allocated to an active task, only the drowsy cache technique is considered because the idle periods between accesses are short. These cache lines are put into the drowsy mode periodically and woken up when they are accessed. The period at which all cache lines are put into the drowsy mode is referred to as the drowsy window size. A smaller drowsy window size leads to higher leakage savings at the cost of higher wake-up overhead. Our timing-aware cache leakage control mechanism chooses the smallest drowsy window size for which the timing constraint is met. For cache lines allocated to idle tasks, we look for opportunities to turn the cache lines off completely to gain more leakage savings, as long as the penalty of fetching data from the lower level of the memory hierarchy does not violate the timing constraint.
We evaluate the proposed leakage control scheme on 8 real applications. The experimental results show that with tight deadlines, the simple policy in [7] causes a high deadline miss ratio (e.g., with 1% static slack [footnote 1], the deadline miss ratio [footnote 2] is up to 97.6%). This confirms our assertion that existing leakage reduction techniques are not suitable for hard real-time applications, and that a timing-aware leakage control scheme is a must. With 1% static slack, the proposed scheme achieves leakage reduction ranging from 78.4% to 86.9% with a hard real-time guarantee, while the simple policy achieves leakage reduction from 89.7% to 90.6% with tasks missing deadlines. This shows that the proposed scheme sacrifices some leakage savings to satisfy the timing constraint. As task slack increases, the gap in leakage savings between the proposed scheme and the simple policy shrinks, and the leakage savings of the proposed method approach those of the simple policy. With 20% static slack, our scheme achieves 1.3% more leakage savings than the simple policy. Joint use of drowsy caches and gated-Vdd also leads to more leakage savings. When the proposed scheme has opportunities to turn off the cache lines of an idle task, it achieves 2.8% more leakage reduction than the variant that uses drowsy caches only.

Footnote 1: Static slack = $1 - \sum_{i=1}^{n} W_i / P_i$, where $W_i$ and $P_i$ are the WCET and period of task $i$ among the $n$ tasks in a task set.
Footnote 2: Deadline miss ratio = $N_{\mathrm{miss\ tasks}} / N_{\mathrm{total\ tasks}}$, where $N_{\mathrm{miss\ tasks}}$ is the number of tasks that missed their deadlines, and $N_{\mathrm{total\ tasks}}$ is the total number of executed tasks.
III. Research Methods and Results
In this section, we first introduce the system model used in this project, and then we describe the proposed timing-aware leakage control policy.
1. System Model
The system consists of a task set of n periodic real-time tasks. The tasks are independent and preemptable. The task set is denoted as $T = \{\tau_1, \tau_2, \ldots, \tau_n\}$, where $\tau_i$ denotes the i-th of the n tasks. Each $\tau_i$ has its own period $P_i$ and WCET $W_i$. We assume that a task's deadline is its period. Tasks are scheduled using the EDF scheduling policy: a task with an earlier deadline gets higher priority. The scheduler has two queues: the waiting queue ($Q_{waiting}$) and the ready queue ($Q_{ready}$). The waiting queue contains the completed tasks, and the ready queue contains the running and preempted tasks. The task that is currently running is the active task, and the completed and preempted tasks are idle tasks. The schedulability of a task set is tested by the CPU utilization U, defined as

$$U = \sum_{i=1}^{n} W_i / P_i$$

If U is less than 100%, the task set is said to be schedulable.
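The utilization test can be illustrated with a minimal Python sketch. The function names and the example (WCET, period) pairs are hypothetical and are not taken from the report; exact fractions are used only to avoid rounding in the illustration.

```python
from fractions import Fraction

def utilization(tasks):
    """CPU utilization U: the sum of WCET / period over all tasks.

    `tasks` is a list of (wcet, period) pairs in the same time unit.
    """
    return sum(Fraction(wcet, period) for wcet, period in tasks)

def static_slack(tasks):
    """Static slack as a fraction of CPU time: 1 - U."""
    return 1 - utilization(tasks)

def is_schedulable(tasks):
    """Schedulability test as stated in the text: U below 100%."""
    return utilization(tasks) < 1

# Hypothetical task set: (WCET, period) in cycles.
example = [(2, 10), (3, 15), (5, 20)]
print(utilization(example), static_slack(example), is_schedulable(example))
```

For this hypothetical task set, the three tasks use 1/5, 1/5 and 1/4 of the CPU, so the remaining fraction is the static slack available for absorbing wake-up overhead.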
The baseline cache architecture, which supports the cache locking described in [11], is shown in Figure 1. The lock_ctrl signal indicates whether a cache line can be replaced. We select the instructions to be locked in the instruction cache using the locking algorithm described in [11]. Each cache line is associated with leakage mode bits that select its supply voltage. A cache line can be put into either the drowsy mode or the state-destructive mode (i.e., using the gated-Vdd circuit). We use the terms drowsy mode and state-preserving mode interchangeably in this report.
Figure 1. Baseline cache architecture of the proposed scheme. (Labels in the figure: tag array, data array, decoders, leakage mode bits, voltage controllers, address with tag and index fields, hit, lock_ctrl.)
2. Timing-aware leakage control
The objective of the proposed leakage management scheme is to determine the drowsy window size for active tasks and the leakage mode for idle tasks, provided that the timing constraint is not violated. The details of the proposed scheme are as follows.
2.1 Leakage Control Scheme for Active Tasks
The leakage control scheme for active tasks is based on the Drowsy+Simple policy proposed in [7]. Unlike Drowsy+Simple in [7], which uses a fixed drowsy window size, the proposed leakage control scheme for active tasks adjusts the drowsy window size dynamically with a hard real-time guarantee. The drowsy window size affects both the leakage savings and the performance overhead caused by waking up drowsy cache lines. With a shorter window, cache lines are set to the drowsy mode more frequently, which achieves higher leakage reduction but also causes higher wake-up overhead. As illustrated in Figure 2, to meet the timing constraint, the total wake-up overhead cannot exceed a task's slack. Therefore, our leakage control scheme chooses the smallest drowsy window size for which the timing constraint is met, that is, for which the wake-up overhead of all drowsy windows does not exceed the total slack time. The slack time of a task comes from two sources. One is static slack, which is computed from the WCET. The other is dynamic slack, which is due to variations in task execution time. The leakage control scheme for active tasks consists of an off-line phase and an on-line phase. Below we describe the two phases in detail.
Figure 2. Illustration of using wake-up overheads to consume task slack time. (The figure compares one period with and without leakage control; labels: full speed execution, wake-up overhead, drowsy window, slack time, period.)
2.1.1 Off-line Phase
Static Slack Allocation
We first allocate static slack to tasks statically based on their worst case preemption rates. According to the run-time slack reclamation algorithm described in the next section, the additional run-time slack of a low priority task is less likely to be used by other tasks. Therefore, to increase the total CPU utilization, we allocate static slacks to tasks with higher priorities.
Assume that for all i, j, if i < j, then $P_i < P_j$. The worst-case number of preemptions $PN(\tau_k)$ of a task $\tau_k$ is

$$PN(\tau_k) = \sum_{i=1}^{k-1} P_k / P_i$$

The static slack time $\rho_k$ allocated to a task $\tau_k$ is

$$\rho_k = P_k \times (1 - U) \times \frac{PN(\tau_k)^{-1}}{\sum_{i=1}^{n} 1 / PN(\tau_i)}$$

Worst Case Active Set Analysis
To estimate the performance overhead of activating drowsy cache lines in a drowsy window, we need to predict the number of cache accesses in a drowsy window. In the worst case, the cache lines accessed in a drowsy window are all the cache lines that could be accessed in the future. To obtain this information, we first construct the CFG (Control Flow Graph) of the program. In the CFG, each node represents a basic block, and an edge from node a to node b indicates that an execution path exits from basic block a to basic block b.

Figure 3. Example of the CFG for the worst case active set analysis.
Contents of the figure:
B1, B6, B7: normal basic blocks.
B2, B3, B4, B5: merged as one basic block since they are in a loop.
L(Bi): number of locked cache lines touched by Bi.
AS(Bi): active set size of basic block Bi.
AS(Bi) = max{L(Bi) + AS(Bj)}, where Bj is a child of Bi.
L(B1)=3, L(B2)=2, L(B3)=3, L(B4)=1, L(B5)=2, L(B6)=4, L(B7)=5
AS(B7) = 5
AS(B6) = 4+5 = 9
AS(B2) = 7
AS(B1) = max(3+7, 3+9) = 12
Figure 3 shows an example of the CFG, and the worst case active set (WCAS) analysis is performed on the CFG. In Figure 3, each node is associated with L(Bi), which is the number of locked cache lines in basic block Bi. The WCAS size of each node Bi, denoted by AS(Bi), is the maximal number of locked cache lines that could be accessed from Bi. Therefore, AS(Bi) is calculated by

$$AS(B_i) = \max_{B_j \in child(B_i)} \{ L(B_i) + AS(B_j) \}$$
WCAS analysis is performed at compile time. To convey the WCAS size to the cache controller, which performs the leakage control, we use a store instruction to write the WCAS size to the cache controller, and the cache controller triggers drowsy window resizing on receiving a WCAS size. To prevent frequent drowsy window resizing, we merge the basic blocks of a loop into one and insert the store instruction at the loop entry point. As shown in Figure 3, B2, B3, B4 and B5 form a loop, and the active set size information is recorded on B2 only.
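The WCAS recurrence can be checked against the Figure 3 numbers with a small Python sketch. Several parts are assumptions made for illustration: the edge list is inferred from the arithmetic shown in the figure (B1 has children B2 and B6, and B6 has child B7), the merged loop node B2 is given the weight 7 reported as AS(B2), and a block without children is assigned AS(B) = L(B).

```python
# Assumed CFG after loop merging; "B2" stands for the merged loop B2..B5.
CHILDREN = {"B1": ["B2", "B6"], "B2": [], "B6": ["B7"], "B7": []}

# Locked cache lines touched by each node. The value for the merged node
# B2 is an assumption taken from AS(B2) = 7 in Figure 3.
LOCKED = {"B1": 3, "B2": 7, "B6": 4, "B7": 5}

def wcas(block, children=CHILDREN, locked=LOCKED):
    """Worst case active set size of `block` on an acyclic (merged) CFG."""
    kids = children[block]
    if not kids:
        # Assumed base case: a leaf touches only its own locked lines.
        return locked[block]
    return max(locked[block] + wcas(kid, children, locked) for kid in kids)

for name in ("B7", "B6", "B2", "B1"):
    print(name, wcas(name))
```

Because loops are merged into single nodes before the analysis, the recursion runs on an acyclic graph and terminates without any cycle detection.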
2.1.2 On-line Phase
Dynamic Slack Reclamation
Dynamic slack comes from variations in task execution time, and it is collected by the OS when a context switch occurs. The dynamic slack reclamation process is similar to the one proposed in [15]. Before detailing dynamic slack reclamation, we first define five notations:
$U_i^{CPU}$: the unused CPU budget of $\tau_i$.
$W_i^{rem}$: the remaining WCET of $\tau_i$.
$S_i$: the slack time of $\tau_i$.
$E_i$: the execution time of $\tau_i$.
DS: the dynamic slack time.
When a task arrives (i.e., is removed from the waiting queue), $U_i^{CPU}$ and $W_i^{rem}$ are initialized to (WCET + static slack) and WCET, respectively. During the execution of $\tau_i$, $U_i^{CPU}$ is consumed and $W_i^{rem}$ decreases. $W_i^{rem}$ is updated by the cache controller during task execution. Since the wake-up overhead of drowsy cache lines is not accounted for in the WCET, $W_i^{rem}$ is decreased by one in every cycle in which there is no drowsy cache hit. Note that, unlike [15], we do not claim the slack time of preempted tasks. In our scheme, a preempted task can use its slack to turn its cache lines into the low-leakage mode during the idle period. When $\tau_i$ is preempted or completes, we first consume the dynamic slack (DS) from the unused CPU budget of the tasks in $Q_{waiting}$ with earlier deadlines. Then we update $U_i^{CPU}$ of task $\tau_i$. DS is estimated by the following equation:

$$DS = \sum_{\tau_k \in Q_{waiting}} U_k^{CPU}$$

If DS is greater than $E_i$, $U_i^{CPU}$ is not consumed. Otherwise, the CPU budget is updated using the following formula:

$$U_i^{CPU} = U_i^{CPU} - (E_i - DS)$$

Therefore, the slack time that a task can use to compensate for the wake-up overheads is:

$$S_i = (U_i^{CPU} - W_i^{rem}) + DS$$
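The three bookkeeping formulas of dynamic slack reclamation can be sketched in Python as follows. This is only an illustration of the arithmetic under stated assumptions: the function names are hypothetical, all quantities are in cycles, and how the OS deducts the consumed dynamic slack from the donor tasks is not modeled because the text does not specify it.

```python
def dynamic_slack(waiting_budgets):
    """DS: sum of the unused CPU budgets of the tasks in the waiting
    queue that have earlier deadlines (passed in by the caller)."""
    return sum(waiting_budgets)

def update_budget(budget, exec_time, ds):
    """Update U_i^CPU when task i is preempted or completes.

    If DS exceeds the execution time E_i, the task's own budget is not
    consumed; otherwise it is charged for the part DS did not cover.
    """
    if ds > exec_time:
        return budget
    return budget - (exec_time - ds)

def available_slack(budget, remaining_wcet, ds):
    """S_i = (U_i^CPU - W_i^rem) + DS."""
    return (budget - remaining_wcet) + ds

# Hypothetical numbers (cycles).
ds = dynamic_slack([30, 20])
budget = update_budget(1000, 200, ds)
print(ds, budget, available_slack(budget, 700, ds))
```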
Drowsy Window Resizing
The process of drowsy window resizing is to decide the smallest drowsy window size such that the timing constraint is met. Drowsy window resizing is performed when a context switch occurs or when the active set changes. To decide the drowsy window size of the scheduled task, we have to find the smallest drowsy window size with the wake-up overhead that is not larger than the task’s available slack. Therefore, the drowsy window size is the smallest window size that satisfies the following inequality:
$$\frac{W_i^{rem}}{wsize} \times S_{active}(i) \times OH < S_i \qquad (1)$$

where wsize denotes the window size, $S_{active}(i)$ denotes the WCAS size of task $\tau_i$, and OH denotes the number of cycles to wake up a drowsy cache line.
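A minimal Python sketch of the window selection implied by inequality (1) is given below. The set of candidate window sizes, the function name, and the numbers in the example are assumptions for illustration; the report does not state which window sizes the cache controller can choose from or how the search is implemented.

```python
def smallest_drowsy_window(w_rem, active_set, wake_cycles, slack, candidates):
    """Return the smallest candidate window size whose estimated total
    wake-up overhead stays below the available slack, or None.

    Estimated overhead, as in inequality (1):
        (w_rem / wsize) * active_set * wake_cycles < slack
    """
    for wsize in sorted(candidates):
        overhead = (w_rem / wsize) * active_set * wake_cycles
        if overhead < slack:
            return wsize
    return None  # no candidate fits: keep the lines awake

# Hypothetical example: 10000 cycles of remaining WCET, a WCAS of 12
# lines, 2 cycles per wake-up, and 500 cycles of available slack.
print(smallest_drowsy_window(10000, 12, 2, 500, [64, 128, 256, 512, 1024]))
```

Since the estimated overhead shrinks as the window grows, scanning the candidates in ascending order returns the window with the highest leakage savings that still respects the slack.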
2.2 Leakage Control Scheme for Idle Tasks
For idle tasks, we can turn their cache lines into the state-preserving or the state-destructive mode depending on the length of the idle period and the slack time. The leakage control for idle tasks is performed by the OS when a context switch occurs. The slack $S_i$ and idle period $I_i$ of a completed or preempted task are given below:
Completed tasks: $S_i = \rho_i$, $I_i = T_{arrive}(\tau_i) - T_{enter\_q}(\tau_i)$
Preempted tasks: $S_i = U_i^{CPU} - W_i^{rem}$, $I_i = BCET(\tau_{curr})$
where $BCET(\tau_{curr})$ is the best case execution time of the current active task, $T_{arrive}(\tau_i)$ is the next arrival time of $\tau_i$, and $T_{enter\_q}(\tau_i)$ is the time at which $\tau_i$ enters the waiting queue.
To decide the leakage mode of an idle task, we need to evaluate the performance overhead $P_{overhead}(M_i)$ and the energy overhead $E_{overhead}(M_i)$ of a low-leakage mode $M_i$, where $M_i$ is either the drowsy or the state-destructive mode. $P_{overhead}(M_i)$ and $E_{overhead}(M_i)$ are:

$$P_{overhead}(M_i) = N_{wake} \times D_{wake}(M_i)$$
$$E_{overhead}(M_i) = N_{wake} \times E_{wake}(M_i)$$

where $N_{wake}$ denotes the number of times cache lines in the low-leakage mode are woken up, and $D_{wake}(M_i)$ and $E_{wake}(M_i)$ denote the delay and energy overhead of waking up a cache line in low-leakage mode $M_i$. For the state-preserving mode, the wake-up delay is 2 cycles because both the tag and data arrays are put into the drowsy mode, and the wake-up energy is the energy required to charge a drowsy cache line from the drowsy state to the active state. For the state-destructive mode, the wake-up overhead is the latency and energy of accessing the next level of the memory hierarchy. To turn an idle task's cache lines into a low-leakage mode $M_i$, the task must satisfy
(1) $P_{overhead}(M_i) \le S_{idle}$, and
(2) $E_{overhead}(M_i) \le E_{leak\ reduction}(M_i)$
where $E_{leak\ reduction}(M_i)$ denotes the leakage reduction obtained by applying low-leakage mode $M_i$, and is derived from the following formula:

$$E_{leak\ reduction}(M_i) = (E_{leak\ active}(M_i) - E_{leak\ low}(M_i)) \times I_{idle} - E_{overhead}(M_i)$$

where $E_{leak\ active}(M_i)$ and $E_{leak\ low}(M_i)$ denote the leakage energy of cache lines in the active mode and in low-leakage mode $M_i$, respectively, and $I_{idle}$ is the idle length of the idle task.
To determine the leakage mode of idle tasks, we evaluate the performance overhead and leakage reduction achieved by both the gated-Vdd and drowsy cache circuits. We choose the low-leakage mode with the most leakage reduction while meeting the timing constraint as the leakage mode of an idle task.
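The mode selection for an idle task can be sketched in Python as follows. The class, the function names, and all numeric parameters are hypothetical and chosen only to exercise the two conditions; in particular, the per-cycle leakage values and wake-up costs are not measurements from the report.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

@dataclass(frozen=True)
class LeakageMode:
    name: str
    wake_delay: float   # D_wake(M): cycles per wake-up
    wake_energy: float  # E_wake(M): energy per wake-up
    leak_active: float  # leakage energy per cycle in the active state
    leak_low: float     # leakage energy per cycle in this mode

def leak_reduction(mode: LeakageMode, n_wake: int, idle_len: float) -> float:
    """E_leak_reduction(M) = (E_active - E_low) * I_idle - E_overhead(M)."""
    e_overhead = n_wake * mode.wake_energy
    return (mode.leak_active - mode.leak_low) * idle_len - e_overhead

def choose_idle_mode(modes: Sequence[LeakageMode], n_wake: int,
                     slack: float, idle_len: float) -> Optional[LeakageMode]:
    """Pick the feasible mode with the largest leakage reduction.

    A mode is feasible when (1) its wake-up delay fits in the slack and
    (2) its wake-up energy does not exceed its leakage reduction.
    """
    best, best_gain = None, None
    for mode in modes:
        p_overhead = n_wake * mode.wake_delay
        e_overhead = n_wake * mode.wake_energy
        gain = leak_reduction(mode, n_wake, idle_len)
        if p_overhead <= slack and e_overhead <= gain:
            if best is None or gain > best_gain:
                best, best_gain = mode, gain
    return best

# Hypothetical parameters, in arbitrary energy units.
DROWSY = LeakageMode("drowsy", wake_delay=2, wake_energy=1, leak_active=10, leak_low=2)
GATED = LeakageMode("gated-Vdd", wake_delay=16, wake_energy=10, leak_active=10, leak_low=0)

chosen = choose_idle_mode([DROWSY, GATED], n_wake=100, slack=2000, idle_len=1000)
print(chosen.name if chosen else "stay active")
```

With these made-up numbers, ample slack lets the state-destructive mode win, a tighter slack falls back to the drowsy mode, and a very short idle period leaves the lines active because neither mode recovers its wake-up energy.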
3. Experimental Results
For cache leakage evaluation, we use the HotLeakage tool set [23]. HotLeakage is built on the Wattch [3] tool set. It explicitly models the effects of temperature, voltage, and parameter variations, and it can recalculate leakage currents dynamically as temperature and voltage change at runtime due to operating conditions. To simulate multi-tasking workloads, we modified HotLeakage to allow multiple programs to execute simultaneously. We also implemented the EDF scheduler. In our experiments, cache locking is performed on the L1 I-cache. Since we also put the tags into the drowsy mode, the performance overhead of accessing a drowsy line is set to 2 cycles according to [16].
Table 1. Simulated architecture parameters.
Processor Core
  Instruction window:   16-RUU, 16-LSQ
  Issue width:          1 instruction per cycle, in-order issue
  Functional units:     4 IntALU, 1 IntMult/Div, 1 FPALU, 1 FPMult/Div
Memory Hierarchy
  L1 I-cache:           size 8KB, 2-way, 16B block size
  L2 cache:             size 32KB, 4-way, 32B block size, 8-cycle access latency
  Memory:               8-cycle access latency
Energy Parameters
  Processor technology: 70 nm (0.07 µm)
  Supply voltage:       0.9 V
  Temperature:          353 K
Table 2. Task set characterization.
Name      Description                                       Code size (bytes)  WCET (cycles)
Small task set (total code size 7608 bytes)
Jfdctint  JPEG integer implementation of the forward DCT    3296               19087
Crc       Cyclic redundancy code example program            1400               142088
Ludcmp    Linear equations by LU decomposition              2336               16607
Matmult   Matrix multiplication                             576                12555
Medium task set (total code size 9192 bytes)
Qurt      Computation of roots of quadratic equations       1200               4038
Minver    Matrix inversion                                  3656               11281
Jfdctint  JPEG integer implementation of the forward DCT    3296               18969
Fft1      FFT, Cooley-Tukey algorithm                       1040               8685
The detailed processor and memory hierarchy parameters are shown in Table 1. We implement two leakage control mechanisms: the Drowsy+Simple scheme proposed in [7], and the proposed timing-aware leakage control scheme (TALC). For the Drowsy+Simple scheme, we determined the drowsy window size through exhaustive simulations and chose the one that is best on average, 1000 cycles [7]. The cache lines allocated to idle tasks are turned into the drowsy mode immediately when a context switch occurs. The benchmarks used in this work are from the SNU real-time benchmark suite [1]. The benchmark programs are C sources collected from numerical calculation programs and DSP algorithms. We mix multiple applications together to form two multi-tasking workloads, the small task set and the medium task set. Details of the workloads are listed in Table 2. The WCET of each task is measured with cache locking. To generate varying execution times, we use a method similar to [8]. We assume that the BCET of a task is a percentage of its WCET. In our experiments, the BCET/WCET ratio is set to 0.95. The execution time of each task instance is generated from a normal distribution with mean μ = (WCET + BCET)/2 and standard deviation ρ = (WCET − BCET)/6. A task instance is forced to terminate once its execution time has expired.
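The execution time generation can be sketched with Python's standard library. The function name and its parameters are hypothetical. Clamping each sample to the interval [BCET, WCET] is an added assumption that keeps the rare samples more than three standard deviations from the mean inside the valid range; the report does not say how such samples were handled.

```python
import random

def sample_execution_time(wcet, bcet_ratio=0.95, rng=random):
    """Draw one task-instance execution time (in cycles).

    Mean is (WCET + BCET) / 2 and standard deviation is
    (WCET - BCET) / 6, following the setup described in the text.
    """
    bcet = bcet_ratio * wcet
    mean = (wcet + bcet) / 2
    std_dev = (wcet - bcet) / 6
    sample = rng.gauss(mean, std_dev)
    # Assumption: clamp outliers so that BCET <= sample <= WCET.
    return min(max(sample, bcet), wcet)

# Usage with the Jfdctint WCET from Table 2 and a seeded generator.
generator = random.Random(0)
times = [sample_execution_time(19087, rng=generator) for _ in range(5)]
print(times)
```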
Figure 4. Deadline miss ratio of Drowsy+Simple. (Horizontal axis: static slack, 1% to 5%; vertical axis: miss ratio, 0% to 100%; series: small task set, medium task set.)
We first show the deadline miss ratio of the Drowsy+Simple scheme to demonstrate the importance of designing a timing-aware leakage control algorithm. We adjust the period of each task to achieve 1%, 2%, 3%, 4% and 5% static slack. Figure 4 shows the ratio of tasks missing deadlines for the different static slack values. For the small task set, the miss ratio is 86.3% and 0.4% when the static slack is 1% and 2%, respectively. For the medium task set, the miss ratio is up to 97.9% and 95.6% when the static slack is 1% and 2%, respectively. Drowsy+Simple has a higher miss ratio with the medium task set than with the small task set. The medium task set has a larger total code size and more instructions locked in the cache than the small task set. Therefore, the Drowsy+Simple scheme incurs more performance degradation with the medium task set than with the small task set. Although Drowsy+Simple misses deadlines only when the schedule is tight, this is still not acceptable for a hard real-time system, which must always meet the timing constraint. This confirms our assertion that existing leakage reduction techniques are not suitable for hard real-time applications. Our timing-aware leakage control algorithm is guaranteed to meet the timing constraints, and its miss ratio is zero in all cases.
Figure 5. Evaluation of leakage reduction. (Horizontal axis: static slack, 1% to 20%; vertical axis: leakage savings, 0% to 100%; series: Drowsy+Simple, small; TALC, small; Drowsy+Simple, medium; TALC, medium.)
Figure 5 compares the energy savings achieved by our TALC scheme and by the Drowsy+Simple mechanism with 1%, 5%, 10%, 15% and 20% static slack. Note that, for a fair comparison, in this set of experiments the TALC scheme turns the cache lines of idle tasks into the drowsy mode only. We show the experimental results for the small and medium task sets separately. When the static slack is 1%, where Drowsy+Simple has 86.3% and 97.6% of tasks missing their deadlines with the small and the medium task set, the TALC scheme achieves less energy savings than Drowsy+Simple in order to satisfy the timing constraint. From Figure 5, we also observe that TALC achieves less leakage reduction with the medium task set than with the small task set. Since TALC assumes the worst case active set for drowsy window resizing, it can overestimate the wake-up delay. The overestimation is more serious for the medium task set than for the small one, since the medium task set has a larger code size and a longer worst-case execution path. A more precise active set analysis scheme could help alleviate this problem. We leave this as future work. As slack time increases, the energy savings achieved by TALC approach those of Drowsy+Simple. With 20% static slack, the proposed scheme has 1.1% and 1.3% more leakage savings than Drowsy+Simple for the small and medium task sets, respectively. This energy advantage of TALC over Drowsy+Simple comes from run-time drowsy window resizing. With 20% static slack for the small task set, the window size ranges from 13 cycles to 979 cycles, while Drowsy+Simple fixes the window size at 1000 cycles.
Table 3. Leakage savings of TALC-drowsy and TALC-dual.
Static slack  TALC-drowsy  TALC-dual  Difference
20%           90.9%        93.3%      2.4%
30%           91.7%        94.2%      2.5%
40%           92.9%        95.6%      2.7%
50%           93.9%        96.6%      2.7%
60%           94.2%        97.0%      2.8%
To evaluate the effect of turning off the cache lines of idle tasks completely, we create a new task set that has idle periods long enough for the state-destructive mode. To lengthen the idle period, we can increase both static and dynamic slack. To increase static slack, we use 20%, 30%, 40%, 50% and 60% static slack in this set of experiments. To increase dynamic slack, we prolong a task's WCET by increasing the number of iterations executed by the task's major subroutines on the worst-case execution path. The BCET/WCET ratio remains 0.95 as in the original setup, so a task's dynamic slack increases as its WCET is prolonged. The experimental results for this new task set are shown in Table 3. In Table 3, TALC-drowsy denotes the TALC scheme with the drowsy mode only, and TALC-dual denotes the TALC scheme with both the drowsy mode and the state-destructive mode. The results show that turning off the cache lines of an idle task completely achieves up to 2.8% more leakage savings than TALC-drowsy.
IV. Conclusion
In this project, we propose a timing-aware cache leakage control scheme for hard real-time systems. The basic idea of the proposed algorithm is to use system slack to absorb the performance overhead caused by activating drowsy cache lines. The proposed scheme manages the cache lines of active and idle tasks differently. The objective of the proposed leakage management method is to determine the drowsy window size for the active task and the leakage mode for idle tasks, provided that the timing constraints are not violated. Experimental results show that, although our scheme achieves less leakage savings than Drowsy+Simple under a tight schedule, our scheme guarantees that the timing constraint is met in all cases, while Drowsy+Simple has tasks that miss deadlines. As task slack increases, the gap between the leakage savings of our scheme and those of Drowsy+Simple shrinks. With 20% static slack, our scheme even achieves 1.3% more leakage savings than Drowsy+Simple. This energy advantage comes from run-time drowsy window resizing. With a task set that offers opportunities to put the cache lines of idle tasks into the state-destructive mode, the proposed scheme achieves 2.8% more leakage savings than the proposed scheme with the drowsy mode only.
This research work has been submitted to the 2007 International Conference on Compilers, Architecture and Synthesis for Embedded Systems (CASES 2007).
V. References
1. SNU real-time benchmarks. http://archi.snu.ac.kr/realtime/benchmark/index.html.
2. ARM946E-S. http://www.samsung.com/products/semiconductor/asic/i pcorelibrary/intellectureproperties/processorcores/armco res/ddi0201 a946es.pdf.
3. D. Brooks, V. Tiwari, and M. Martonosi. Wattch: A framework for architectural-level power analysis and optimizations. In Proceedings of the 27th Annual International Symposium on Computer Architecture (ISCA '00), 2000.
4. J.-J. Chen, H.-R. Hsu, and T.-W. Kuo. Leakage-aware energy-efficient scheduling of real-time tasks in multiprocessor systems. In Proceedings of the 12th IEEE Real-Time and Embedded Technology and Applications Symposium (RTAS '06), 2006.
5. J.-J. Chen and T.-W. Kuo. Procrastination for leakage-aware rate-monotonic scheduling on a dynamic voltage scaling processor. In Proceedings of the Conference on Languages, Compilers, and Tools for Embedded Systems (LCTES '06), 2006.
6. A. Cortex-R4F. http://www.arm.com/pdfs/cortex-r4f
7. K. Flautner, N. S. Kim, S. Martin, D. Blaauw, and T. Mudge. Drowsy caches: Simple techniques for reducing leakage power. In Proceedings of the 29th Annual International Symposium on Computer Architecture (ISCA '02), 2002.
8. R. Jejurikar and R. Gupta. Integrating preemption threshold scheduling and dynamic voltage scaling for energy efficient real-time systems. In RTCSA, 2004.
9. R. Jejurikar and R. Gupta. Dynamic slack reclamation with procrastination scheduling in real-time embedded systems. In Proceedings of the 42nd Annual Conference on Design Automation, 2005.
10. R. Jejurikar, C. Pereira, and R. Gupta. Leakage aware dynamic voltage scaling for real-time embedded systems. In Proceedings of the 41st Design Automation Conference (DAC '04), 2004.
11. P. Juang, K. Skadron, M. Martonosi, Z. Hu, D. W. Clark, P. W. Diodato, and S. Kaxiras. Implementing branch-predictor decay using quasi-static memory cells. ACM Transactions on Architecture and Code Optimization (TACO), 1.
12. S. Kaxiras, Z. Hu, and M. Martonosi. Cache decay: Exploiting generational behavior to reduce cache leakage power. In Proceedings of the 28th Annual International Symposium on Computer Architecture (ISCA '01), 2001.
13. N. S. Kim, T. Austin, D. Blaauw, T. Mudge, K. Flautner, J. S. Hu, M. J. Irwin, M. Kandemir, and V. Narayanan. Leakage current: Moore's law meets static power. IEEE Computer, 36.
14. N. S. Kim, K. Flautner, D. Blaauw, and T. Mudge. Drowsy instruction caches: Leakage power reduction using dynamic voltage scaling and cache sub-bank prediction. In Micro-35, 2002.
15. W. Kim, J. Kim, and S. Min. A dynamic voltage scaling algorithm for dynamic-priority hard real-time systems using slack time analysis. In Proceedings of the Conference on Design, Automation and Test in Europe (DATE '02), 2002.
16. Y. Li, D. Parikh, and Y. Zhang. State-preserving vs. non-state-preserving leakage control in caches. In Proceedings of the Design, Automation and Test in Europe Conference and Exhibition, 2004.
17. S. Martin, K. Flautner, T. Mudge, and D. Blaauw. Combined dynamic voltage scaling and adaptive body biasing for lower power microprocessor under dynamic workloads. In ICCAD, 2002.
18. L. Niu and G. Quan. Reducing both dynamic and leakage energy consumption for hard real-time systems. In Proceedings of the 2004 International Conference on Compilers, Architecture, and Synthesis for Embedded Systems (CASES '04), 2004.
19. M. Powell, S.-H. Yang, B. Falsafi, K. Roy, and T. N. Vijaykumar. Gated-Vdd: A circuit technique to reduce leakage in deep-submicron cache memories. In Proceedings of the 2000 International Symposium on Low Power Electronics and Design (ISLPED '00), 2000.
20. I. Puaut and D. Decotigny. Low-complexity algorithms for static cache locking in multitasking hard real-time systems. In Proceedings of the 23rd IEEE Real-Time Systems Symposium (RTSS '02), 2002.
21. S.-H. Yang, B. Falsafi, M. D. Powell, K. Roy, and T. N. Vijaykumar. An integrated circuit/architecture approach to reducing leakage in deep-submicron high-performance i-caches. In Proceedings of the 7th International Symposium on High-Performance Computer Architecture (HPCA '01), 2001.
22. W. Zhang and J. S. Hu. Compiler-directed instruction cache leakage optimization. In Proceedings of the 35th Annual International Symposium on Microarchitecture (MICRO '02), 2002.
and M. Stan. Hotleakage: A temperature-aware model of subthreshold and gate leakage for architects.
出席國際學術會議報告
報 告 人 姓 名 楊佳玲 服 務 機 構 及 職 稱 台 灣 大 學 資 工 系 副教授 會 議 時 間 地 點 Nice, France April 16 – 20, 2007會議名稱 Design Automation & Test in Europe
發表論文題目 Energy-Efficient Real-Time Task Scheduling with Task Rejection
一、參加會議經過
2007 Design Automaton and Test in Europe 於 2007/4/16 ~ 2007/4/20 於法國尼斯舉 行。Date 乃為歐洲最大之 EDA (Electronics and Design Automation)會議, 與會人數眾 多, 此次會議共含 11 session 及 5 tutorials 。本人於 Session 11: Real-Time Methodology 發表論文 “Energy-Efficient Real-Time Task Scheduling with Task Rejection” ,此篇論文 之發表,於會中廣獲好評,於 paper presentation 之後,與多位學者進行深入之討論。 本 人除參與論文發表外,並和與會研究人員進行意見交流,獲益良多。
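II. Observations
1. System-level design remains a major research topic in SoC design methodology. Both keynotes at this conference pointed out its importance.
2. Process variation is an important design factor in nanometer technologies. Several papers at this conference discussed the impact of process variation on architectural design.
3. System-wide power management for future ubiquitous communication devices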
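III. Materials Brought Back
One CD-ROM of the DATE 2007 conference proceedings.
IV. Closing Remarks
I am very grateful to the National Science Council for the support that made this trip possible. It also gave us the opportunity to exchange experience on the development of and research in low power embedded system design with scholars abroad working in the same field.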
Energy-Efficient Real-Time Task Scheduling with Task Rejection
∗Jian-Jia Chen, Tei-Wei Kuo, Chia-Lin Yang Department of Computer Science and
Information Engineering
National Taiwan University Taipei, Taiwan.
Email:{r90079, ktw, yangc}@csie.ntu.edu.tw
Ku-Jei King xSeries Development
IBM Systems Technology Group (STG) Email: [email protected]
Abstract
In the past decade, energy efficiency has been an important system design issue in both hardware and software management. For mobile applications with critical missions, both energy consumption reduction and timing guarantees have to be provided by system engineers to extend operation duration and maintain system stability. This research explores real-time systems composed of homogeneous multiple processors with the capability of dynamic voltage scaling (DVS), in which a given task can be rejected with a specified value of rejection penalty. The objective is to minimize the summation of the total rejection penalty for the tasks that are not completed in time and the energy consumption of the system. This study provides analysis to show that there does not exist any polynomial-time approximation algorithm for the studied problem, unless $\mathcal{P} = \mathcal{NP}$. Moreover, we propose algorithms for systems with ideal and non-ideal DVS processors. The capability of the proposed algorithms is demonstrated by extensive evaluations. The evaluation results reveal that our proposed algorithms could derive effective solutions of the energy-efficient scheduling problem with task rejection considerations.
Keywords: Energy-Efficient Scheduling, Task Rejection,
Real-Time Task Scheduling.
1. Introduction
Along with the low-power demands in electronic circuit designs, a modern processor can now operate at different supply voltages to balance its power consumption and performance. Different supply voltages lead to different execution speeds on a dynamic voltage scaling (DVS) processor. Well-known DVS processors for embedded systems are the Intel StrongARM SA1100 processor [17] and Intel XScale [18]. Moreover, technologies such as Intel SpeedStep® and AMD PowerNOW!™ provide dynamic voltage scaling for laptops to prolong the battery lifetime.
In the past decade, energy-efficient designs have received a lot of attention in industry and academia. For systems with real-time demands, energy-efficient task scheduling has been studied to minimize the energy consumption with timing guarantees, especially for uniprocessor systems with DVS support. Due to the convexity of the power consumption function, implementations in multiprocessor systems are often more energy-efficient [2]. Moreover, since many chip makers, such as Intel and AMD, are releasing multi-core chips, multiprocessor energy-efficient scheduling is becoming more and more important. Various heuristics were proposed for energy consumption minimization under different task models in multiprocessor environments, e.g., [1, 4–7, 15, 19] for independent real-time tasks and [9, 20] for real-time tasks with precedence constraints.
Due to the increase of leakage power consumption in technology, researchers have started exploring energy-efficient scheduling with the considerations of the non-negligible power consumption of leakage current [12]. For uniprocessor scheduling, Irani et al. [10] proposed approximation algorithms for aperiodic real-time tasks. For periodic real-time tasks in uniprocessor systems, Jejurikar et al. [12], Lee et al. [14], and Chen et al. [8] provided scheduling algorithms with task procrastination to decide when to turn the processor into a dormant mode. Moreover, Chen et al. [6] developed approximation algorithms for multiprocessor leakage-aware scheduling.
∗ Supported in part by research grants from ROC National Science Council NSC-95-2752-E-002-008-PAE, Aim for Top University Plan 95R0062-A100-07, and IBM Faculty Award.
However, most studies for energy-efficient real-time task scheduling do not take task rejection into consideration. Most heuristics for multiprocessor energy-efficient scheduling cannot guarantee the schedulability of the derived schedules. Chen et al. [6] applied the constraint violation approach to augment the highest available speed with a $\frac{4}{3}$ factor. However, resource augmentation might not be possible since it is hardware-dependent. Hence, some tasks might be rejected to guarantee the schedulability of the selected tasks.
This research explores systems with the possibility to reject a task for execution with a specified cost (penalty). If a task is more important than another, its rejection penalty should be specified with a greater value. We consider a homogeneous multiprocessor system with continuously available speeds or discretely available speeds. The objective is to minimize the summation of the total rejection cost for the tasks that are not completed in time and the energy consumption of the system. The contribution of this paper is twofold. First, we show the $\mathcal{NP}$-hardness of the studied problem and provide analysis on the non-existence of polynomial-time approximation algorithms, provided that $\mathcal{P} \neq \mathcal{NP}$. Second, we propose a branch-and-bound approach and heuristic algorithms. The proposed algorithms are evaluated by extensive experiments. The evaluation results reveal that our proposed algorithms could derive effective solutions of the energy-efficient scheduling problem with task rejection considerations.
The rest of this paper is organized as follows: Section 2 defines the energy-efficient task scheduling problem with task rejection and provides the hardness analysis. Section 3 presents our algorithms. Experimental results for the performance evaluation of the proposed algorithms are presented in Section 4. Section 5 is the conclusion.
2. Problem Definition and Hardness Analysis
Processor models This paper explores energy-efficient scheduling on $M$ homogeneous DVS multiprocessors, where the power consumption function of each task is the same on every processor. The power consumption function $P(s)$ of the adopted processor speed $s$ on a DVS processor can be divided into two parts, $P_d(s)$ and $P_{ind}$, in which $P_d(s)$ is dependent ($P_{ind}$ is independent, respectively) upon the processor speed $s$ [21]. The speed-dependent power consumption function is mainly contributed by the dynamic power consumption resulting from the charging or discharging of CMOS gates and the short-circuit power consumption, while the leakage power consumption contributes the major part of the speed-independent power consumption. The algorithms proposed in this paper can be adopted with many power consumption function formulations, such as those in [16, §5.5]. We consider systems with $P_d(s)$ as a convex and increasing function, e.g., $P_d(s) \propto s^{\alpha}$ for any $\alpha > 1$.
The number of CPU cycles executed in a time interval is linear in the processor speed. That is, the number of CPU cycles completed in the time interval $(t_1, t_2]$ is $\int_{t_1}^{t_2} s(t)\,dt$, where $s(t)$ is the processor speed at time $t$. The energy consumed in $(t_1, t_2]$ is $\int_{t_1}^{t_2} P(s(t))\,dt$.
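For a piecewise-constant speed profile, the two integrals above reduce to sums. The following minimal sketch illustrates this; the function names and the choice $P(s) = s^3$ are assumptions made only for the example and are not taken from the paper.

```python
def cycles_and_energy(segments, power):
    """Return (cycles, energy) for a piecewise-constant speed profile.

    segments: list of (duration, speed) pairs; on each segment the two
    integrals reduce to duration * speed and duration * P(speed).
    """
    cycles = sum(d * s for d, s in segments)
    energy = sum(d * power(s) for d, s in segments)
    return cycles, energy


def cubic_power(s):
    # Assumed for illustration: P(s) = s^3 with no speed-independent part.
    return s ** 3


# Two ways to complete 10 cycles within 10 time units.
steady = cycles_and_energy([(10.0, 1.0)], cubic_power)
uneven = cycles_and_energy([(5.0, 1.5), (5.0, 0.5)], cubic_power)
print(steady, uneven)
```

Both profiles complete the same 10 cycles, but the uneven one consumes 17.5 units of energy against 10.0 for the constant speed, which is the effect of the convexity of $P(s)$.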
We first target ideal processors, in which a processor may operate at any speed in $[S_{min}, S_{max}]$. We also show the extension to cope with non-ideal processors with discrete speeds. For non-ideal processors, there are $H$ available speeds indexed by $s_1, s_2, \ldots, s_H$ in an increasing order. For non-ideal processors, for brevity, $s_{H+1}$ and $P(s_{H+1})$ are both assumed to be $\infty$, $S_{min}$ is $s_1$, and $S_{max}$ is $s_H$.
When needed, turning the processor into a dormant mode (or turning the processor off) might further reduce the energy consumption. However, turning off or waking up a processor takes time and has energy overheads. For processors with non-negligible overheads to be turned off, the overheads could be treated as part of the overheads to turn on the processor [6, 10]. We denote $E_{sw}$ ($t_{sw}$, respectively) as the energy (the time, respectively) requirement of the switching overheads for the whole process of turning off the processor and then turning on the processor.
Task models Tasks considered in this paper are periodic and independent in execution. A periodic task is an infinite sequence of task instances, referred to as jobs, where each job of a task comes in a regular period. Each task $\tau_i$ is associated with its initial arrival time (denoted as $a_i$), its computation requirement in CPU cycles (denoted as $c_i$), and its period (denoted as $p_i$). The relative deadline of each task $\tau_i$ is equal to its period $p_i$. That is, the arrival time and deadline of the $j$-th job of task $\tau_i$ are $a_i + (j-1) \cdot p_i$ and $a_i + j \cdot p_i$, respectively. We assume that all the tasks arrive at time 0, but extensions can be achieved easily for tasks with different arrival times. Given a task set $\mathbf{T}$, the hyper-period of $\mathbf{T}$, denoted by $L$, is defined as the minimum $L$ so that $L/p_i$ is an integer for any task $\tau_i$ in $\mathbf{T}$. For example, $L$ is the least common multiple (LCM) of the periods of tasks in $\mathbf{T}$ when the periods of tasks are all integers. Without loss of generality, we only consider tasks $\tau_i$ with $\frac{c_i}{p_i} \le S_{max}$, since it is not possible to complete any task $\tau_j$ with $\frac{c_j}{p_j} > S_{max}$ in time.
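The task model can be made concrete with a small sketch that computes the hyper-period for integer periods and the arrival time and deadline of a job. The helper names and the sample periods are hypothetical and serve only as an illustration.

```python
from functools import reduce
from math import gcd


def hyper_period(periods):
    """Least common multiple of integer periods."""
    return reduce(lambda a, b: a * b // gcd(a, b), periods, 1)


def job_window(a_i, p_i, j):
    """Arrival time and absolute deadline of the j-th job (j >= 1) of a task."""
    return a_i + (j - 1) * p_i, a_i + j * p_i


periods = [4, 6, 10]
L = hyper_period(periods)
jobs_per_task = [L // p for p in periods]
print(L, jobs_per_task, job_window(0, 6, 3))
```

For these periods the hyper-period is 60, so the three tasks release 15, 10, and 6 jobs per hyper-period, and the third job of the task with period 6 must run within the window from 12 to 18.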
This research explores systems with the possibility to reject a task for execution with a specified cost (penalty) provided by system designers. If a task is more important than another, its rejection cost should be specified with a greater value. If a task instance of task $\tau_i$ is not completed in time, the system receives a penalty of $\chi_i$, where $\chi_i > 0$. (If a task can be rejected without penalty, we can reject the task directly.) If a task is very important and cannot be rejected, its rejection cost should be specified as $\infty$. If the rejection costs of all the tasks are infinite, all the tasks are asked to be completed in time.
Problem definition This paper explores the problem of minimizing the energy consumption of the system and the rejection cost at the same time. We pursue the objective on the linear combination of the energy consumption and the rejection cost, i.e., $(1-\alpha)E + \alpha\Pi$, where $\alpha$ is a non-negative factor no more than 1 specified by the system designer, $E$ is the energy consumption of the system in the hyper-period, and $\Pi$ is the total rejection penalty of the task instances missing their deadlines in the hyper-period. If energy consumption minimization is more important than task rejection penalty minimization, $\alpha$ should be specified as close to 0, and vice versa.
For notational brevity, we normalize the rejection penalty of task $\tau_i$ as $\alpha\chi_i$, the power consumption function $P()$ as $(1-\alpha)P()$, and the energy switching overheads as $(1-\alpha)E_{sw}$. Hence, the objective of the linear combination can be treated as the summation of the (normalized) penalty and the (normalized) energy consumption.
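Written out, the normalization amounts to the following substitution. The tilde notation is introduced here only for illustration and does not appear in the original text.

```latex
\[
(1-\alpha)E + \alpha\Pi \;=\; \tilde{E} + \tilde{\Pi},
\qquad
\tilde{P}(s) = (1-\alpha)P(s),\quad
\tilde{E}_{sw} = (1-\alpha)E_{sw},\quad
\tilde{\chi}_i = \alpha\chi_i ,
\]
```

Here $\tilde{E}$ is the energy consumption computed with $\tilde{P}$ and $\tilde{E}_{sw}$, and $\tilde{\Pi}$ is the total penalty computed with the $\tilde{\chi}_i$, so the weighted objective becomes a plain sum.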
The problem explored in this paper is defined as follows:
DEFINITION 1. Energy-eFFicient schEduling with rejeCting Tasks (EFFECT): Consider a task set $\mathbf{T}$ of $N$ independent tasks over $M$ identical processors with a common power consumption function $P(s)$. Each periodic task $\tau_i \in \mathbf{T}$ arrives at time 0 and is associated with a computation requirement of $c_i$ CPU cycles, a rejection cost (penalty) $\chi_i$, and a period $p_i$, where the relative deadline of task $\tau_i$ is $p_i$. The energy consumption and timing of the switching overheads are $E_{sw}$ and $t_{sw}$, respectively. The problem is to derive a schedule of $\mathbf{T}$ to minimize the summation of the penalty (cost) of the task instances that miss their deadlines and the energy consumption of the system in the hyper-period $L$ of tasks in $\mathbf{T}$, in which a job of task $\tau_i$ is executed entirely on a processor.
For brevity, for the rest of this paper, the objective function of the EFFECT problem is called the energy-penalty (EP for short).
Hardness analysis Since most previous studies on multiprocessor energy-efficient scheduling did not take task rejection penalty into consideration, the schedulability of the derived schedules cannot be guaranteed, e.g., [4, 9]. As shown in [6], it is $\mathcal{NP}$-hard to derive a schedule with the minimum energy consumption to complete all the tasks in time without rejecting any real-time task. The following lemma shows that the EFFECT problem is still $\mathcal{NP}$-hard even if we have the flexibility to reject some tasks for execution.
LEMMA 1. The EFFECT problem is $\mathcal{NP}$-hard in a strong sense even when $E_{sw}$ is 0 and all the tasks have the same rejection penalty.
Proof. It can be proved by a reduction from the leakage-aware multiprocessor energy-efficient rejection problem [6] with the same period $p$. The rejection cost of each task is a constant greater than $P(S_{max}) \cdot p$. The detail is omitted due to space limitation.
Due to the $\mathcal{NP}$-hardness of the EFFECT problem, polynomial-time approximation algorithms might be pursued for the provision of approximated solutions with worst-case guarantees. A polynomial-time $\beta$-approximation algorithm for the EFFECT problem must have polynomial-time complexity in the input size and could derive a solution with an objective value at most $\beta$ times that of an optimal solution, for any input instance. However, in addition to the $\mathcal{NP}$-hardness of the EFFECT problem, the following theorem shows the hardness on the approximability of polynomial-time algorithms.
THEOREM 1. There does not exist any polynomial-time approximation algorithm for the EFFECT problem unless $\mathcal{P} = \mathcal{NP}$.
Proof. This theorem can be proved by a gap reduction from the $\mathcal{NP}$-complete PARTITION problem: Given a set of $N$ non-negative numbers, denoted by $o_1, o_2, \ldots, o_N$, the PARTITION problem is to answer whether there is a partition of these $N$ numbers into two sets, so that the sum of the numbers in each set is the same. Suppose for contradiction that there is a polynomial-time $(1+\epsilon)$-approximation algorithm, denoted by Algorithm $A$, with $\epsilon > 0$ for the EFFECT problem. We will show that we can use Algorithm $A$ to answer the PARTITION problem in polynomial time, which contradicts the assumption that $\mathcal{P} \neq \mathcal{NP}$.
To solve the PARTITION problem by applying Algorithm $A$, we have to create an input instance for the EFFECT problem. For each number $o_i$, a unique task $\tau_i$ is created with $c_i$ as $o_i$, $p_i$ as $\frac{\sum_{j=1}^{N} o_j}{2}$, and $\chi_i$ as $(1+\epsilon)\left(\sum_{j=1}^{N} o_j\right)$, where $P(s) = s^3$ and $E_{sw} = 0$. Moreover, $S_{max}$ is 1, and $S_{min}$ is no more than 1. If the input instance of the PARTITION problem admits a positive answer, the optimal solution for the constructed input instance is $\sum_{j=1}^{N} o_j$. By the construction, there exists no feasible solution with EP more than $\sum_{j=1}^{N} o_j$ and no more than $(1+\epsilon)\sum_{j=1}^{N} o_j$. Since Algorithm $A$ is a $(1+\epsilon)$-approximation algorithm, Algorithm $A$ is guaranteed to derive a solution whose EP is $\sum_{j=1}^{N} o_j$. If the input instance of the PARTITION problem does not admit a positive answer, the EP of the solution returned by Algorithm $A$ must be greater than $\sum_{j=1}^{N} o_j$.
Since the construction of the input instance of the EFFECT problem takes $O(N)$ time, and Algorithm $A$ has polynomial-time complexity, we can determine whether an input instance of the PARTITION problem admits a positive answer in polynomial time by verifying the solution of Algorithm $A$, which is a contradiction.
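The construction used in the proof can be sketched in a few lines. The dictionary layout, the function names, and the use of two processors (our reading of "a partition into two sets"; the number of processors is not stated in the extracted text) are assumptions of this example, not part of the paper.

```python
from fractions import Fraction


def effect_instance_from_partition(numbers, eps):
    """Build the EFFECT instance of the gap reduction from PARTITION numbers."""
    total = sum(numbers)
    period = Fraction(total, 2)          # p_i = (sum of o_j) / 2
    penalty = (1 + eps) * total          # chi_i = (1 + eps) * sum of o_j
    tasks = [{"c": o, "p": period, "chi": penalty} for o in numbers]
    # M = 2 is an assumption of this sketch; P(s) = s^3, E_sw = 0, S_max = 1.
    return {"tasks": tasks, "M": 2, "S_max": 1, "E_sw": 0,
            "power": lambda s: s ** 3}


def energy_if_all_accepted(instance, side):
    """Energy over one hyper-period when task i runs on processor side[i].

    Returns None if some processor would need a speed above S_max.
    """
    period = instance["tasks"][0]["p"]   # all periods are equal here
    energy = 0
    for m in range(instance["M"]):
        util = sum(t["c"] / t["p"]
                   for t, s in zip(instance["tasks"], side) if s == m)
        if util > instance["S_max"]:
            return None
        energy += instance["power"](util) * period
    return energy


numbers = [3, 1, 1, 2, 2, 1]
inst = effect_instance_from_partition(numbers, Fraction(1, 10))
balanced = energy_if_all_accepted(inst, [0, 1, 1, 0, 1, 1])
unbalanced = energy_if_all_accepted(inst, [0, 0, 1, 0, 1, 1])
print(balanced, unbalanced)
```

With a perfect partition both processors run at speed 1 for the whole period, so the EP equals the sum of the numbers (10 here) and no task is rejected. An unbalanced assignment overloads one processor, so at least one task must be rejected, and a single rejection already costs $(1+\epsilon)$ times that sum.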
3. Our Algorithms
By Theorem 1, it is impossible to derive optimal solutions or approximated solutions with worst-case guarantee for the EFFECT problem in polynomial time, unless $\mathcal{P} = \mathcal{NP}$. This section provides a branch-and-bound approach and heuristics to derive solutions. We first partition tasks into $M+1$ task sets, denoted by $\mathbf{T}_1, \mathbf{T}_2, \ldots, \mathbf{T}_M, \mathbf{T}_{M+1}$, so that the tasks in task set $\mathbf{T}_m$ are executed on the $m$-th processor for $m \le M$ and the tasks in $\mathbf{T}_{M+1}$ are rejected. The off-line derivation is obtained by assuming negligible switching overheads. Whether a rejected task instance determined in the off-line phase can be executed for performance improvement is decided in an on-line fashion.
If a task has a high computation requirement but a low rejection penalty, it should be a good candidate to be rejected to reduce the EP, and vice versa. For the rest of this section, tasks are sorted non-increasingly according to $\frac{\chi_i}{c_i}$. We will consider the execution or rejection of tasks in the sorted order. Moreover, throughout this section, the earliest-deadline-first (EDF) schedule will be applied for task scheduling on each processor. By [3], a task set $\mathbf{T}_m$ is schedulable on a processor if and only if $\sum_{\tau_i \in \mathbf{T}_m} \frac{c_i}{p_i} \le S_{max}$.
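The ordering rule and the EDF schedulability condition are simple to state in code. The sketch below assumes tasks are plain dictionaries with keys `c`, `p`, and `chi`; this representation and the function names are choices made for the example.

```python
def sort_by_penalty_density(tasks):
    """Sort tasks non-increasingly by chi_i / c_i (penalty per CPU cycle)."""
    return sorted(tasks, key=lambda t: t["chi"] / t["c"], reverse=True)


def edf_schedulable(task_set, s_max):
    """EDF test on one processor: total utilization must not exceed S_max."""
    return sum(t["c"] / t["p"] for t in task_set) <= s_max


tasks = [
    {"name": "a", "c": 1, "p": 10, "chi": 0.5},
    {"name": "b", "c": 1, "p": 5, "chi": 4.0},
    {"name": "c", "c": 6, "p": 10, "chi": 6.0},
]
order = [t["name"] for t in sort_by_penalty_density(tasks)]
print(order, edf_schedulable(tasks, 1.0), edf_schedulable(tasks, 0.5))
```

Task `b` has the highest penalty per cycle and is therefore considered first, while task `a` is the most attractive candidate for rejection. The three tasks together need a speed of 0.9, so they fit on one processor with $S_{max} = 1$ but not with $S_{max} = 0.5$.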
3.1 Off-line derivation of task partitions with negligible switching overheads
Although the power consumption function $P(s)$ is a convex and increasing function, the energy consumption per cycle at speed $s$, which is $\frac{P(s)}{s}$, might not be. For example, if $P(s) = s^3 + \gamma$, $\frac{P(s)}{s}$ is a decreasing function for $s$ in $(0, \sqrt[3]{\gamma/2}]$ and an increasing function for $s$ in $(\sqrt[3]{\gamma/2}, S_{max}]$. If the switching overheads are negligible, there is a lower-bounded execution speed for tasks, referred to as the critical speed $s^*$ as in [6, 8, 12]. For ideal processors, the critical speed $s^*$ can be derived by solving $\frac{d(P(s^*)/s^*)}{ds^*} = 0$ [6]. By the definition, if $s^*$ is less than $S_{min}$, the critical speed $s^*$ is revised as $S_{min}$. If $s^* > S_{max}$, $s^*$ is $S_{max}$. For non-ideal processors, the critical speed $s^*$ is $s_h$ with $P(s_{h+1})/s_{h+1} > P(s_h)/s_h$ and $P(s_{h-1})/s_{h-1} \ge P(s_h)/s_h$ for $h = 1, 2, \ldots, H$ by taking $P(s_0)/s_0$ and $P(s_{H+1})/s_{H+1}$ as $\infty$ for boundary checking.
For clarity, we first focus on systems with ideal processors. The extensions to systems with non-ideal processors will be shown by the end of this subsection. A task partition is said to be a feasible solution if all the selected tasks for execution can meet their deadlines.
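The critical speed can be illustrated for the example power function $P(s) = s^3 + \gamma$ and for a list of discrete speeds. This is a minimal sketch: the function names are hypothetical, the clamping to $[S_{min}, S_{max}]$ follows the usual reading of the definition, and the discrete version assumes that $P(s)/s$ has a single minimum over the available speeds.

```python
def critical_speed_ideal(gamma, s_min, s_max):
    """Critical speed for P(s) = s^3 + gamma on an ideal processor.

    P(s)/s = s^2 + gamma/s is minimized at (gamma/2)^(1/3); the result is
    then clamped to the available range [s_min, s_max].
    """
    s_star = (gamma / 2.0) ** (1.0 / 3.0)
    return min(max(s_star, s_min), s_max)


def critical_speed_discrete(speeds, power):
    """Available speed with the smallest energy per cycle P(s)/s.

    On ties the higher speed is kept, matching the strict/non-strict
    inequalities in the text.
    """
    best = speeds[0]
    for s in speeds[1:]:
        if power(s) / s <= power(best) / best:
            best = s
    return best


def example_power(s):
    return s ** 3 + 0.25   # gamma = 0.25, chosen for illustration


print(critical_speed_ideal(0.25, 0.1, 1.0))
print(critical_speed_discrete([0.2, 0.4, 0.6, 0.8, 1.0], example_power))
```

With $\gamma = 0.25$ the unconstrained minimizer is $\sqrt[3]{0.125} = 0.5$. On the discrete speed list, the speed 0.6 has the lowest energy per cycle, so it plays the role of $s^*$ there.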
3.1.1 A branch-and-bound approach for ideal processors
For a given task partition $(\mathbf{T}^*_1, \mathbf{T}^*_2, \ldots, \mathbf{T}^*_M, \mathbf{T}^*_{M+1})$, let $U_m$ be defined as $\sum_{\tau_i \in \mathbf{T}^*_m} \frac{c_i}{p_i}$. If $U_m \le S_{max}$ for all $m = 1, 2, \ldots, M$, the earliest-deadline-first (EDF) schedule on each processor, executing all the tasks in $\mathbf{T}^*_m$ at speed $\max\{s^*, U_m\}$, can make all the tasks in $\mathbf{T}^*_m$ complete in time with the minimum energy consumption for the task partition [3]. Therefore, we can apply the depth-first search in a search tree to obtain the task partition $(\mathbf{T}^*_1, \mathbf{T}^*_2, \ldots, \mathbf{T}^*_M, \mathbf{T}^*_{M+1})$ with the minimum EP in $O((N + M)N^{M+1})$ time.
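Evaluating the EP of one given task partition can be sketched as follows, assuming negligible switching overheads, a processor that consumes no energy while dormant, and a per-processor speed equal to the larger of the critical speed and the utilization. The data layout and names are assumptions of the example.

```python
def energy_penalty(partition, rejected, power, s_crit, s_max, L):
    """EP of a task partition over one hyper-period of length L.

    partition: list of task lists, one per processor; rejected: task list.
    Returns None if some processor fails the EDF utilization test.
    """
    total = 0.0
    for task_set in partition:
        u = sum(t["c"] / t["p"] for t in task_set)
        if u > s_max:
            return None                      # infeasible partition
        if u > 0:
            speed = max(s_crit, u)
            busy_time = L * u / speed        # processor is dormant otherwise
            total += power(speed) * busy_time
    # Every job of a rejected task misses its deadline: L / p_i jobs each.
    total += sum(t["chi"] * L / t["p"] for t in rejected)
    return total


def example_power(s):
    return s ** 3 + 0.25   # assumed power function; its critical speed is 0.5


t1 = {"c": 8, "p": 10, "chi": 5.0}
t2 = {"c": 1, "p": 5, "chi": 4.0}
t3 = {"c": 2, "p": 20, "chi": 3.0}
ep = energy_penalty([[t1], [t2]], [t3], example_power, 0.5, 1.0, 20)
print(ep)
```

In this example the first processor runs continuously at speed 0.8, the second runs at the critical speed 0.5 for 40% of the time, and the rejected task contributes one penalty of 3 per hyper-period.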
The branch-and-bound (BB) approach can be adopted to reduce the time spent on exploring the solution space. Since homogeneous multiprocessor systems are under consideration, we can, by symmetry, restrict $\tau_1$ to be either executed on the first processor or rejected. In our BB approach, we visit the search tree rooted from $\tau_1$, and the $k$-th level represents the assignment of task $\tau_k$ to a task set $\mathbf{T}_m$ with $m = 1, 2, \ldots, M, M+1$.
Suppose that we are at the $n$-th level in the search tree. The basic pruning condition is on the schedulability test. If $\frac{c_n}{p_n} + \sum_{\tau_i \in \mathbf{T}_m} \frac{c_i}{p_i}$ is greater than $S_{max}$, the BB approach can eliminate all subsets containing the infeasible subset. The lower-bounded elimination is applied by verifying whether the lower bound of the EP of the feasible solutions for the subsets of solutions rooted at the $n$-th level is lower than the best solution derived so far. If the lower bound is greater than the best solution derived so far, we can prune all the subsets rooted at the $n$-th level. For a specified partition of set $\{\tau_i \mid 1 \le i \le n\}$ into two disjoint sets $\mathbf{T}^{\dagger}$ and $\mathbf{T}^{\ddagger}$ by rejecting all the tasks in $\mathbf{T}^{\ddagger}$ and executing all the tasks in $\mathbf{T}^{\dagger}$, Algorithm LEP, shown in Algorithm 1, can be applied to calculate a lower bound of the EP of feasible solutions, where $P^*(s)$ in Steps 4, 6, and 9 is

$$P^*(s) = \begin{cases} P(s), & \text{when } s > s^*, \\ \frac{s}{s^*} P(s^*), & \text{otherwise.} \end{cases} \qquad (1)$$

The proof for the correctness on the provision of the lower-bounded EP of Algorithm LEP is omitted due to space limitation.
The branch-and-bound approach is presented in Procedure DFSBB in Algorithm 2, in which the search space is pruned with the feasibility test in Step 2 and Step 3 and the lower-bounded elimination between Step 9 and Step 13. The solution in this phase is obtained by calling DFSBB(1, $X$) with the initialization shown in Procedure BB in Algorithm 2.

Algorithm 1 : LEP
Input: $\mathbf{T}^{\dagger}, \mathbf{T}^{\ddagger}, n$;
1: $\hat{\mathbf{T}} \leftarrow \{\tau_i \mid n < i \le N\}$;
2: $y_i \leftarrow 0, \forall \tau_i \in \hat{\mathbf{T}}$, $U_1 \leftarrow \sum_{\tau_i \in \mathbf{T}^{\dagger}} \frac{c_i}{p_i}$;
3: for ($i \leftarrow n+1$; $i \le N$; $i \leftarrow i+1$) do
4:     Let $y_i$ be the value between 0 and 1 which minimizes $P^*\!\left(\frac{\frac{c_i}{p_i} y_i + U_1}{M}\right) M + (1 - y_i)\frac{\chi_i}{p_i}$ with $\frac{c_i}{p_i} y_i + U_1 \le M \cdot S_{max}$;
5:     if ($y_i < 1$) then
6:         return $L \cdot \left( P^*\!\left(\frac{\frac{c_i}{p_i} y_i + U_1}{M}\right) M + (1 - y_i)\frac{\chi_i}{p_i} + \sum_{\tau_j \in \mathbf{T}^{\ddagger}} \frac{\chi_j}{p_j} + \sum_{j=i+1}^{N} \frac{\chi_j}{p_j} \right)$;
7:     else
8:         $U_1 \leftarrow U_1 + \frac{c_i}{p_i}$;
9: return $L \cdot \left( P^*\!\left(\frac{U_1}{M}\right) M + \sum_{\tau_j \in \mathbf{T}^{\ddagger}} \frac{\chi_j}{p_j} \right)$;

Algorithm 2 : BB
Procedure: DFSBB($n$, $X$)
Input: $n$, $X$, where $X_i$ is an integer between 1 and $M+1$ for $i < n$;
1: for $m \leftarrow 1$; $m \le M+1$; $m \leftarrow m+1$ do
2:     if $m \le M$ and $\frac{c_n}{p_n} + \sum_{i:\, 1 \le i \le n-1 \text{ and } X_i \text{ is } m} \frac{c_i}{p_i} > S_{max}$ then
3:         continue;
4:     $X_n \leftarrow m$;
5:     if $n$ is equal to $N$ then
6:         evaluate the EP by executing $\tau_i$ at the $X_i$-th processor with $X_i \le M$ and rejecting tasks $\tau_i$ with $X_i = M+1$;
7:         save this task partition if the EP is better than the best solution so far;
8:     else
9:         $\mathbf{T}^{\dagger} \leftarrow \{\tau_i \mid 1 \le i \le n \text{ and } X_i \le M\}$;
10:        $\mathbf{T}^{\ddagger} \leftarrow \{\tau_i \mid 1 \le i \le n \text{ and } \tau_i \notin \mathbf{T}^{\dagger}\}$;
11:        $EP_m \leftarrow$ LEP($\mathbf{T}^{\dagger}, \mathbf{T}^{\ddagger}, n$);
12:        if $EP_m$ is greater than the best solution so far then
13:            continue;
14:        else
15:            call DFSBB($n+1$, $X$)
Procedure: BB()
1: sort tasks in $\mathbf{T}$ non-increasingly according to $\frac{\chi_i}{c_i}$;
2: initialize $X$ with $X_i \leftarrow M+1$, for $i = 1, 2, \ldots, N$;
3: call DFSBB(1, $X$) to obtain the task partition;
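The two procedures can be sketched compactly in Python. This is an illustrative adaptation rather than the authors' implementation: the function signatures, the dictionary-based task records, the 0-based processor indices (index `M` stands for rejection), and the numerical ternary search used to minimize over $y_i$ in Step 4 are all assumptions of this sketch. Floating-point comparisons are used as is, so utilizations that land exactly on $S_{max}$ may need a tolerance.

```python
def make_p_star(power, s_crit):
    """P*(s) of Eq. (1): P(s) above the critical speed, linear below it."""
    def p_star(s):
        return power(s) if s > s_crit else s / s_crit * power(s_crit)
    return p_star


def lep(accepted_util, rejected_rate, rest, M, s_max, L, p_star):
    """Lower bound on the EP (after Algorithm LEP).

    accepted_util: total c/p of accepted tasks; rejected_rate: total chi/p
    of rejected tasks; rest: undecided tasks in sorted order.
    """
    u1 = accepted_util
    for k, t in enumerate(rest):
        u, rate = t["c"] / t["p"], t["chi"] / t["p"]

        def f(y):
            return M * p_star((u * y + u1) / M) + (1 - y) * rate

        lo, hi = 0.0, min(1.0, max(0.0, (M * s_max - u1) / u))
        for _ in range(200):                 # ternary search, f is convex
            m1, m2 = lo + (hi - lo) / 3, hi - (hi - lo) / 3
            if f(m1) < f(m2):
                hi = m2
            else:
                lo = m1
        y = (lo + hi) / 2
        if y < 1 - 1e-9:                     # task only partly accepted
            tail = sum(r["chi"] / r["p"] for r in rest[k + 1:])
            return L * (f(y) + rejected_rate + tail)
        u1 += u
    return L * (M * p_star(u1 / M) + rejected_rate)


def branch_and_bound(tasks, M, s_max, s_crit, power, L):
    """Depth-first branch-and-bound (after Procedures BB and DFSBB)."""
    tasks = sorted(tasks, key=lambda t: t["chi"] / t["c"], reverse=True)
    N, p_star = len(tasks), make_p_star(power, s_crit)
    best = {"ep": float("inf"), "X": None}
    X, util = [M] * N, [0.0] * M

    def dfs(n, rejected_rate):
        t = tasks[n]
        u, rate = t["c"] / t["p"], t["chi"] / t["p"]
        for m in range(M + 1):
            if m < M and util[m] + u > s_max:
                continue                     # feasibility pruning
            X[n] = m
            if m < M:
                util[m] += u
            rej = rejected_rate + (rate if m == M else 0.0)
            if n == N - 1:
                ep = L * (sum(p_star(x) for x in util) + rej)
                if ep < best["ep"]:
                    best["ep"], best["X"] = ep, X[:]
            elif lep(sum(util), rej, tasks[n + 1:], M, s_max, L,
                     p_star) <= best["ep"]:
                dfs(n + 1, rej)              # lower bound does not prune
            if m < M:
                util[m] -= u

    dfs(0, 0.0)
    return best["ep"], best["X"], tasks


def example_power(s):
    return s ** 3 + 0.25   # assumed power function; its critical speed is 0.5


example_tasks = [
    {"name": "t1", "c": 7, "p": 10, "chi": 8.0},
    {"name": "t2", "c": 1, "p": 5, "chi": 4.0},
    {"name": "t3", "c": 1, "p": 20, "chi": 0.1},
    {"name": "t4", "c": 12, "p": 20, "chi": 2.0},
]
best_ep, assignment, order = branch_and_bound(
    example_tasks, M=2, s_max=1.0, s_crit=0.5, power=example_power, L=20)
print(best_ep, assignment, [t["name"] for t in order])
```

On this small instance the search keeps `t2` and `t1` on separate processors and rejects `t4` and `t3`, whose penalties are lower than the energy needed to run them. The lower bound pools all $M$ processors into one load $U_1/M$ per processor, which is why it can be evaluated without deciding the placement of the remaining tasks.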
3.1.2 Polynomial-time algorithms for ideal processors
This section presents efficient algorithms, i.e., in polynomial time, for the determination of the task partition. The rationale behind the proposed algorithms is to select tasks with higher χi