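Research on Advanced Technologies for Tera-Scale System-on-Chip, Subproject 1: Energy-Efficient Memory Hierarchy for Platform-based SoC (2/3)

National Science Council, Executive Yuan: Research Project Interim Progress Report (Condensed Version)

Project type: Integrated project
Project number: NSC 95-2221-E-002-360-
Project period: August 1, 2006 to July 31, 2007
Institution: Department of Computer Science and Information Engineering, National Taiwan University
Principal investigator: Chia-Lin Yang (楊佳玲)
Attachments: report on attending an international conference, and the published paper
Handling: the interim report is not available for public inquiry

May 30, 2007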

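National Science Council, Executive Yuan: Subsidized Research Project Report

Energy-Efficient Memory Hierarchy for Platform-based SoC

Project type: Integrated project (not an individual project)

Project number: NSC 95-2221-E-002-360-

Project period: August 1, 2006 to July 31, 2007

Principal investigator: Chia-Lin Yang (楊佳玲), Associate Professor, Department of Computer Science and Information Engineering, National Taiwan University

Project participants: 陳依蓉, 林仲祥, 林業峻, 陳彥名, 李翰林

Institution: Department of Computer Science and Information Engineering, National Taiwan University

May 27, 2007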

Energy-Efficient Memory Hierarchy for Platform-based SoC

Project number: NSC95-2221-E-002-360-
Project period: August 1, 2006 to July 31, 2007
Principal investigator: Chia-Lin Yang, Associate Professor, Department of Computer Science and Information Engineering, National Taiwan University

I. Abstract

Leakage energy consumption is an increasingly important issue as process technology continues to shrink. Since on-chip caches constitute a major portion of the processor's transistor budget, several schemes have been proposed to reduce cache leakage. However, these schemes introduce unpredictable performance degradation and are therefore not suitable for hard real-time applications, which require the timing constraint to be met in all cases. In this project, we propose the first approach that applies existing low-leakage circuit techniques to hard real-time applications. The proposed timing-aware cache leakage control mechanism exploits task slack time to decide whether to put the cache lines of each task into a low-leakage state, while guaranteeing that every task completes within its timing constraint. The experimental results show that the proposed control policy achieves leakage reduction comparable to that of a leakage control policy that aggressively turns cache lines into the low-leakage mode without considering the timing constraint.

Keywords: leakage, cache, hard real-time system

II. Background and Objectives

Power consumption is becoming a critical design issue for embedded systems due to the popularity of portable devices such as cellular phones and personal digital assistants. As the technology continues to shrink, leakage power is becoming a dominant factor in overall CPU energy [13]. Leakage energy can be reduced by exploiting task idle time to shut down the CPU completely [4, 5, 9, 10] or to shut down individual microarchitectural components, for example, caches [7, 19] and branch predictors [11]. Previous work on applying shutdown techniques to hard real-time systems focuses only on turning off the CPU completely [4, 5, 9, 10]. We are not aware of any research work that applies microarchitectural leakage reduction techniques to hard real-time systems. This work is the first attempt to bridge this gap.

Since on-chip caches constitute a major portion of the processor's transistor budget and account for a significant share of leakage, we target on-chip cache leakage in this project. In fact, leakage is projected to account for 70% of the cache power budget in 70nm technology [13]. Therefore, reducing cache leakage power is important for reducing a processor's total leakage. Two types of circuit techniques have been proposed to reduce cache leakage: gated-Vdd [19] and drowsy caches [7]. The gated-Vdd technique turns off a cache line completely to save the maximum leakage power, but the cache line loses its state, so an incorrect turn-off decision results in a significant performance penalty. The drowsy cache technique uses a small supply voltage to retain the data in a memory cell in the low-leakage state [7, 14]. The drowsy cache technique therefore reduces leakage less than the gated-Vdd technique, but it incurs a much smaller penalty when a memory cell in the low-leakage state is accessed. The delay to switch a memory cell from the low-leakage state to the active state is called the wake-up overhead.

In this project, we propose the first timing-aware cache leakage control mechanism for hard real-time systems. To achieve energy savings with a hard real-time guarantee, we exploit both static and dynamic slack to tolerate the delay caused by accessing low-leakage cache lines. Unlike previous work that chooses between the drowsy cache and gated-Vdd, our scheme allows joint use of both techniques. We exploit task-level information to manage the cache lines of idle and active tasks differently. For cache lines allocated to an active task, the idle period between accesses is short, so only the drowsy cache technique is considered. These cache lines are turned into the drowsy mode periodically and woken up when they are accessed. The period at which all cache lines are turned into the drowsy mode is referred to as the drowsy window size. A smaller drowsy window size leads to higher leakage savings at the cost of higher wake-up overhead. Our timing-aware cache leakage control mechanism chooses the smallest drowsy window size for which the timing constraint is met. For cache lines allocated to idle tasks, we seek opportunities to turn the cache lines off completely to obtain more leakage savings, as long as the penalty of fetching data from the lower level of the memory hierarchy does not violate the timing constraint.

We evaluate the proposed leakage control scheme on 8 real applications. The experimental results show that with tight deadlines, the simple policy in [7] causes a high deadline miss ratio (e.g., with 1% static slack, the deadline miss ratio is up to 97.6%; see footnotes 1 and 2). This confirms our assertion that existing leakage reduction techniques are not suitable for hard real-time applications, and that a timing-aware leakage control scheme is a must. With 1% static slack, the proposed scheme achieves leakage reduction ranging from 78.4% to 86.9% with a hard real-time guarantee, while the simple policy achieves leakage reduction from 89.7% to 90.6% with tasks missing deadlines. This shows that the proposed scheme sacrifices some leakage savings to satisfy the timing constraint. As task slack increases, the gap in leakage savings between the proposed scheme and the simple policy narrows, and the leakage savings of the proposed method approach those of the simple policy. With 20% static slack, our scheme achieves 1.3% more leakage savings than the simple policy. Joint use of drowsy caches and gated-Vdd also leads to more leakage savings. When the proposed scheme has opportunities to turn off the cache lines of an idle task, it achieves 2.8% more leakage reduction than the version that uses drowsy caches only.

Footnote 1: Static slack = $1 - \sum_{i=1}^{n} W_i / P_i$, where $W_i$ and $P_i$ are the WCET and period of task i among the n tasks in a task set.

Footnote 2: Deadline miss ratio = $N_{miss\ tasks} / N_{total\ tasks}$, where $N_{miss\ tasks}$ is the number of tasks that missed their deadline, and $N_{total\ tasks}$ is the total number of executed tasks.

III. Methods and Results

In this section, we first introduce the system model used in this project, and then describe the proposed timing-aware leakage control policy.

1. System Model

The system consists of a task set of n periodic real-time tasks. The tasks are independent and preemptible. The task set is denoted as T = {τ1, τ2, ..., τn}, where τi denotes the i-th of the n tasks. Each τi has its own period $P_i$ and WCET $W_i$. We assume a task's deadline is equal to its period. Tasks are scheduled using the EDF scheduling policy: a task with an earlier deadline gets higher priority. The scheduler has two queues, the waiting queue ($Q_{waiting}$) and the ready queue ($Q_{ready}$). The waiting queue contains the completed tasks, and the ready queue contains the running and preempted tasks. The task that is currently running is the active task, and the completed and preempted tasks are idle tasks. The schedulability of a task set is tested by the CPU utilization U, defined as

$$U = \sum_{i=1}^{n} W_i / P_i$$

If U is less than 100%, the task set is said to be schedulable.
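The utilization test and the static slack used later in the report can be sketched in a few lines of Python. This is only an illustration: the function names and the example task set (WCET, period pairs) are hypothetical and do not come from the report's experiments.

```python
from fractions import Fraction


def utilization(tasks):
    """Return U = sum(W_i / P_i) for (wcet, period) pairs as an exact fraction."""
    return sum(Fraction(wcet, period) for wcet, period in tasks)


def static_slack(tasks):
    """Static slack as defined in the report: 1 - U."""
    return 1 - utilization(tasks)


def is_schedulable(tasks):
    """Schedulability test as stated in the report: U below 100%."""
    return utilization(tasks) < 1


# Hypothetical task set: (WCET, period) in cycles.
example_tasks = [(2, 10), (5, 20), (6, 40)]
print(utilization(example_tasks), static_slack(example_tasks), is_schedulable(example_tasks))
```

Exact fractions are used so that a task set sitting right at the utilization bound is not misclassified by floating-point rounding.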

The baseline cache architecture, which supports the cache locking described in [11], is shown in Figure 1. The lock_ctrl signal indicates whether a cache line can be replaced. We select the instructions to be locked in the instruction cache using the locking algorithm described in [11]. Each cache line is associated with leakage mode bits that select its supply voltage. A cache line can be turned into either the drowsy mode or the state-destructive mode (i.e., the gated-Vdd circuit). We use the terms drowsy mode and state-preserving mode interchangeably in this report.

[Figure 1: block diagram of the tag array and data array, each with decoders, leakage mode bits, and a voltage controller; the address is split into tag and index, and the tag comparison produces the hit signal together with lock_ctrl.]

Figure 1. Baseline cache architecture of the proposed scheme.

2. Timing-aware leakage control

The objective of the proposed leakage management scheme is to determine the drowsy window size for active tasks and the leakage mode for idle tasks, provided that the timing constraint is not violated. The details of the proposed scheme are described as follows.

2.1 Leakage Control Scheme for Active Tasks

The leakage control scheme for active tasks is based on the Drowsy+Simple policy proposed in [7]. Unlike Drowsy+Simple, which uses a fixed drowsy window size, the proposed scheme adjusts the drowsy window size dynamically with a hard real-time guarantee. The drowsy window size affects both the leakage savings and the performance overhead caused by waking up drowsy cache lines. With a shorter window, cache lines are set to the drowsy mode more frequently, which achieves higher leakage reduction but also causes higher wake-up overhead. As illustrated in Figure 2, to meet the timing constraint, the total wake-up overhead cannot exceed a task's slack. Therefore, our leakage control scheme chooses the smallest drowsy window size for which the timing constraint is met, that is, for which the wake-up overhead of all drowsy windows does not exceed the total slack time. The slack time of a task comes from two sources: static slack, which is computed based on the WCET, and dynamic slack, which is due to variations in task execution time. The leakage control scheme for active tasks contains an off-line phase and an on-line phase, which we describe in detail below.

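[Figure 2: timelines of one task period with and without leakage control. Without leakage control, the task executes at full speed and leaves slack time at the end of the period; with leakage control, the wake-up overhead of each drowsy window consumes that slack time.]

Figure 2. Illustration of using wake-up overheads to consume task slack time.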

2.1.1 Off-line Phase

Static Slack Allocation

We first allocate static slack to tasks statically, based on their worst-case preemption rates. According to the run-time slack reclamation algorithm described in the next section, the additional run-time slack of a low-priority task is less likely to be used by other tasks. Therefore, to increase the total CPU utilization, we allocate static slack to tasks with higher priorities.

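Assume that for all i, j, if i < j, then $P_i < P_j$. The number of preemptions $PN(\tau_k)$ of a task $\tau_k$ in the worst case is

$$PN(\tau_k) = \sum_{i=1}^{k-1} \left\lceil P_k / P_i \right\rceil$$

The static slack time $\rho_k$ allocated to a task $\tau_k$ is given by an expression in $(1 - U)$, $P_k$, $PN(\tau_k)$, and a sum over $PN(\tau_i)$ for i = 1, ..., n. [The exact form of this equation is not recoverable from the extracted text.]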
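The worst-case preemption count can be illustrated with a short Python sketch. It assumes that the rounding brackets lost from the extracted equation are ceilings, that is, $PN(\tau_k) = \sum_{i=1}^{k-1} \lceil P_k / P_i \rceil$; the function name and the example periods are hypothetical.

```python
def worst_case_preemptions(periods):
    """Return [PN(tau_1), ..., PN(tau_n)] for periods sorted in ascending order.

    Assumption: each higher-priority (shorter-period) task tau_i can preempt
    tau_k at most ceil(P_k / P_i) times within one period of tau_k.
    """
    if any(a >= b for a, b in zip(periods, periods[1:])):
        raise ValueError("periods must be strictly increasing (P_i < P_j for i < j)")
    # -(-a // b) is integer ceiling division for positive integers.
    return [
        sum(-(-p_k // p_i) for p_i in periods[:k])
        for k, p_k in enumerate(periods)
    ]


# Hypothetical periods in cycles.
print(worst_case_preemptions([10, 25, 60]))
```

The task with the shortest period has no higher-priority task above it, so its count is zero under this reading.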

Worst Case Active Set Analysis

To estimate the performance overhead of activating drowsy cache lines in a drowsy window, we need to predict the number of cache accesses in a drowsy window. In the worst case, the cache lines that can be accessed in a drowsy window are all the cache lines that could be accessed in the future. To obtain this information, we first construct the CFG (Control Flow Graph) of a program. In the CFG, each node represents a basic block, and an edge from node a to node b indicates that an execution path exits from basic block a to basic block b.

[Figure 3: a CFG with nodes B1 to B7. B1, B6, and B7 are normal basic blocks; B2, B3, B4, and B5 are merged into one basic block because they are in a loop. L(Bi) is the number of locked cache lines touched by Bi: L(B1) = 3, L(B2) = 2, L(B3) = 3, L(B4) = 1, L(B5) = 2, L(B6) = 4, L(B7) = 5. AS(Bi) is the active set size of basic block Bi, with AS(Bi) = max{L(Bi) + AS(Bj)}, where Bj is a child of Bi: AS(B7) = 5, AS(B6) = 4 + 5 = 9, AS(B2) = 7, AS(B1) = max(3 + 7, 3 + 9) = 12.]

Figure 3. Example of the CFG for the worst case active set analysis.

Figure 3 shows an example of the CFG on which the worst case active set (WCAS) analysis is performed. In Figure 3, each node is associated with L(Bi), which is the number of locked cache lines in basic block Bi. The WCAS size of each node Bi, denoted by AS(Bi), is the maximal number of locked cache lines that could be accessed from Bi. Therefore, AS(Bi) is calculated by

$$AS(B_i) = \max_{B_j \in child(B_i)} \{ L(B_i) + AS(B_j) \}$$

WCAS analysis is performed at compile time. To convey the WCAS size to the cache controller, which performs the leakage control, we use a store instruction that writes the WCAS size to the cache controller, and the cache controller triggers drowsy window resizing on receiving a WCAS size. To prevent frequent drowsy window resizing, we merge the basic blocks of a loop into one and insert the store instruction at the loop entry point. As shown in Figure 3, B2, B3, B4, and B5 form a loop, and the active set size information is recorded on B2 only.
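The WCAS recurrence can be checked against the numbers in Figure 3 with a small Python sketch. The graph encoding and the names below are illustrative assumptions: the merged loop (B2 to B5) is modelled as a single leaf node whose weight, 7, is taken directly from AS(B2) in the figure, and the CFG is assumed to be acyclic once loops have been merged.

```python
def active_set_sizes(locked, children):
    """Compute AS(B) = L(B) + max(AS(child)) for every node of an acyclic CFG.

    locked:   dict mapping node name to L(B), the locked cache lines it touches
    children: dict mapping node name to a list of successor node names
    """
    memo = {}

    def as_size(node):
        if node not in memo:
            successors = children.get(node, [])
            # A leaf has no successors, so its active set is just L(B).
            memo[node] = locked[node] + max(
                (as_size(child) for child in successors), default=0
            )
        return memo[node]

    return {node: as_size(node) for node in locked}


# Values read from Figure 3; "LOOP" stands for the merged blocks B2..B5.
locked_lines = {"B1": 3, "LOOP": 7, "B6": 4, "B7": 5}
cfg_children = {"B1": ["LOOP", "B6"], "LOOP": [], "B6": ["B7"], "B7": []}

print(active_set_sizes(locked_lines, cfg_children))
```

With these inputs the sketch reproduces AS(B7) = 5, AS(B6) = 9, and AS(B1) = 12 from the figure.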

2.1.2 On-line Phase

Dynamic Slack Reclamation

Dynamic slack arises from variations in task execution time, and it is collected by the OS when a context switch occurs. The dynamic slack reclamation process is similar to the one proposed in [15]. Before detailing dynamic slack reclamation, we define five notations:

$U_i^{CPU}$: the unused CPU budget of τi.

$W_i^{rem}$: the remaining WCET of τi.

$S_i$: the slack time of τi.

$E_i$: the execution time of τi.

DS: the dynamic slack time.

When a task arrives (i.e., is removed from the waiting queue), $U_i^{CPU}$ and $W_i^{rem}$ are initialized to (WCET + static slack) and WCET, respectively. During the execution of τi, $U_i^{CPU}$ is consumed and $W_i^{rem}$ decreases. $W_i^{rem}$ is updated by the cache controller during task execution. Since the wake-up overhead of drowsy cache lines is not included in the WCET, $W_i^{rem}$ is decreased by one at every cycle in which there is no drowsy cache hit. Note that we do not claim the slack time of preempted tasks as in [15]. In our scheme, a preempted task can use its slack to turn its cache lines into the low-leakage mode during the idle period. When τi is preempted or completes, we first consume the dynamic slack (DS) from the unused CPU budget of the tasks in $Q_{waiting}$ with earlier deadlines. Then, we update $U_i^{CPU}$ of task τi. DS is estimated by the following equation:

$$DS = \sum_{\tau_k \in Q_{waiting}} U_k^{CPU}$$

If DS is greater than $E_i$, $U_i^{CPU}$ is not consumed. Otherwise, the CPU budget is updated using the following formula:

$$U_i^{CPU} = U_i^{CPU} - (E_i - DS)$$

Therefore, the slack time that a task can use to compensate for the wake-up overheads is:

$$S_i = (U_i^{CPU} - W_i^{rem}) + DS$$
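The three reclamation formulas can be written out as a minimal Python sketch. The function and parameter names are hypothetical, all quantities are assumed to be cycle counts, and the code follows the formulas literally; in particular it does not model how DS is removed from the waiting tasks' budgets once it has been consumed.

```python
def dynamic_slack(waiting_budgets):
    """DS: sum of the unused CPU budgets of the eligible tasks in the waiting queue."""
    return sum(waiting_budgets)


def update_budget(u_cpu, exec_time, ds):
    """Update the unused CPU budget of a task that was preempted or completed.

    If DS exceeds the execution time E_i, the task's own budget is untouched;
    otherwise the part of E_i not covered by DS is charged to the budget.
    """
    if ds > exec_time:
        return u_cpu
    return u_cpu - (exec_time - ds)


def available_slack(u_cpu, w_rem, ds):
    """S_i = (U_i^CPU - W_i^rem) + DS."""
    return (u_cpu - w_rem) + ds


# Hypothetical numbers, in cycles.
ds = dynamic_slack([30, 20])                                  # 50
budget = update_budget(u_cpu=1200, exec_time=400, ds=ds)      # 1200 - (400 - 50)
print(ds, budget, available_slack(budget, w_rem=700, ds=ds))
```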

Drowsy Window Resizing

Drowsy window resizing decides the smallest drowsy window size for which the timing constraint is met. It is performed when a context switch occurs or when the active set changes. To decide the drowsy window size of the scheduled task, we have to find the smallest drowsy window size whose wake-up overhead is not larger than the task's available slack. Therefore, the drowsy window size is the smallest window size that satisfies the following inequality:

$$\left\lceil W_i^{rem} / wsize \right\rceil \times S_{active(i)} \times OH < S_i \qquad (1)$$

where wsize denotes the window size, $S_{active(i)}$ denotes the WCAS size of task τi, and OH denotes the number of cycles to wake up a drowsy cache line.
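A direct way to apply inequality (1) is to scan window sizes upward and stop at the first one that fits. The Python sketch below does this under stated assumptions: the number of drowsy windows is taken as the ceiling of $W_i^{rem} / wsize$, the search is bounded by a hypothetical `max_window` parameter, and all names and example values are illustrative rather than taken from the report.

```python
def smallest_drowsy_window(w_rem, active_set, wake_cycles, slack, max_window):
    """Return the smallest wsize with ceil(w_rem / wsize) * active_set * wake_cycles < slack.

    Returns None if no window size up to max_window satisfies the inequality,
    in which case a caller might simply leave the cache lines awake.
    """
    for wsize in range(1, max_window + 1):
        windows = -(-w_rem // wsize)  # ceiling division: number of drowsy windows
        worst_case_overhead = windows * active_set * wake_cycles
        if worst_case_overhead < slack:
            return wsize
    return None


# Hypothetical values: 10000 remaining cycles, WCAS size 12,
# 2-cycle wake-up, and 500 cycles of available slack.
print(smallest_drowsy_window(10000, 12, 2, 500, max_window=1000))
```

Because the left-hand side of (1) never increases as the window grows, the first window size that passes is also the smallest feasible one; a binary search would work equally well for large ranges.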

2.2 Leakage Control Scheme for Idle Tasks

For idle tasks, we can turn their cache lines into the state-preserving or the state-destructive mode, depending on the length of the idle period and the slack time. The leakage control for idle tasks is performed by the OS when a context switch occurs. The slack $S_i$ and idle period $I_i$ of a completed or preempted task are given below:

Completed tasks: $S_i = \rho_i$, $I_i = T_{arrive}(\tau_i) - T_{enter\_q}(\tau_i)$

Preempted tasks: $S_i = U_i^{CPU} - W_i^{rem}$, $I_i = BCET(\tau_{curr})$

where $BCET(\tau_{curr})$ is the best case execution time of the current active task, $T_{arrive}(\tau_i)$ is the next arrival time of τi, and $T_{enter\_q}(\tau_i)$ is the time at which τi enters the waiting queue.

To decide the leakage mode of an idle task, we need to evaluate the performance overhead $P_{overhead}(M_i)$ and the energy overhead $E_{overhead}(M_i)$ of a low-leakage mode $M_i$, where $M_i$ is either the drowsy or the state-destructive mode. $P_{overhead}(M_i)$ and $E_{overhead}(M_i)$ are:

$$P_{overhead}(M_i) = N_{wake} \times D_{wake}(M_i)$$

$$E_{overhead}(M_i) = N_{wake} \times E_{wake}(M_i)$$

where $N_{wake}$ denotes the number of times cache lines in the low-leakage mode are woken up, and $D_{wake}(M_i)$ and $E_{wake}(M_i)$ denote the delay and energy overhead of waking up cache lines in low-leakage mode $M_i$. For the state-preserving mode, the wake-up delay is 2 cycles because both the tag and data arrays are put into the drowsy mode, and the wake-up energy is the energy required to charge a drowsy cache line from the drowsy state to the active state. For the state-destructive mode, the wake-up overhead is the latency and energy of accessing the next level of the memory hierarchy. To turn an idle task's cache lines into a low-leakage mode $M_i$, the task must have

(1) $P_{overhead}(M_i) \le S_{idle}$, and

(2) $E_{overhead}(M_i) \le E_{leak\ reduction}(M_i)$

where $E_{leak\ reduction}(M_i)$ denotes the leakage reduction obtained by applying low-leakage mode $M_i$, and is derived from the following formula:

$$E_{leak\ reduction}(M_i) = (E_{leak\ active}(M_i) - E_{leak\ low}(M_i)) \times I_{idle} - E_{overhead}(M_i)$$

where $E_{leak\ active}(M_i)$ and $E_{leak\ low}(M_i)$ denote the leakage energy of cache lines in the active mode and in low-leakage mode $M_i$, respectively, and $I_{idle}$ is the length of the idle period of the idle task.

To determine the leakage mode of idle tasks, we evaluate the performance overhead and leakage reduction achieved by both the gated-Vdd and drowsy cache circuits. We choose the low-leakage mode with the most leakage reduction while meeting the timing constraint as the leakage mode of an idle task.
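The mode selection for an idle task can be sketched as follows. This is a minimal illustration of the two conditions and the leakage-reduction formula, not the report's implementation: the `Mode` record, the function name, and every numeric value are hypothetical, and leakage is modelled as energy per cycle so that multiplying by the idle length gives energy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mode:
    name: str
    wake_delay: int      # D_wake: cycles per wake-up
    wake_energy: float   # E_wake: energy per wake-up
    leak_low: float      # leakage energy per cycle in this low-leakage mode


def choose_idle_mode(modes, n_wake, slack, idle_len, leak_active):
    """Pick the feasible mode with the largest leakage reduction, or None."""
    best_name, best_reduction = None, 0.0
    for mode in modes:
        p_overhead = n_wake * mode.wake_delay
        e_overhead = n_wake * mode.wake_energy
        reduction = (leak_active - mode.leak_low) * idle_len - e_overhead
        # Conditions (1) and (2) from the text.
        feasible = p_overhead <= slack and e_overhead <= reduction
        if feasible and reduction > best_reduction:
            best_name, best_reduction = mode.name, reduction
    return best_name


# Hypothetical parameters for the two circuit techniques.
MODES = [
    Mode("drowsy", wake_delay=2, wake_energy=5.0, leak_low=0.25),
    Mode("gated-Vdd", wake_delay=8, wake_energy=50.0, leak_low=0.02),
]
print(choose_idle_mode(MODES, n_wake=100, slack=1000, idle_len=100000, leak_active=1.0))
```

With a long idle period and enough slack, the state-destructive mode wins in this example; shrinking the slack or the idle period pushes the choice back to the drowsy mode or to no low-leakage mode at all.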

3. Experimental Results

For cache leakage evaluation, we use the HotLeakage tool set [23]. HotLeakage is developed on top of the Wattch [3] tool set. HotLeakage explicitly models the effects of temperature, voltage, and parameter variations, and can recalculate leakage currents dynamically as temperature and voltage change at runtime due to operating conditions. To simulate multi-tasking workloads, we modified HotLeakage to allow multiple programs to execute simultaneously. We also implemented the EDF scheduler. In our experiments, cache locking is performed on the L1 I-cache. Since we also put the tags into the drowsy mode, the performance overhead of accessing a drowsy line is set to 2 cycles according to [16].

Table 1. Simulated architecture parameters.

Processor core
Instruction window: 16-RUU, 16-LSQ
Issue width: 1 instruction per cycle, in-order issue
Functional units: 4 IntALU, 1 IntMult/Div, 1 FPALU, 1 FPMult/Div

Memory hierarchy
L1 I-cache: 8KB, 2-way, 16B block size
L2 cache: 32KB, 4-way, 32B block size, 8-cycle access latency
Memory: 8-cycle access latency

Energy parameters
Processor technology: 0.07 µm (70 nm)
Supply voltage: 0.9V
Temperature: 353 K

Table 2. Task set characterization.

Name | Description | Code size (bytes) | WCET (cycles)

Small task set (total code size 7608 bytes)
Jfdctint | JPEG integer implementation of the forward DCT | 3296 | 19087
Crc | Cyclic redundancy code example program | 1400 | 142088
Ludcmp | Linear equations by LU decomposition | 2336 | 16607
Matmult | Matrix multiplication | 576 | 12555

Medium task set (total code size 9192 bytes)
Qurt | Computation of roots of quadratic equations | 1200 | 4038
Minver | Matrix inversion | 3656 | 11281
Jfdctint | JPEG integer implementation of the forward DCT | 3296 | 18969
Fft1 | FFT, Cooley-Tukey algorithm | 1040 | 8685

The detailed processor and memory hierarchy parameters are shown in Table 1. We implement two leakage control mechanisms: the Drowsy+Simple scheme proposed in [7] and the proposed timing-aware leakage control scheme (TALC). For the Drowsy+Simple scheme, we determined the drowsy window size through exhaustive simulations and chose the one that is best on average, 1000 cycles [7]. The cache lines allocated to idle tasks are turned into the drowsy mode immediately when a context switch occurs. The benchmarks used in this work are from the SNU real-time benchmark suite [1]. The benchmark programs are C sources collected from numerical calculation programs and DSP algorithms. We mix multiple applications together to form two multi-tasking workloads, the small task set and the medium task set. Details of the workloads are listed in Table 2. The WCET of each task is measured with cache locking. To generate varying execution times, we use a method similar to [8]. We assume the BCET of a task is a percentage of its WCET. In our experiments, the BCET/WCET ratio is set to 0.95. The execution time of each task instance is drawn from a normal distribution with mean μ = (WCET + BCET)/2 and standard deviation σ = (WCET − BCET)/6. A task instance is forced to terminate once its execution time has expired.
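The execution-time model described here can be sketched in Python. The function name and the `rng` parameter are hypothetical, and capping each sample at the WCET is one reading of forcing a task instance to terminate when its time expires; the report does not say how the generator in [8] was actually implemented.

```python
import random


def sample_execution_time(wcet, bcet_ratio=0.95, rng=random):
    """Draw one task-instance execution time, in cycles.

    Mean and standard deviation follow the report:
    mu = (WCET + BCET) / 2 and sigma = (WCET - BCET) / 6,
    so that BCET and WCET sit three standard deviations from the mean.
    """
    bcet = bcet_ratio * wcet
    mu = (wcet + bcet) / 2
    sigma = (wcet - bcet) / 6
    # Assumption: a sample beyond the WCET is cut off at the WCET.
    return min(rng.gauss(mu, sigma), wcet)


# Example with the Jfdctint WCET from Table 2 and a fixed seed for repeatability.
generator = random.Random(0)
times = [sample_execution_time(19087, rng=generator) for _ in range(5)]
print(times)
```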

[Figure 4: deadline miss ratio (0% to 100%) versus static slack (1% to 5%) for the small task set and the medium task set.]

Figure 4. Deadline miss ratio of Drowsy+Simple.

We first show the deadline miss ratio of the Drowsy+Simple scheme to demonstrate the importance of designing a timing-aware leakage control algorithm. We adjust the period of each task to achieve 1%, 2%, 3%, 4%, and 5% static slack. Figure 4 shows the ratio of tasks missing deadlines for the different amounts of static slack. For the small task set, the miss ratio is 86.3% and 0.4% when the static slack is 1% and 2%, respectively. For the medium task set, the miss ratio is up to 97.9% and 95.6% when the static slack is 1% and 2%, respectively. Drowsy+Simple has a higher miss ratio in the medium task set than in the small task set. The medium task set has a larger total code size and more instructions locked in the cache than the small task set. Therefore, the Drowsy+Simple scheme incurs more performance degradation in the medium task set than in the small task set. Although Drowsy+Simple misses deadlines only in the cases with a tight schedule, this is still not acceptable for a hard real-time system, which must always meet the timing constraint. This confirms our assertion that existing leakage reduction techniques are not suitable for hard real-time applications. Our timing-aware leakage control algorithm is guaranteed to meet the timing constraints, and its miss ratio is zero in all cases.

[Figure 5: leakage savings (0% to 100%) versus static slack (1%, 5%, 10%, 15%, 20%) for Drowsy+Simple and TALC on the small and medium task sets.]

Figure 5. Evaluation of leakage reduction.

Figure 5 compares the energy savings achieved by our TALC scheme and by the Drowsy+Simple mechanism with 1%, 5%, 10%, 15%, and 20% static slack. Note that for a fair comparison, in this set of experiments the TALC scheme turns the cache lines of idle tasks into the drowsy mode only. We show the experimental results for the small and medium task sets separately. When the static slack is 1%, Drowsy+Simple has 86.3% and 97.6% of tasks missing their deadlines with the small and the medium task set, respectively; to satisfy the timing constraint, the TALC scheme achieves less energy savings than Drowsy+Simple. From Figure 5, we also observe that TALC achieves less leakage reduction with the medium task set than with the small task set. Since TALC assumes the worst case active set for drowsy window resizing, it can overestimate the wake-up delay. For the medium task set, the overestimation is more serious than for the small one, since the medium task set has a larger code size and a longer worst-case execution path. A more precise active set analysis could help alleviate this problem. We leave this as future work. As slack time increases, the energy savings achieved by TALC approach those of Drowsy+Simple. With 20% static slack, the proposed scheme has 1.1% and 1.3% more leakage savings than Drowsy+Simple for the small and medium task sets, respectively. This energy advantage of TALC over Drowsy+Simple comes from run-time drowsy window resizing. With 20% static slack for the small task set, the window size ranges from 13 cycles to 979 cycles, while Drowsy+Simple fixes the window size at 1000 cycles.

Table 3. Leakage savings of TALC-drowsy and TALC-dual.

Static slack | TALC-drowsy | TALC-dual | Difference
20% | 90.9% | 93.3% | 2.4%
30% | 91.7% | 94.2% | 2.5%
40% | 92.9% | 95.6% | 2.7%
50% | 93.9% | 96.6% | 2.7%
60% | 94.2% | 97.0% | 2.8%

To evaluate the effect of turning off the cache lines of idle tasks completely, we create a new task set that has idle periods long enough for the state-destructive mode. To lengthen the idle period, we can increase both static and dynamic slack. To increase static slack, we set 20%, 30%, 40%, 50%, and 60% static slack in this set of experiments. To increase dynamic slack, we prolong a task's WCET by increasing the number of iterations executed by the task's major subroutines on the worst-case execution path. The BCET/WCET ratio remains 0.95 as in the original setup, so a task's dynamic slack increases as its WCET is prolonged. The experimental results for this new task set are shown in Table 3. In Table 3, TALC-drowsy denotes the TALC scheme with the drowsy mode only, and TALC-dual denotes the TALC scheme with both the drowsy mode and the state-destructive mode. The results show that turning off the cache lines of an idle task completely achieves up to 2.8% more leakage savings than TALC-drowsy.

IV. Conclusion

In this project, we propose a timing-aware cache leakage control scheme for hard real-time systems. The basic idea of the proposed algorithm is to let the performance overhead of activating drowsy cache lines consume system slack. The proposed scheme manages the cache lines of active and idle tasks differently. The objective of the proposed leakage management method is to determine the drowsy window size for the active task and the leakage mode for the idle tasks, provided that the timing constraints are not violated. Experimental results show that, although our scheme achieves less leakage savings than Drowsy+Simple under a tight schedule, our scheme guarantees that the timing constraint is met in all cases, while Drowsy+Simple has tasks that miss deadlines. As task slack increases, the gap between the leakage savings of our scheme and those of Drowsy+Simple narrows. With 20% static slack, our scheme even achieves 1.3% more leakage savings than Drowsy+Simple. This energy advantage comes from run-time drowsy window resizing. With a task set that has opportunities to put the cache lines of idle tasks into the state-destructive mode, the proposed scheme achieves 2.8% more leakage savings than the proposed scheme with the drowsy mode only.

This research work has been submitted to the 2007 International Conference on Compilers, Architecture, and Synthesis for Embedded Systems (CASES 2007).

V. References

1. SNU real-time benchmarks. http://archi.snu.ac.kr/realtime/benchmark/index.html.

2. ARM946E-S. http://www.samsung.com/products/semiconductor/asic/ipcorelibrary/intellectureproperties/processorcores/armcores/ddi0201 a946es.pdf.

3. D. Brooks, V. Tiwari, and M. Martonosi. Wattch: A framework for architectural-level power analysis and optimizations. In Proceedings of the 27th Annual International Symposium on Computer Architecture (ISCA '00), 2000.

4. J.-J. Chen, H.-R. Hsu, and T.-W. Kuo. Leakage-aware energy-efficient scheduling of real-time tasks in multiprocessor systems. In Proceedings of the 12th IEEE Real-Time and Embedded Technology and Applications Symposium (RTAS '06), 2006.

5. J.-J. Chen and T.-W. Kuo. Procrastination for leakage-aware rate-monotonic scheduling on a dynamic voltage scaling processor. In Proceedings of the Conference on Languages, Compilers, and Tools for Embedded Systems (LCTES '06), 2006.

6. ARM Cortex-R4F. http://www.arm.com/pdfs/cortex-r4f

7. K. Flautner, N. S. Kim, S. Martin, D. Blaauw, and T. Mudge. Drowsy caches: Simple techniques for reducing leakage power. In Proceedings of the 29th Annual International Symposium on Computer Architecture (ISCA '02), 2002.

8. R. Jejurikar and R. Gupta. Integrating preemption threshold scheduling and dynamic voltage scaling for energy efficient real-time systems. In RTCSA, 2004.

9. R. Jejurikar and R. Gupta. Dynamic slack reclamation with procrastination scheduling in real-time embedded systems. In Proceedings of the 42nd Annual Conference on Design Automation, 2005.

10. R. Jejurikar, C. Pereira, and R. Gupta. Leakage aware dynamic voltage scaling for real-time embedded systems. In Proceedings of the 41st Design Automation Conference (DAC '04), 2004.

11. P. Juang, K. Skadron, M. Martonosi, Z. Hu, D. W. Clark, P. W. Diodato, and S. Kaxiras. Implementing branch-predictor decay using quasi-static memory cells. ACM Transactions on Architecture and Code Optimization (TACO), 1.

12. S. Kaxiras, Z. Hu, and M. Martonosi. Cache decay: Exploiting generational behavior to reduce cache leakage power. In Proceedings of the 28th Annual International Symposium on Computer Architecture (ISCA '01), 2001.

13. N. S. Kim, T. Austin, D. Blaauw, T. Mudge, K. Flautner, J. S. Hu, M. J. Irwin, M. Kandemir, and V. Narayanan. Leakage current: Moore's law meets static power. IEEE Computer, 36.

14. N. S. Kim, K. Flautner, D. Blaauw, and T. Mudge. Drowsy instruction caches: Leakage power reduction using dynamic voltage scaling and cache sub-bank prediction. In Micro-35, 2002.

15. W. Kim, J. Kim, and S. Min. A dynamic voltage scaling algorithm for dynamic-priority hard real-time systems using slack time analysis. In Proceedings of the Conference on Design, Automation and Test in Europe (DATE '02), 2002.

16. Y. Li, D. Parikh, and Y. Zhang. State-preserving vs. non-state-preserving leakage control in caches. In Proceedings of the Design, Automation and Test in Europe Conference and Exhibition, 2004.

17. S. Martin, K. Flautner, T. Mudge, and D. Blaauw. Combined dynamic voltage scaling and adaptive body biasing for lower power microprocessor under dynamic workloads. In ICCAD, 2002.

18. L. Niu and G. Quan. Reducing both dynamic and leakage energy consumption for hard real-time systems. In Proceedings of the 2004 International Conference on Compilers, Architecture, and Synthesis for Embedded Systems (CASES '04), 2004.

19. M. Powell, S.-H. Yang, B. Falsafi, K. Roy, and T. N. Vijaykumar. Gated-Vdd: A circuit technique to reduce leakage in deep-submicron cache memories. In Proceedings of the 2000 International Symposium on Low Power Electronics and Design (ISLPED '00), 2000.

20. I. Puaut and D. Decotigny. Low-complexity algorithms for static cache locking in multitasking hard real-time systems. In Proceedings of the 23rd IEEE Real-Time Systems Symposium (RTSS '02), 2002.

21. S.-H. Yang, B. Falsafi, M. D. Powell, K. Roy, and T. N. Vijaykumar. An integrated circuit/architecture approach to reducing leakage in deep-submicron high-performance I-caches. In Proceedings of the 7th International Symposium on High-Performance Computer Architecture (HPCA '01), 2001.

22. W. Zhang and J. S. Hu. Compiler-directed instruction cache leakage optimization. In Proceedings of the 35th Annual International Symposium on Microarchitecture (MICRO '02), 2002.

23. ... and M. Stan. HotLeakage: A temperature-aware model of subthreshold and gate leakage for architects.

(14)

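Report on Attendance at an International Academic Conference

Reporter: 楊佳玲 (Chia-Lin Yang)
Affiliation and title: Associate Professor, Department of Computer Science and Information Engineering, National Taiwan University
Conference dates and venue: April 16-20, 2007, Nice, France

Conference: Design, Automation & Test in Europe

Paper presented: Energy-Efficient Real-Time Task Scheduling with Task Rejection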

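I. Conference Participation

The 2007 Design, Automation and Test in Europe conference (DATE) was held in Nice, France, from April 16 to April 20, 2007. DATE is the largest EDA (Electronic Design Automation) conference in Europe and drew a large number of attendees; this year's program comprised 11 sessions and 5 tutorials. I presented the paper “Energy-Efficient Real-Time Task Scheduling with Task Rejection” in Session 11: Real-Time Methodology. The paper was well received, and after the presentation I held in-depth discussions with several researchers. Besides presenting the paper, I exchanged views with other researchers at the conference and benefited greatly.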

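II. Observations from the Conference

1. System-level design remains a major topic in SoC design methodology research. Both keynotes at the conference pointed out its importance.

2. Process variation is an important design factor in nano-scale technology. Several papers at the conference discussed the impact of process variation on architectural design.

3. System-wide power management is, for future ubiquitous communication devices,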


III. Materials Brought Back

One CD-ROM of the DATE 2007 conference proceedings.

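IV. Concluding Remarks

I am very grateful to the National Science Council for the funding that made this trip possible. It also gave us the opportunity to exchange views with overseas researchers in the same field on developments and research in low power embedded system design.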


Energy-Efficient Real-Time Task Scheduling with Task Rejection∗

Jian-Jia Chen, Tei-Wei Kuo, Chia-Lin Yang
Department of Computer Science and Information Engineering
National Taiwan University, Taipei, Taiwan.
Email: {r90079, ktw, yangc}@csie.ntu.edu.tw

Ku-Jei King
xSeries Development, IBM Systems Technology Group (STG)
Email: [email protected]

Abstract

In the past decade, energy efficiency has been an important system design issue in both hardware and software management. For mobile applications with critical missions, system engineers have to provide both energy consumption reduction and timing guarantees to extend the operation duration and maintain system stability. This research explores real-time systems composed of homogeneous multiple processors with the capability of dynamic voltage scaling (DVS), in which a given task can be rejected with a specified value of rejection penalty. The objective is to minimize the summation of the total rejection penalty for the tasks that are not completed in time and the energy consumption of the system. This study provides analysis to show that there does not exist any polynomial-time approximation algorithm for the studied problem, unless $\mathcal{P} = \mathcal{NP}$. Moreover, we propose algorithms for systems with ideal and non-ideal DVS processors. The capability of the proposed algorithms is demonstrated with extensive evaluations. The evaluation results reveal that our proposed algorithms could derive effective solutions of the energy-efficient scheduling problem with task rejection considerations.

Keywords: Energy-Efficient Scheduling, Task Rejection, Real-Time Task Scheduling.

1. Introduction

Along with the low-power demands in electronic circuit designs, a modern processor can now operate at different supply voltages to balance its power consumption and performance. Different supply voltages lead to different execution speeds on a dynamic voltage scaling (DVS) processor. Well-known DVS processors for embedded systems are the Intel StrongARM SA1100 processor [17] and the Intel XScale [18]. Moreover, technologies such as Intel SpeedStep® and AMD PowerNOW!™ provide dynamic voltage scaling for laptops to prolong the battery lifetime.

In the past decade, energy-efficient designs have received a lot of attention in industry and academia. For systems with real-time demands, energy-efficient task scheduling has been studied to minimize the energy consumption with timing guarantees, especially for uniprocessor systems with DVS support. Due to the convexity of the power consumption function, implementations in multiprocessor systems are often more energy-efficient [2]. Moreover, since many chip makers, such as Intel and AMD, are releasing multi-core chips, multiprocessor energy-efficient scheduling is becoming more and more important. Various heuristics were proposed for energy consumption minimization under different task models in multiprocessor environments, e.g., [1, 4–7, 15, 19] for independent real-time tasks and [9, 20] for real-time tasks with precedence constraints.

As leakage power consumption grows with advancing process technology, researchers have started exploring energy-efficient scheduling that takes the non-negligible power consumption of leakage current into account [12]. For uniprocessor scheduling, Irani et al. [10] proposed approximation algorithms for aperiodic real-time tasks. For periodic real-time tasks in uniprocessor systems, Jejurikar et al. [12], Lee et al. [14], and Chen et al. [8] provided scheduling algorithms with task procrastination to decide when to turn the processor into a dormant mode. Moreover, Chen et al. [6] developed approximation algorithms for multiprocessor leakage-aware scheduling.

∗ Supported in part by research grants from ROC National Science Council NSC-95-2752-E-002-008-PAE, Aim for Top University Plan 95R0062-A100-07, and IBM Faculty Award.

However, most studies on energy-efficient real-time task scheduling do not take task rejection into consideration. Most heuristics for multiprocessor energy-efficient scheduling cannot guarantee the schedulability of the derived schedules. Chen et al. [6] applied the constraint violation approach to augment the highest available speed by a factor of $\frac{4}{3}$. However, resource augmentation might not be possible since it is hardware-dependent. Hence, some tasks might be rejected to guarantee the schedulability of the selected tasks.

This research explores systems with the possibility to reject a task for execution with a specified cost (penalty). If a task is more important than another, its rejection penalty should be specified with a greater value. We consider a homogeneous multiprocessor system with continuously available speeds or discretely available speeds. The objective is to minimize the summation of the total rejection cost for the tasks that are not completed in time and the energy consumption of the system. The contribution of this paper is twofold. Firstly, we show the $\mathcal{NP}$-hardness of the studied problem, and provide analysis on the non-existence of polynomial-time approximation algorithms, provided that $\mathcal{P} \neq \mathcal{NP}$. Secondly, we propose a branch-and-bound approach and heuristic algorithms. The proposed algorithms are evaluated by extensive experiments. The evaluation results reveal that our proposed algorithms could derive effective solutions of the energy-efficient scheduling problem with task rejection considerations.

The rest of this paper is organized as follows: Section 2 defines the energy-efficient task scheduling problem with task rejection and provides the hardness analysis. Section 3 presents our algorithms. Experimental results for the performance evaluation of the proposed algorithms are presented in Section 4. Section 5 is the conclusion.

2. Problem Definition and Hardness Analysis

Processor models. This paper explores energy-efficient scheduling on $M$ homogeneous DVS multiprocessors, where the power consumption function of each task is the same on every processor. The power consumption function $P(s)$ of the adopted processor speed $s$ on a DVS processor can be divided into two parts, $P_d(s)$ and $P_{ind}$, in which $P_d(s)$ is dependent ($P_{ind}$ is independent, respectively) upon the processor speed $s$ [21]. The speed-dependent power consumption function is mainly contributed by the dynamic power consumption resulting from the charging or discharging of CMOS gates and the short-circuit power consumption, while the leakage power consumption contributes the majority of the speed-independent power consumption. The algorithms proposed in this paper can be adopted with many power consumption function formulations, such as those in [16, §5.5]. We consider systems with $P_d(s)$ as a convex and increasing function, e.g., $P_d(s) \propto s^{\alpha}$ for any $\alpha > 1$.

The number of CPU cycles executed in a time interval is linear in the processor speed. That is, the number of CPU cycles completed in time interval $(t_1, t_2]$ is $\int_{t_1}^{t_2} s(t)\,dt$, where $s(t)$ is the processor speed at time $t$. The energy consumed in $(t_1, t_2]$ is $\int_{t_1}^{t_2} P(s(t))\,dt$.
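As a concrete illustration of the two integrals above, the following sketch evaluates them for a piecewise-constant speed profile, where they reduce to sums. The power model $P(s) = s^3 + 0.1$ and the helper names `power` and `cycles_and_energy` are assumptions made for this example only; they are not taken from the paper.

```python
def power(s, alpha=3.0, p_ind=0.1):
    """Assumed power model: P(s) = s**alpha + p_ind (illustrative only)."""
    return s ** alpha + p_ind


def cycles_and_energy(segments):
    """segments: list of (duration, speed) pairs, speed constant per segment.

    For a piecewise-constant s(t) the integrals reduce to sums:
    cycles = sum(duration * speed), energy = sum(duration * P(speed)).
    """
    cycles = sum(d * s for d, s in segments)
    energy = sum(d * power(s) for d, s in segments)
    return cycles, energy


# The same 10 cycles in 20 time units: constant speed vs. fast-then-slow.
flat = cycles_and_energy([(20.0, 0.5)])
uneven = cycles_and_energy([(10.0, 0.8), (10.0, 0.2)])
```

Both profiles complete the same number of cycles, but the constant-speed profile consumes less energy, which is what the convexity of $P_d(s)$ implies.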

We first target ideal processors, in which a processor may operate at any speed in $[S_{min}, S_{max}]$. We also show the extension to cope with non-ideal processors with discrete speeds. For non-ideal processors, there are $H$ available speeds indexed by $s_1, s_2, \ldots, s_H$ in increasing order. For non-ideal processors, for brevity, $s_{H+1}$ and $P(s_{H+1})$ are both assumed to be $\infty$, $S_{min}$ is $s_1$, and $S_{max}$ is $s_H$.

When needed, turning the processor into a dormant mode (or turning the processor off) might further reduce the energy consumption. However, turning off or waking up a processor takes time and has energy overheads. For processors with non-negligible overheads to be turned off, the overheads could be treated as part of the overheads to turn on the processor [6, 10]. We denote $E_{sw}$ ($t_{sw}$, respectively) as the energy (the time, respectively) requirement of the switching overheads for the whole process of turning off the processor and then turning on the processor.

Task models. Tasks considered in this paper are periodic and independent in execution. A periodic task is an infinite sequence of task instances, referred to as jobs, where each job of a task comes in a regular period. Each task $\tau_i$ is associated with its initial arrival time (denoted as $a_i$), its computation requirement in CPU cycles (denoted as $c_i$), and its period (denoted as $p_i$). The relative deadline of each task $\tau_i$ is equal to its period $p_i$. That is, the arrival time and deadline of the $j$-th job of task $\tau_i$ are $a_i + (j-1) \cdot p_i$ and $a_i + j \cdot p_i$, respectively. We assume that all the tasks arrive at time 0, but extensions can be achieved easily for tasks with different arrival times. Given a task set $T$, the hyper-period of $T$, denoted by $L$, is defined as the minimum $L$ so that $L/p_i$ is an integer for any task $\tau_i$ in $T$. For example, $L$ is the least common multiple (LCM) of the periods of tasks in $T$ when the periods of tasks are all integers. Without loss of generality, we only consider tasks $\tau_i$ with $\frac{c_i}{p_i} \le S_{max}$, since it is not possible to complete any task $\tau_j$ with $\frac{c_j}{p_j} > S_{max}$ in time.
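The task model above can be sketched in a few lines. This is a minimal illustration assuming integer periods; the function names `hyper_period`, `job_window`, and `completable` are hypothetical and chosen only for this example.

```python
from math import lcm  # available in Python 3.9+


def hyper_period(periods):
    """Hyper-period L: least common multiple of the integer periods."""
    return lcm(*periods)


def job_window(a_i, p_i, j):
    """Arrival time and absolute deadline of the j-th job (j >= 1)."""
    return a_i + (j - 1) * p_i, a_i + j * p_i


def completable(tasks, s_max):
    """Keep only tasks (c_i, p_i) with c_i / p_i <= S_max."""
    return [(c, p) for c, p in tasks if c / p <= s_max]
```

For example, periods 4, 6, and 10 give a hyper-period of 60, and the third job of a task with $a_i = 0$ and $p_i = 5$ arrives at time 10 with deadline 15.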

This research explores systems with the possibility to reject a task for execution with a specified cost (penalty) provided by system designers. If a task is more important than another, its rejection cost should be specified with a greater value. If a task instance of task $\tau_i$ is not completed in time, the system receives a penalty of $\chi_i$, where $\chi_i > 0$. (If a task can be rejected without penalty, we can reject the task directly.) If a task is very important and cannot be rejected, its rejection cost should be specified as $\infty$. If the rejection costs of all the tasks are infinite, all the tasks must be completed in time.

Problem definition. This paper explores the problem of minimizing the energy consumption of the system and the rejection cost at the same time. We pursue the objective on the linear combination of the energy consumption and the rejection cost, i.e., $(1-\alpha)E + \alpha\Pi$, where $\alpha$ is a non-negative factor no more than 1 specified by the system designer, $E$ is the energy consumption of the system in the hyper-period, and $\Pi$ is the total rejection penalty of the task instances missing their deadlines in the hyper-period. If energy consumption minimization is more important than task rejection penalty minimization, $\alpha$ should be specified close to 0, and vice versa.

For notational brevity, we normalize the rejection penalty of task $\tau_i$ as $\alpha\chi_i$, the power consumption function $P()$ as $(1-\alpha)P()$, and the energy switching overheads as $(1-\alpha)E_{sw}$. Hence, the objective of the linear combination can be treated as the summation of the (normalized) penalty and the (normalized) energy consumption.
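As a small worked illustration of the normalization, take hypothetical values chosen only for this example, $\alpha = 0.25$, $E = 8$, and $\Pi = 4$:

```latex
\[
(1-\alpha)E + \alpha\Pi = 0.75 \cdot 8 + 0.25 \cdot 4 = 6 + 1 = 7 .
\]
```

Scaling every $\chi_i$ by $\alpha$, and $P()$ and $E_{sw}$ by $(1-\alpha)$, beforehand yields the same value as a plain sum of the normalized energy (6) and the normalized penalty (1).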

The problem explored in this paper is defined as follows:

DEFINITION 1. Energy-eFFicient schEduling with rejeCting Tasks (EFFECT): Consider a task set $T$ of $N$ independent tasks over $M$ identical processors with a common power consumption function $P(s)$. Each periodic task $\tau_i \in T$ arrives at time 0 and is associated with a computation requirement of $c_i$ CPU cycles, a rejection cost (penalty) $\chi_i$, and a period $p_i$, where the relative deadline of task $\tau_i$ is $p_i$. The energy consumption and timing of the switching overheads are $E_{sw}$ and $t_{sw}$, respectively. The problem is to derive a schedule of $T$ to minimize the summation of the penalty (cost) of the task instances that miss their deadlines and the energy consumption of the system in the hyper-period $L$ of tasks in $T$, in which a job of task $\tau_i$ is executed entirely on a processor.

For brevity, for the rest of this paper, the objective function of the EFFECT problem is referred to as the energy-penalty (EP).

Hardness analysis. Since most previous studies on multiprocessor energy-efficient scheduling did not take task rejection penalty into consideration, the schedulability of the derived schedules cannot be guaranteed, e.g., [4, 9]. As shown in [6], it is $\mathcal{NP}$-hard to derive a schedule with the minimum energy consumption to complete all the tasks in time without rejecting any real-time task. The following lemma shows that the EFFECT problem is still $\mathcal{NP}$-hard even if we have the flexibility to reject some tasks for execution.

LEMMA 1. The EFFECT problem is $\mathcal{NP}$-hard in a strong sense even when $E_{sw}$ is 0 and all the tasks have the same rejection penalty.

Proof. It can be proved by a reduction from the leakage-aware multiprocessor energy-efficient rejection problem [6] with the same period $p$. The rejection cost of each task is a constant greater than $P(S_{max}) \cdot p$. The details are omitted due to space limitation.

Due to the $\mathcal{NP}$-hardness of the EFFECT problem, polynomial-time approximation algorithms might be pursued to provide approximated solutions with worst-case guarantees. A polynomial-time $\beta$-approximation algorithm for the EFFECT problem must have a time complexity polynomial in the input size and must derive, for any input instance, a solution with an objective value at most $\beta$ times that of an optimal solution. However, in addition to the $\mathcal{NP}$-hardness of the EFFECT problem, the following theorem shows the hardness on the approximability of polynomial-time algorithms.

THEOREM 1. There does not exist any polynomial-time approximation algorithm for the EFFECT problem unless $\mathcal{P} = \mathcal{NP}$.

Proof. This theorem can be proved by a gap reduction from the $\mathcal{NP}$-complete PARTITION problem: Given a set of $N$ non-negative numbers, denoted by $o_1, o_2, \ldots, o_N$, the PARTITION problem is to answer whether there is a partition of these $N$ numbers into two sets, so that the sum of the numbers in each set is the same. Suppose for contradiction that there is a polynomial-time $(1+\epsilon)$-approximation algorithm, denoted by Algorithm $A$, with $\epsilon > 0$ for the EFFECT problem. We will show that we can use Algorithm $A$ to answer the PARTITION problem in polynomial time, which contradicts the assumption that $\mathcal{P} \neq \mathcal{NP}$.

To solve the PARTITION problem by applying Algorithm $A$, we have to create an input instance for the EFFECT problem. For each number $o_i$, a unique task $\tau_i$ is created with $c_i$ as $o_i$, $p_i$ as $\frac{\sum_{j=1}^{N} o_j}{2}$, and $\chi_i$ as $(1+\epsilon)\left(\sum_{j=1}^{N} o_j\right)$, where $P(s) = s^3$ and $E_{sw} = 0$. Moreover, $S_{max}$ is 1, and $S_{min}$ is no more than 1. If the input instance of the PARTITION problem admits a positive answer, the optimal solution for the constructed input instance is $\sum_{j=1}^{N} o_j$. By the construction, there exists no feasible solution with EP more than $\sum_{j=1}^{N} o_j$ and no more than $(1+\epsilon)\sum_{j=1}^{N} o_j$. Since Algorithm $A$ is a $(1+\epsilon)$-approximation algorithm, Algorithm $A$ is guaranteed to derive a solution whose EP is $\sum_{j=1}^{N} o_j$. If the input instance of the PARTITION problem does not admit a positive answer, the EP of the solution returned by Algorithm $A$ must be greater than $\sum_{j=1}^{N} o_j$.

Since the construction of the input instance of the EFFECT problem takes $O(N)$ time, and Algorithm $A$ has polynomial-time complexity, we can determine whether an input instance of the PARTITION problem admits a positive answer in polynomial time by verifying the solution of Algorithm $A$, which is a contradiction.
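The instance construction used in the proof can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the numbers are taken to be non-negative integers, and the sketch assumes the constructed instance is meant for $M = 2$ processors (the two sets of the partition), which is an interpretation rather than something stated in the text above.

```python
from fractions import Fraction


def effect_instance_from_partition(numbers, eps):
    """Build the EFFECT instance of the gap reduction.

    Returns (tasks, s_max), each task being (c_i, p_i, chi_i).
    Assumption of this sketch: M = 2, P(s) = s**3 and E_sw = 0.
    """
    total = sum(numbers)
    period = Fraction(total, 2)            # p_i = (sum of o_j) / 2
    penalty = (1 + Fraction(eps)) * total  # chi_i = (1 + eps) * sum of o_j
    tasks = [(o, period, penalty) for o in numbers]
    return tasks, 1


def has_partition(numbers):
    """Subset-sum check for PARTITION (small integer examples only)."""
    total = sum(numbers)
    if total % 2:
        return False
    reachable = {0}
    for o in numbers:
        reachable |= {r + o for r in reachable}
    return total // 2 in reachable
```

When a partition exists, both processors run at speed 1 for the whole period, so under the $M = 2$ assumption the energy is $2 \cdot 1^3 \cdot p_i = \sum_j o_j$, while rejecting any single task already costs $(1+\epsilon)\sum_j o_j$.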

3. Our Algorithms

By Theorem 1, it is impossible to derive optimal solutions or approximated solutions with worst-case guarantee for the EFFECT problem in polynomial time, unless $\mathcal{P} = \mathcal{NP}$. This section provides a branch-and-bound approach and heuristics to derive solutions. We first partition tasks into $M + 1$ task sets, denoted by $T_1, T_2, \ldots, T_M, T_{M+1}$, so that the tasks in task set $T_m$ are executed on the $m$-th processor for $m \le M$ and the tasks in $T_{M+1}$ are rejected. The off-line derivation is obtained by assuming negligible switching overheads. Whether a rejected task instance determined in the off-line phase can be executed for performance improvement is decided in an on-line fashion.

If a task has a high computation requirement but a low rejection penalty, it should be a good candidate to be rejected to reduce the EP, and vice versa. For the rest of this section, tasks are sorted non-increasingly according to $\frac{\chi_i}{c_i}$. We will consider the execution or rejection of tasks in the sorted order. Moreover, throughout this section, the earliest-deadline-first (EDF) schedule will be applied for task scheduling on each processor. By [3], a task set $T_m$ is schedulable on a processor if and only if $\sum_{\tau_i \in T_m} \frac{c_i}{p_i} \le S_{max}$.
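The sorting rule and the EDF schedulability condition can be sketched as below. The tuple layout `(c_i, p_i, chi_i)` and the function names are assumptions of this example.

```python
def sort_by_penalty_density(tasks):
    """Sort tasks (c_i, p_i, chi_i) non-increasingly by chi_i / c_i."""
    return sorted(tasks, key=lambda t: t[2] / t[0], reverse=True)


def edf_schedulable(task_set, s_max):
    """EDF test on one processor: sum of c_i / p_i must not exceed S_max."""
    return sum(c / p for c, p, _ in task_set) <= s_max
```

Tasks at the end of the sorted order pay little penalty per rejected cycle, so they are the first candidates for rejection.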

3.1 Off-line derivation of task partitions with negligible switching overheads

Although the power consumption function $P(s)$ is a convex and increasing function, the energy consumption per cycle at speed $s$, which is $\frac{P(s)}{s}$, might not be. For example, if $P(s) = s^3 + \gamma$, $\frac{P(s)}{s}$ is a decreasing function for $s$ in $(0, \sqrt[3]{\gamma/2}]$ and an increasing function for $s$ in $(\sqrt[3]{\gamma/2}, S_{max}]$. If the switching overheads are negligible, there is a lower-bounded execution speed for tasks, referred to as the critical speed $s^*$ as in [6, 8, 12]. For ideal processors, the critical speed $s^*$ can be derived by solving $\frac{d(P(s^*)/s^*)}{ds^*} = 0$ [6]. By the definition, if $s^*$ is smaller than $S_{min}$, the critical speed $s^*$ is revised as $S_{min}$. If $s^* > S_{max}$, $s^*$ is $S_{max}$. For non-ideal processors, the critical speed $s^*$ is $s_h$ with $P(s_{h+1})/s_{h+1} > P(s_h)/s_h$ and $P(s_{h-1})/s_{h-1} \ge P(s_h)/s_h$ for $h = 1, 2, \ldots, H$, by taking $P(s_0)/s_0$ and $P(s_{H+1})/s_{H+1}$ as $\infty$ for boundary checking.

For clarity, we first focus on systems with ideal processors. The extensions to systems with non-ideal processors will be shown by the end of this subsection. A task partition is said to be a feasible solution if all the selected tasks for execution can meet their deadlines.
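To make the critical speed concrete, the sketch below computes it for the example power function $P(s) = s^3 + \gamma$ and for a discrete speed set. The function names are hypothetical; the discrete version assumes a convex $P$, so that the local minimum of $P(s)/s$ described above is also the global one, and it resolves ties toward the higher speed to match the strict and non-strict inequalities.

```python
def critical_speed_ideal(gamma, s_min, s_max):
    """Critical speed for the assumed model P(s) = s**3 + gamma.

    P(s)/s = s**2 + gamma/s is minimized where 2*s - gamma/s**2 = 0,
    i.e. s = (gamma/2)**(1/3); the result is clamped to [s_min, s_max].
    """
    s_star = (gamma / 2.0) ** (1.0 / 3.0)
    return min(max(s_star, s_min), s_max)


def critical_speed_discrete(speeds, power):
    """Among the available speeds, pick the one minimizing P(s)/s."""
    return min(sorted(speeds, reverse=True), key=lambda s: power(s) / s)
```

With $\gamma = 0.25$ the unconstrained critical speed is $\sqrt[3]{0.125} = 0.5$; if only the speeds 0.2, 0.4, 0.6, 0.8, and 1.0 are available, the ratio $P(s)/s$ is smallest at 0.6.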

3.1.1 A branch-and-bound approach for ideal processors

For a given task partition $(T^*_1, T^*_2, \ldots, T^*_M, T^*_{M+1})$, let $\mu_m$ be defined as $\sum_{\tau_i \in T^*_m} \frac{c_i}{p_i}$. If $\mu_m \le S_{max}$ for all $m = 1, 2, \ldots, M$, the earliest-deadline-first (EDF) schedule on each processor that executes all the tasks in $T^*_m$ at speed $\max\{s^*, \mu_m\}$ makes all the tasks in $T^*_m$ complete in time with the minimum energy consumption for the task partition [3]. Therefore, we can apply a depth-first search in a search tree to obtain the task partition $(T^*_1, T^*_2, \ldots, T^*_M, T^*_{M+1})$ with the minimum EP in $O((N + M)N^{M+1})$ time.
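The EP of a given task partition can be evaluated as sketched below. The sketch assumes negligible switching overheads, so a processor whose utilization is below the critical speed runs at $s^*$ and is dormant at zero power for the rest of the time; the function name, argument layout, and this idle-power assumption are choices made for the example.

```python
def partition_ep(tasks, assignment, M, power, s_star, s_max, L):
    """Energy-penalty (EP) of a task partition over one hyper-period L.

    tasks: list of (c_i, p_i, chi_i); assignment[i] in 1..M runs task i on
    that processor, assignment[i] == M + 1 rejects it.
    Returns None if some processor is overloaded.
    """
    util = [0.0] * (M + 1)  # index 0 is unused
    penalty = 0.0
    for (c, p, chi), m in zip(tasks, assignment):
        if m <= M:
            util[m] += c / p
        else:
            penalty += chi * (L / p)  # one penalty per rejected job
    energy = 0.0
    for m in range(1, M + 1):
        if util[m] > s_max:
            return None
        if util[m] > 0:
            speed = max(s_star, util[m])
            # Busy for the fraction util / speed of the hyper-period.
            energy += L * (util[m] / speed) * power(speed)
    return energy + penalty
```

For two tasks `(2, 4, 5.0)` and `(1, 4, 1.0)` on one processor with $P(s) = s^3 + 0.25$, $s^* = 0.5$, and $L = 4$, executing both gives an EP of 2.6875, whereas rejecting the second task gives 2.5.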

The branch-and-bound (BB) approach can be adopted to reduce the time complexity of exploring the solution space. Since homogeneous multiprocessor systems are under consideration, we can, by symmetry, restrict $\tau_1$ to be either executed on the first processor or rejected. In our BB approach, we visit the search tree rooted at $\tau_1$, and the $k$-th level represents the assignment of task $\tau_k$ to a task set $T_m$ with $m = 1, 2, \ldots, M, M + 1$.

Suppose that we are at the $n$-th level in the search tree. The basic pruning condition is the schedulability test: if $\frac{c_n}{p_n} + \sum_{\tau_i \in T_m} \frac{c_i}{p_i}$ is greater than $S_{max}$, the BB approach can eliminate all subsets containing the infeasible subset. The lower-bounded elimination is applied by verifying whether the lower bound of the EP of the feasible solutions for the subsets of solutions rooted at the $n$-th level is lower than the best solution derived so far. If the lower bound is greater than the best solution derived so far, we can prune all the subsets rooted at the $n$-th level. For a specified partition of set $\{\tau_i \mid 1 \le i \le n\}$ into two disjoint sets $T^{\dagger}$ and $T^{\ddagger}$ by rejecting all the tasks in $T^{\ddagger}$ and executing all the tasks in $T^{\dagger}$, Algorithm LEP, shown in Algorithm 1, can be applied to calculate a lower bound of the EP of feasible solutions, where $P^*(s)$ in Steps 4, 6, and 9 is

$$P^*(s) = \begin{cases} P(s), & \text{when } s > s^*, \text{ and} \\ \frac{s}{s^*} P(s^*), & \text{otherwise.} \end{cases} \qquad (1)$$

Algorithm 1: LEP
Input: $T^{\dagger}$, $T^{\ddagger}$, $n$;
1: $T' \leftarrow \{\tau_i \mid n < i \le N\}$;
2: $y_i \leftarrow 0, \forall \tau_i \in T'$; $U_1 \leftarrow \sum_{\tau_i \in T^{\dagger}} \frac{c_i}{p_i}$;
3: for ($i \leftarrow n + 1$; $i \le N$; $i \leftarrow i + 1$) do
4:     let $y_i$ be the value between 0 and 1 which minimizes $P^*\left(\frac{\frac{c_i}{p_i} y_i + U_1}{M}\right) M + (1 - y_i)\frac{\chi_i}{p_i}$ with $\frac{c_i}{p_i} y_i + U_1 \le M \cdot S_{max}$;
5:     if ($y_i < 1$) then
6:         return $L \cdot \left(P^*\left(\frac{\frac{c_i}{p_i} y_i + U_1}{M}\right) M + (1 - y_i)\frac{\chi_i}{p_i} + \sum_{\tau_j \in T^{\ddagger}} \frac{\chi_j}{p_j} + \sum_{j=i+1}^{N} \frac{\chi_j}{p_j}\right)$;
7:     else
8:         $U_1 \leftarrow U_1 + \frac{c_i}{p_i}$;
9: return $L \cdot \left(P^*\left(\frac{U_1}{M}\right) M + \sum_{\tau_j \in T^{\ddagger}} \frac{\chi_j}{p_j}\right)$;

Algorithm 2: BB
Procedure: DFSBB($n$, $X$)
Input: $n$, $X$, where $X_i$ is an integer between 1 and $M + 1$ for $i < n$;
1: for $m \leftarrow 1$; $m \le M + 1$; $m \leftarrow m + 1$ do
2:     if $m \le M$ and $\frac{c_n}{p_n} + \sum_{i: 1 \le i \le n-1 \text{ and } X_i \text{ is } m} \frac{c_i}{p_i} > S_{max}$ then
3:         continue;
4:     $X_n \leftarrow m$;
5:     if $n$ is equal to $N$ then
6:         evaluate the EP by executing $\tau_i$ at the $X_i$-th processor with $X_i \le M$ and rejecting tasks $\tau_i$ with $X_i = M + 1$;
7:         save this task partition if the EP is better than the best solution so far;
8:     else
9:         $T^{\dagger} \leftarrow \{\tau_i \mid 1 \le i \le n \text{ and } X_i \le M\}$;
10:        $T^{\ddagger} \leftarrow \{\tau_i \mid 1 \le i \le n \text{ and } \tau_i \notin T^{\dagger}\}$;
11:        $EP_m \leftarrow$ LEP($T^{\dagger}$, $T^{\ddagger}$, $n$);
12:        if $EP_m$ is greater than the best solution so far then
13:            continue;
14:        else
15:            call DFSBB($n + 1$, $X$)

Procedure: BB()
1: sort tasks in $T$ non-increasingly according to $\frac{\chi_i}{c_i}$;
2: initialize $X$ with $X_i \leftarrow M + 1$, for $i = 1, 2, \ldots, N$;
3: call DFSBB(1, $X$) to obtain the task partition;

The proof that Algorithm LEP provides a correct lower bound of the EP is omitted due to space limitation.

The branch-and-bound approach is presented in Procedure DFSBB in Algorithm 2, in which the search space is pruned with the feasibility test in Step 2 and Step 3 and the lower-bounded elimination between Step 9 and Step 13. The solution in this phase is obtained by calling DFSBB(1, $X$) with the initialization shown in Procedure BB in Algorithm 2.
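The two procedures can be sketched in Python as follows. This is an illustrative approximation rather than the authors' implementation: it works with per-unit-time rates (EP divided by $L$), finds $y_i$ by a numerical ternary search instead of a closed form, uses 0-based task indices, and all function and variable names are assumptions of the sketch.

```python
def p_star(s, power, s_star):
    """Equation (1): P(s) above the critical speed, linear below it."""
    return power(s) if s > s_star else s / s_star * power(s_star)


def lep(tasks, n, accepted_util, rejected_rate, M, power, s_star, s_max):
    """Lower bound on EP / L once tasks[0:n] are decided (sketch of LEP)."""
    u1 = accepted_util

    def rate(u):
        return M * p_star(u / M, power, s_star)

    for i in range(n, len(tasks)):
        c, p, chi = tasks[i]

        def cost(y):
            return rate(u1 + y * c / p) + (1 - y) * chi / p

        lo, hi = 0.0, max(0.0, min(1.0, (M * s_max - u1) / (c / p)))
        for _ in range(60):  # ternary search: cost is convex in y
            a, b = lo + (hi - lo) / 3, hi - (hi - lo) / 3
            lo, hi = (lo, b) if cost(a) <= cost(b) else (a, hi)
        y = (lo + hi) / 2
        if y < 1 - 1e-9:  # task i only partly accepted: reject the rest
            tail = sum(x / q for _, q, x in tasks[i + 1:])
            return cost(y) + rejected_rate + tail
        u1 += c / p
    return rate(u1) + rejected_rate


def branch_and_bound(tasks, M, power, s_star, s_max):
    """Return (best EP / L, assignment, sorted tasks); M + 1 means rejected."""
    tasks = sorted(tasks, key=lambda t: t[2] / t[0], reverse=True)
    N = len(tasks)
    best = [float("inf"), None]
    X = [M + 1] * N
    util = [0.0] * (M + 2)  # util[1..M]; index 0 and M + 1 are unused

    def dfs(n, rejected_rate):
        c, p, chi = tasks[n]
        for m in range(1, M + 2):
            if n == 0 and 1 < m <= M:
                continue  # symmetry: first task on processor 1 or rejected
            if m <= M and util[m] + c / p > s_max + 1e-12:
                continue  # feasibility pruning (Steps 2 and 3)
            X[n] = m
            if m <= M:
                util[m] += c / p
            rej = rejected_rate + (chi / p if m == M + 1 else 0.0)
            if n == N - 1:
                ep = sum(p_star(u, power, s_star) for u in util[1:M + 1]) + rej
                if ep < best[0]:
                    best[0], best[1] = ep, X[:]
            elif lep(tasks, n + 1, sum(util[1:M + 1]), rej,
                     M, power, s_star, s_max) <= best[0] + 1e-9:
                dfs(n + 1, rej)  # bound does not exceed the incumbent
            if m <= M:
                util[m] -= c / p
        X[n] = M + 1

    if N:
        dfs(0, 0.0)
    return best[0], best[1], tasks
```

Multiplying the returned rate by the hyper-period $L$ gives the EP. The small tolerances guard against floating-point noise when comparing a bound with the incumbent solution.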

3.1.2 Polynomial-time algorithms for ideal processors

This section presents efficient algorithms, i.e., in polynomial time, for the determination of the task partition. The rationale behind the proposed algorithms is to select tasks with higher χi
