Design Methodology for Multi-core Real-Time Data Transmission Architectures


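National Science Council, Executive Yuan: Research Project Final Report

Design Methodology for Multi-core Real-Time Data Transmission Architectures (Year 2): Research Results Report (Full Version)

Project type: Individual

Project number: NSC 97-2221-E-011-044-MY2

Execution period: August 1, 2009 to September 30, 2010

Executing institution: Department of Electrical Engineering, National Taiwan University of Science and Technology

Principal investigator: 陳雅淑

Project participants (master's students, part-time assistants): 吳旻修, 鄭群逸, 范林芳, 陳名揚

Report attachments: report on attending an international conference, and published papers

Handling: this project involves patents or other intellectual property rights; open to public inquiry after 2 years

October 20, 2010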


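National Science Council, Executive Yuan: Funded Research Project Final Report

Project title: Design Methodology for Multi-core Real-Time Data Transmission Architectures

Project number: NSC 97-2221-E-011-044-MY2

Execution period: August 1, 2009 to September 30, 2010

Principal investigator: 陳雅淑, Department of Electrical Engineering, National Taiwan University of Science and Technology

Project participants: 吳旻修, 陳名揚, 范林芳, and 鄭群逸, Graduate Institute of Electrical Engineering, National Taiwan University of Science and Technology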

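Abstract (translated from the Chinese original)

Multi-core mobile multimedia devices are the dominant consumer electronics products of this generation, and how to deliver real-time audio/video communication and network services to users as efficiently as possible across multiple cores has become an important issue. Compared with traditional embedded systems, the highly customized on-chip communication architectures of multi-core SoCs increase the complexity of system development. Without an appropriate system analysis mechanism, a design may either fail to meet its performance requirements or allocate excessive resources and raise cost. How to select a suitable communication architecture, communication protocol, and arbiter design under cost and performance considerations has therefore become a pressing problem.

This is a two-year project. The first year investigates how to design a dual-core resource management mechanism that provides an architecture for cooperative computation by dual-core applications, as a basis for further exploring programming models for multi-core architectures.

We propose the concept that each core operates independently, use schedulers to guarantee the completion deadline of the work on each unit, and design a scheduler suited to the co-processor so as to guarantee the performance of each application. For application performance parameters that are difficult to analyze, the project proposes a dynamically adjustable scheduling method that lets system designers tune overall performance or the performance of a single application at run time. Building on the first year's independent-scheduler concept to simplify the communication protocol, the second year proposes a virtualization of multi-core communication, transforms it into an optimization problem, and proposes a system-level analysis method for multi-core real-time data transmission architectures together with an arbiter design method. For highly customized products, we investigate the key factors that affect the difficulty of the optimization problem and, based on the implementation constraints of commercial products, propose a series of analysis and exploration algorithms to reduce design difficulty and resource cost.

Keywords:

Multi-core systems, real-time system scheduling, real-time system resource management, multi-core communication architecture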

Abstract

With the significant driving force from the application domains, mobile multimedia systems are designed over heterogeneous multi-core SoC platforms. On such systems, on-line task scheduling and inter-core communication are very challenging issues. The challenge lies in minimizing the communication cost while simultaneously satisfying the timing constraints of job executions.

In its first year, this project proposes an on-line dual-core scheduling framework for dynamic workloads with real-time constraints, in order to explore the programming model for multi-core systems. The processor and the co-processor are assigned separate schedulers with different scheduling policies, and precedence constraints among tasks are handled through the interaction between the two schedulers. In the second year, based on the job execution model of the first year, we explore bus-layer minimization problems by first identifying the factors that contribute to the NP-hardness of these problems. Problem variants with existing efficient algorithms and NP-hard variants are then identified and elucidated. A simulated annealing algorithm is proposed and compared with heuristics-based algorithms to provide further insights for system designers. A series of extensive simulations is carried out, and a case study is presented to compare different approaches and workloads.

Keywords:

Multi-core system, real-time scheduling, real-time resource management, multi-core communication architecture


1 Introduction

Due to increasing market demand, many embedded systems will eventually run multiple applications on a system-on-chip platform. In particular, multiple processing elements (PEs), such as one or more microprocessors or DSPs, are needed in such platforms to meet the processing demands of a wide range of applications, from smart phones and mobile computing devices to HDTV. On such platforms, computational jobs running on multiple processing elements need to collaborate with each other to accomplish a task such as video encoding or decoding. This collaboration is typically achieved through significant amounts of data exchange between the processing elements, and thus data communication can become a performance bottleneck. To resolve this issue, multi-bus or multi-layer architectures, like those based on the Advanced Microcontroller Bus Architecture (AMBA) [1], have been proposed by various vendors. However, empirical designs and implementations of a bus or layer architecture are likely to over-allocate resources. Our work therefore adopts an approach based on rigorous theoretical foundations, for instance, minimizing the number of bus layers in a multi-layer bus for a system-on-chip (SoC).

Over the past decade, researchers have proposed methods to minimize data communication costs under various timing constraints on multi-bus architectures [10, 14, 16, 17, 23, 26–28]. In particular, some of these approaches [10, 14, 16, 23] try to minimize the number of buses through simulation or empirical study with system-level EDA tools, while those in [17, 26–28, 30] focus on bus partitioning by applying integer linear programming or heuristics-based scheduling algorithms.

Although multi-bus architectures provide effective solutions for minimizing communication costs, they suffer from inflexibility due to the fixed partitioning of PEs at the hardware design stage [26, 32]. To increase bus capacity and provide more flexibility in resource allocation, the multi-layer bus architecture [3] and a multi-channel Network-on-Chip (NoC) architecture [25] have been proposed. The drawback of these techniques is that increasing the number of bus layers or channels increases the cost or area overhead, the power consumption, and the design complexity of SoC systems [2, 3, 7]. For example, the gate count of a multi-layer bus architecture grows exponentially as the number of bus layers and the number of PEs increase [2]. Unfortunately, little work has been done to study the tradeoff between resource utilization and system performance in a multi-layer bus or a multi-channel NoC system.

This work is motivated by the need for a design methodology for bus-layer minimization problems in the multi-layer bus architecture, and by the need for an on-line scheduler for dual-core systems. In contrast to existing simulation approaches [23] and empirical studies, we explore bus-layer minimization problems under different design parameters and different graph topologies. Once the factors that contribute to the NP-hardness of bus-layer minimization problems are identified, the system designer can change program models or software implementations to resolve the communication problem at minimal cost. We also propose a scheduling framework for dual-core systems. Two separate scheduling policies are presented for the processor and the co-processor according to the properties of each. Specifically, the framework deals with the preemption cost of the co-processor and with the cooperation between the two schedulers to enforce precedence constraints among tasks. A fast on-line admission control is also derived to manage dynamic workloads on dual-core systems.

The main contribution of this project is the recommendation of a suitable scheduler and a guide for restructuring the communication specification of a system, such as a surveillance camera, a portable media player, or a transaction terminal, so that bus-layer minimization problems can be resolved within tolerable time and cost. A transaction terminal for EFTPOS payment is presented as a demonstration example and used to provide practical considerations for real-world system designs. In the first year of this project, we explore a dual-core scheduling framework for real-time on-line task scheduling [8]. Unlike traditional multiprocessor systems, the processor and the co-processor in a dual-core system have a master-slave relationship. Each task is first executed on the processor, then dispatched to the co-processor to execute some specific operations, and finally completed on the processor. In the second year, we explore existing bus-layer minimization problems that have efficient algorithms [9], such as those in which every bus transaction (i.e., a data transmission activity between two PEs) has unit execution time. We then explore the factors that contribute to the NP-hardness of many bus-layer minimization problems. Following this, a simulated annealing (SA) algorithm is presented and compared with heuristics-based algorithms to provide insight into system designs. A series of extensive simulation experiments has also been performed using different graph topologies, deadline settings, bus transaction populations, etc.

The rest of this project report is organized as follows: Section 2 defines terminologies and formulates problems. Section 3 proposes algorithms for on-line task scheduling under dual-core systems. Section 4 summarizes problems with efficient solutions. Section 5 presents NP-hard problems and a simulated annealing (SA) approach. Section 6 presents the experimental study of different approaches. Section 7 concludes this project.

2 Problem Definition

Figure 1. Multi-layer Bus System Architecture

In this project, we are primarily concerned with the communication cost minimization problem present in the multi-layer bus architecture. Now, let us consider a real system such as a transaction terminal as an example.

This transaction terminal consists of a 3-layer Advanced High-Performance Bus (AHB) and an Advanced Peripheral Bus (APB) under AMBA, as shown in Figure 1. The AHB supports an MPU (ARM core), a Digital Signal Processor (DSP), and Direct Memory Access (DMA) devices. This bus is also bridged to the lower-bandwidth APB, where most of the peripheral devices in the system, such as the UARTs and timers, are located. The topology of the multi-layer bus consists of a network of shared and dedicated communication channels connected to various hardware components. As shown in Figure 1, the DSP, ARM core, and DMA can access the Ethernet, printer, and audio devices simultaneously under this 3-layer bus architecture.

This transaction terminal provides the EFTPOS payment system. It includes various security algorithms, a visual display for instructional messages, a stereo MP3 codec for voice indication, an Ethernet connection for online transactions, and a thermal printer to print account details. Figure 2(a) shows the three task graphs for account print, message display, and voice indication on a transaction terminal. In the task graphs, each rectangle represents one function block and each arrow represents the data flow of a task. As shown in Figure 2(a), the voice indication task accesses data from NAND flash and memory, performs decoding functions with the DSP, and sends voice data to the audio device, in that order.

Each task graph represents an application on the system, and the transactions of each task graph are described by a bus transaction graph. As shown in Figure 2(b), the data flow of the workload in Figure 2(a) is represented by a set of graphs G = {BTG_1, BTG_2, BTG_3}. A bus transaction graph BTG_i consists of a set of bus transactions. Each bus transaction BT_{i,j} denotes a data transmission between a source processing element and a destination processing element, and is represented by a vertex of BTG_i. For example, BT_{3,2} in the figure is a data transaction of the voice indication task that is sent from memory to the DSP, with execution time e_{BT_{3,2}} = 0.017 ms. Each directed edge between vertices denotes a precedence constraint between bus transactions. Moreover, to satisfy the timing constraint of the corresponding application, each bus transaction graph is given a deadline by which all of its transactions must complete. In this project, we assume that each bus transaction graph forms a directed acyclic graph (DAG). Each bus transaction is non-preemptive and can request only one bus layer at a time for data transmission.

[Figure 2 panels: (a) Task graphs for (1) Account Print, (2) Message Display, and (3) Voice Indication; (b) Bus transaction graphs with transactions BT_{1,1} to BT_{1,4}, BT_{2,1} to BT_{2,2}, and BT_{3,1} to BT_{3,3}, annotated with graph periods (1.67 ms, 16.67 ms, 31.25 ms) and per-transaction execution times.]

Figure 2. Applications of a Transaction Terminal

A bus transaction BT_{i,j} is termed an immediate predecessor of another bus transaction BT_{i,k} if there is a directed edge from BT_{i,j} to BT_{i,k}; BT_{i,k} is then an immediate successor of BT_{i,j} and cannot start until BT_{i,j} completes. BT_{i,j} is termed a predecessor of another bus transaction BT_{i,p} if there is a path of one or more directed edges from BT_{i,j} to BT_{i,p}; BT_{i,p} is then a successor of BT_{i,j}. As shown in Figure 2(b), BT_{1,2} is an immediate predecessor of BT_{1,4}, and BT_{1,4} is a successor of BT_{1,1}.

The execution time of a bus transaction BT_{i,j}, denoted e_{BT_{i,j}}, is the amount of time required to complete BT_{i,j}. We assume that e_{BT_{i,j}} > 0 for all transactions in this project. The release time of BT_{i,j}, denoted r_{BT_{i,j}}, is the time when the bus transaction becomes available for execution (taking the precedence constraints into account). The start time of BT_{i,j}, denoted s_{BT_{i,j}}, is the time when the bus transaction is scheduled for execution. Note that s_{BT_{i,j}} ≥ r_{BT_{i,j}}. The response time of BT_{i,j}, denoted Resp_{BT_{i,j}}, is the amount of time between r_{BT_{i,j}} and the completion time of BT_{i,j}. The deadline of BT_{i,j}, denoted d_{BT_{i,j}}, is the time by which BT_{i,j} must complete its execution. Let p_{BT_{i,j}} denote the priority of BT_{i,j}, where a smaller value denotes a higher priority. Additionally, we assume that the priority of a bus transaction BT_{i,j} remains unchanged from r_{BT_{i,j}} to d_{BT_{i,j}}, because allowing the priority to change dynamically during that period would dramatically increase the hardware complexity of both the processing element and the dynamic-priority arbiter.
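The definitions above can be made concrete with a small data model. The following is a minimal sketch, not the project's implementation: the class and function names are hypothetical, times are in microseconds, and only the 17 µs execution time of BT_{3,2} comes from the text (the other two values are placeholders). The release time computed here is the earliest possible one, assuming no bus contention.

```python
from dataclasses import dataclass, field

@dataclass
class BusTransaction:
    name: str
    exec_time_us: int                          # e_{BT_{i,j}}, must be > 0
    preds: list = field(default_factory=list)  # names of immediate predecessors

def all_predecessors(graph, name):
    """Every transaction with a directed path to `name` (graph must be a DAG)."""
    seen, stack = set(), list(graph[name].preds)
    while stack:
        p = stack.pop()
        if p not in seen:
            seen.add(p)
            stack.extend(graph[p].preds)
    return seen

def earliest_release(graph, name, graph_release=0):
    """Earliest r_{BT_{i,j}}: every predecessor runs as soon as it is released."""
    preds = graph[name].preds
    if not preds:
        return graph_release
    return max(earliest_release(graph, p, graph_release) + graph[p].exec_time_us
               for p in preds)

# Voice indication chain BT3,1 -> BT3,2 -> BT3,3.
# 17 us is the value quoted for BT3,2; 200 and 100 are placeholder values.
voice = {
    "BT3,1": BusTransaction("BT3,1", 200),
    "BT3,2": BusTransaction("BT3,2", 17, ["BT3,1"]),
    "BT3,3": BusTransaction("BT3,3", 100, ["BT3,2"]),
}
print(all_predecessors(voice, "BT3,3"), earliest_release(voice, "BT3,3"))
```

With these placeholder values, BT3,3 has both earlier transactions as predecessors and cannot be released before 217 µs.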

The Multi-layer Bus Minimization problem arises when it is necessary to derive the minimal number of bus layers required to meet the cost and performance constraints of SoC systems. The Multi-layer Bus Minimization problem is deﬁned as follows:

Problem 1 Multi-layer Bus Minimization (MBM)

Given a set of bus transaction graphs G = {BTG_1, BTG_2, ..., BTG_N} and a deadline for each graph, the problem is to minimize the number of bus layers required on a SoC system such that all the deadlines of the graphs in G are satisfied.

MBM is a difficult problem, because the bus transaction scheduling problem on a multi-layer bus is NP-complete in the strong sense even when all the graphs in G have a common release time and a common deadline on a 2-layer bus system. Note that a schedule of a set of bus transaction graphs G is feasible only when all bus transactions meet their deadlines and respect the precedence constraints.
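The feasibility condition can be checked mechanically for a given schedule. The sketch below is illustrative only: the function name and the dictionary layout are assumptions, and each transaction carries the deadline of its graph. It checks deadlines, precedence constraints, and the non-preemptive one-transaction-per-layer rule.

```python
def is_feasible(transactions, schedule, num_layers, release=0):
    """transactions: name -> (exec_time, deadline, [immediate predecessors])
    schedule:     name -> (start_time, layer index in 0..num_layers-1)"""
    finish = {}
    for name, (e, d, _) in transactions.items():
        if name not in schedule:
            return False
        start, layer = schedule[name]
        if start < release or not 0 <= layer < num_layers:
            return False
        finish[name] = start + e
        if finish[name] > d:                      # deadline miss
            return False
    for name, (_, _, preds) in transactions.items():
        start = schedule[name][0]
        if any(start < finish[p] for p in preds):  # precedence violated
            return False
    by_layer = {}
    for name in transactions:
        start, layer = schedule[name]
        by_layer.setdefault(layer, []).append((start, finish[name]))
    for intervals in by_layer.values():
        intervals.sort()
        for (_, f1), (s2, _) in zip(intervals, intervals[1:]):
            if s2 < f1:                           # two transactions overlap on one layer
                return False
    return True

tx = {"a": (2, 5, []), "b": (2, 5, ["a"]), "c": (3, 5, [])}
two_layers = {"a": (0, 0), "b": (2, 0), "c": (0, 1)}
one_layer = {"a": (0, 0), "b": (2, 0), "c": (0, 0)}
print(is_feasible(tx, two_layers, 2), is_feasible(tx, one_layer, 1))
```

In this toy workload the total execution time is 7 against a common deadline of 5, so a second layer is needed; the one-layer schedule shown is rejected because two transactions overlap on the same layer.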

Problem 2 MBM with Common Release Time and Deadline (MBMRD)

Given a set of bus transaction graphs G, a 2-layer bus on the system, and that all the graphs in G have a common release time and a common deadline D, the MBMRD problem is to ﬁnd a scheduling algorithm to minimize the schedule length such that all graphs meet the common deadline.

Theorem 1 The MBMRD problem is NP-complete in the strong sense.

Proof. We reduce an arbitrary instance of the 3-PARTITION problem to an instance of the MBMRD problem. The 3-PARTITION problem is defined as follows: given a multiset S of 3m positive integers whose sum is mB and in which every integer x satisfies B/4 < x < B/2, can S be partitioned into m subsets S_1, S_2, ..., S_m such that the integers in each subset sum to B? The constraint B/4 < x < B/2 forces each subset S_i to contain exactly 3 elements. The 3-PARTITION problem is NP-complete in the strong sense [13].

Given a multiset S that is an instance of the 3-PARTITION problem, we construct a set of bus transaction graphs G for an instance of the MBMRD problem as follows. Each integer in S forms a bus transaction in G whose execution time equals that integer. Consequently, there are 3m bus transaction graphs BTG_1, ..., BTG_{3m} in G, each holding one independent bus transaction (BT_{1,1}, ..., BT_{3m,1}) with execution time between B/4 and B/2, as shown in Figure 3(a). Then a bus transaction graph BTG_{3m+1} with 3m bus transactions BT_{3m+1,1}, ..., BT_{3m+1,3m} is added to G; its precedence constraints are shown in Figure 3(a). The execution time of BT_{3m+1,j} is B for j ∈ {1, 4, 7, ..., 3m − 2}, and the execution time of every other bus transaction in BTG_{3m+1} is 1. The total execution time of BTG_{3m+1} is therefore mB + 2m, and the total execution time of the other bus transaction graphs in G is mB, which is the sum of all integers in S.

Assume that there is an algorithm that can solve any instance of MBMRD. Specifically, given G as constructed above, a 2-layer bus on the system, and a common release time 0 and common deadline D = mB + m for all graphs in G, the algorithm finds a schedule in which all graphs meet the common deadline. If there is a feasible schedule for G, then the corresponding instance of the 3-PARTITION problem with S and B has a solution (one possible schedule is shown in Figure 3(b)). This is because, with the common deadline mB + m, each bus transaction in BTG_{3m+1} with execution time B must be scheduled just after two bus transactions in BTG_{3m+1} with execution time 1, and these two immediate predecessors must be scheduled on different bus layers. Therefore, the other 3m bus transaction graphs can only be scheduled in m slots of duration B on the idle layer (e.g., Layer 2 in Figure 3(b)). In other words, if there is a feasible schedule, the 3m bus transaction graphs can be partitioned into m subsets such that the execution times of the bus transactions in each subset sum to B, which solves the corresponding instance of 3-PARTITION. Conversely, a solution of the 3-PARTITION instance yields a feasible schedule by placing the three transactions of each subset into one of the m slots.

Since the 3-PARTITION problem is NP-complete in the strong sense [13], MBMRD is also NP-complete in the strong sense.
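The construction used in the proof can be written out to check its arithmetic. This is a small sketch under stated assumptions: the function name is hypothetical, and only execution times are generated (the precedence edges of BTG_{3m+1} are those of Figure 3(a) and are not modelled here).

```python
def build_mbmrd_instance(S, B):
    """Map a 3-PARTITION instance (S, B) to the execution times of the MBMRD instance."""
    m, rem = divmod(len(S), 3)
    valid = rem == 0 and sum(S) == m * B and all(4 * x > B and 2 * x < B for x in S)
    if not valid:
        raise ValueError("not a valid 3-PARTITION instance")
    # BTG_1 .. BTG_3m: one independent transaction per integer in S.
    singles = list(S)
    # BTG_{3m+1}: execution time B at positions j = 1, 4, 7, ..., 3m-2, and 1 elsewhere.
    added = [B if j % 3 == 1 else 1 for j in range(1, 3 * m + 1)]
    deadline = m * B + m                      # common deadline D
    return singles, added, deadline

singles, added, D = build_mbmrd_instance([4, 5, 6, 4, 5, 6], 15)
print(sum(singles), added, sum(added), D)
```

For m = 2 and B = 15 the single-transaction graphs total mB = 30, the added graph totals mB + 2m = 34, and the common deadline is mB + m = 32, matching the quantities stated in the proof.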

Corollary 1 The MBM problem is NP-complete in the strong sense.

[Figure 3 panels: (a) One instance of the MBMRD problem with deadline D = mB + m. The 3m BTGs from the 3-PARTITION instance each hold one transaction BT_{i,1}, with execution times between B/4 and B/2 that sum to mB; the newly added BTG_{3m+1} holds BT_{3m+1,1}, ..., BT_{3m+1,3m}, with e_{BT_{3m+1,j}} = B for j = 1, 4, 7, ..., 3m − 2 and 1 otherwise. (b) One possible schedule for the MBMRD problem on Layers 1 and 2, ending at time mB + m, in which each group of three transactions from the 3-PARTITION instance (e.g., e_{BT_{1,1}} + e_{BT_{2,1}} + e_{BT_{3,1}} = B) fills one slot of length B.]

Figure 3. NP-complete Proof of the Multi-layer Bus Minimization with Common Release Time and Deadline (MBMRD) Problem

Proof. This corollary can be proved following the proof of Theorem 1.

Theorem 1 implies that, unless P = NP, there is no polynomial-time algorithm that finds a feasible schedule on the multi-layer bus architecture. Corollary 1 further implies that, unless P = NP, the minimal number of bus layers of the MBM problem cannot be found optimally even in pseudo-polynomial time. In later sections, the complexity of the subproblems of the MBM problem is explored in two parts: P-Time solvable subproblems and NP-complete subproblems. We then present the systematic analysis methodology and propose a near-optimal strategy based on simulated annealing (SA) to resolve the MBM problem.

3 Dual-Core Scheduling

3.1 Overview

In this section, we explore a dual-core scheduling framework for real-time on-line task scheduling. Unlike traditional multiprocessor systems, the processor and the co-processor in a dual-core system have a master-slave relationship. Each task is first executed on the processor, then dispatched to the co-processor to execute some specific operations, and finally completed on the processor.


At run time, each new task is tested by the admission control, which ensures the schedulability of the new task and all existing tasks. If the task does not pass the admission control, it is rejected. Otherwise, it is scheduled on the processor by a deadline-driven preemptive scheduler. Subtasks are scheduled on the co-processor by a bandwidth server, because the arrival time of each subtask is unpredictable and depends on the completion time of the preceding subtask. The major challenges on dual-core systems are the precedence constraints between subtasks and the non-preemptive nature of the co-processor. In the following sections, we present the design of the schedulers and the admission control of this framework.

3.2 Scheduling on the Processor

In this section, we present the design of the processor scheduler in the framework. Under a dual-core system, each task is first scheduled on the processor, and each co-processor subtask is issued by a processor subtask. From the viewpoint of the processor, the task is suspended during the co-processor execution and resumed when that execution completes. Unlike synchronization protocols, we design an independent scheduler for the co-processor and bound the response time of each co-processor subtask. A fuller discussion is presented in the next section.

Because the response time of each co-processor subtask is bounded, there is a separation between any two processor subtasks of a task, and the duration of each separation is bounded. Under this assumption, the processor subtasks of a task can be treated as a set of sporadic tasks. Under the dual-core scheduling framework, all processor subtasks are scheduled by Earliest Deadline First (EDF) [18] with their local deadlines. To meet the timing constraint of each task, we first assign a processor density to each task to bound the processor utilization consumed by each subtask (see Theorem 2). Then, each processor subtask of a task τ_i is assigned a local deadline from the processor density D_i of τ_i (d_{i,j} = a_{i,j} + c_{i,j} / D_i, where a_{i,j} is the arrival time and c_{i,j} the computation time of the subtask). A more precise account of the admission control and scheduling correctness is given later.

3.3 Scheduling on the Co-processor

The objective of this section is to propose a scheduler that bounds the response time of each co-processor subtask. We first propose the concept of preemption points to avoid intolerable blocking times of tasks on the non-preemptible co-processor. Then, we present a bandwidth server to schedule co-processor subtasks.

Because the co-processor is non-preemptive, a task might suffer a blocking time too large for it to be schedulable. As shown in Figure 4, subtask τ_{2,2} is blocked by subtask τ_{1,2} and misses its deadline. To resolve this problem, we insert preemption points into co-processor subtasks, so that the co-processor becomes "semi-preemptive", which allows a bandwidth server to be adopted and the blocking times of subtasks to be bounded.

[Figure 4 timeline on the co-processor: τ_{2,2} arrives while τ_{1,2} is executing, waits until τ_{1,2} completes, and misses its deadline.]

Figure 4. Non-preemptible Co-processor

In our framework, each co-processor subtask is scheduled by the bandwidth server at preemption points. As shown in Figure 5, we insert two preemption points (t_1 and t_2) into the subtask τ_{1,2}. As a result, τ_{2,2} is scheduled on the co-processor at the second preemption point of τ_{1,2}, which is the first preemption point after τ_{2,2} arrives. The schedulability of τ_{2,2} depends on the intervals between the preemption points of τ_{1,2}, because the longest blocking time suffered by τ_{2,2} is the largest such interval. Preemption points must be inserted carefully, since saving and restoring the current state of τ_{1,2} incurs a non-negligible context switch overhead.

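[Figure 5 timeline on the co-processor, with time marks t1 to t4: τ_{1,2} contains two preemption points; τ_{2,2} arrives, is scheduled at a preemption point after a context switch (CS), and completes; τ_{1,2} then resumes after another context switch and completes.]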

Figure 5. Preemption Points on the Co-processor

Preemption point insertion for each subtask can be done by a compiler with "program slicing" [29, 31] to minimize context switch overhead. Each program slice consists of the part of a program that affects the values computed at some point of interest. It is usually specified by a location in the program in combination with a subset of the program's variables. In other words, we use compiler-based program slicing to slice each subtask and insert preemption points between some specific slices. By doing so, the extra context switch overhead of a subtask under "semi-preemptive" scheduling, compared to non-preemptive scheduling on the co-processor, can be minimized. However, the interval between preemption points must be bounded to preserve task schedulability. Preemption point insertion therefore involves a trade-off between the blocking time and the context switch overhead of a subtask. A heuristic for configuring these values is outside the scope of this project. The main contribution of this idea is to provide a way to trade off blocking times against context switch overhead of subtasks on the co-processor. We also present the admission control in later sections to show the impact of different configurations on the scheduling result.
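The trade-off described here can be quantified for a given placement of preemption points. The sketch below is only an illustration: the function name is hypothetical, and the overhead model (one context switch of cost `cs` taken at every preemption point in the worst case) is an assumption, not a result from this project.

```python
def preemption_tradeoff(exec_time, points, cs):
    """Return (longest non-preemptible interval, assumed worst-case switch overhead).

    points: offsets strictly inside (0, exec_time) at which preemption is allowed.
    """
    bounds = [0] + sorted(points) + [exec_time]
    longest = max(b - a for a, b in zip(bounds, bounds[1:]))
    overhead = len(points) * cs   # assumed model: one switch per preemption point
    return longest, overhead

# A subtask of length 12 with a context switch cost of 1 (illustrative numbers).
print(preemption_tradeoff(12, [], 1))            # non-preemptive
print(preemption_tradeoff(12, [4, 8], 1))        # two preemption points
print(preemption_tradeoff(12, [2, 4, 6, 8, 10], 1))
```

More preemption points shorten the longest interval that can block another subtask, at the price of more potential context switches.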

The co-processor subtasks of a task can be treated as a set of sporadic tasks, for reasons similar to those given in the previous section. The Constant Utilization Server (CUS) [24] is used to schedule co-processor subtasks. Although the co-processor is "semi-preemptive", the deadline setting of each subtask still follows that of CUS when the interval between two preemption points is treated as a non-preemptible portion. When each task is scheduled by its own CUS with a certain server size, the response time of each co-processor subtask is bounded by its local deadline. The remaining technical issue is how to assign the server size of each task so that the task meets its deadline constraint and the system utilization is maximized. Because this issue is NP-complete, it can only be addressed by heuristic or search algorithms. Such algorithms are not discussed in this project, but the properties of different CUS size settings are shown in the experiments.

3.4 Scheduling with Precedence Constraints

In this section, we present the Dual-Core Scheduling (DCS) algorithm of the framework. The basic idea is to use deadline assignment to translate subtasks with precedence constraints into independent subtasks with arbitrary arrival times.

For a given task set, we first assign a processor density D_i to each task τ_i. Following earliest deadline first (EDF) rules, each processor subtask is assigned a local deadline by

$$d_{i,j} = a_{i,j} + \frac{c_{i,j}}{D_i}$$

where a_{i,j} is the arrival time and c_{i,j} the computation time of τ_{i,j}. All processor subtasks are scheduled on the processor by EDF with their local deadlines. Each task τ_i is also assigned a server size U_i for its corresponding Constant Utilization Server (CUS), and all co-processor subtasks of τ_i are scheduled by that CUS at preemption points. Notably, preemption points are inserted into all co-processor subtasks by a compiler during off-line analysis. Following CUS rules, when a co-processor subtask τ_{i,j} arrives, its local deadline is assigned by

$$d_{i,j} = \max(a_{i,j},\, d_{i,j-2}) + \frac{c_{i,j}}{U_i}$$

where d_{i,j-2} is the deadline of the immediately preceding co-processor subtask of τ_{i,j}. The scheduler is invoked at each preemption point, and the ready subtask with the earliest local deadline is scheduled first. The server replenishment rules are the same as those of CUS [24]. The local deadline of each processor or co-processor subtask is the latest arrival time of the succeeding subtask, so the scheduling policies of the processor and the co-processor can be different and independent. Moreover, the local deadline of each subtask is assigned from its arrival time, so the precedence constraint of each subtask is not violated.
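The two deadline-assignment rules can be illustrated numerically. This is a minimal sketch with hypothetical function names; it covers only the deadline formulas, not the EDF dispatcher or the CUS replenishment rules.

```python
def processor_deadline(arrival, c, density):
    """Local deadline of a processor subtask: d = a + c / D_i."""
    return arrival + c / density

def coprocessor_deadline(arrival, c, server_size, prev_deadline=0.0):
    """Local deadline of a co-processor subtask: d = max(a, d_prev) + c / U_i,
    where d_prev is the deadline of the preceding co-processor subtask."""
    return max(arrival, prev_deadline) + c / server_size

# Illustrative task with density D_i = 0.5 and server size U_i = 0.25.
d1 = processor_deadline(arrival=10, c=2, density=0.5)             # 10 + 4
d2 = coprocessor_deadline(arrival=d1, c=1, server_size=0.25,
                          prev_deadline=12)                        # max(14, 12) + 4
print(d1, d2)
```

Here the co-processor subtask is assumed to arrive no later than the local deadline of the processor subtask that issues it, which is the property the text relies on to decouple the two schedulers.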

3.5 Admission Control

Corollary 2 A periodic system and a collection of sporadic tasks are schedulable by EDF if the sum of utilization of the former and instantaneous utilization of the latter is no greater than 1 at any time. [24]

Theorem 2 Given a task set with a processor density assignment, if the sum of the processor densities is less than 1, the task set is schedulable on the processor under Dual-Core Scheduling (DCS).

Proof Because no two processor subtasks of a task overlap in time, at most one processor subtask of each task is scheduled at any time. The processor utilization consumed by a task is no more than the maximum processor utilization consumed by any of its subtasks, and this utilization is bounded by the processor density of the task when DCS is adopted. Correctness then follows from Corollary 2.


Theorem 3 Given a task set, if

$$\sum_{i} U_i + \frac{MNPD}{\min_{i,j} \left( c_{i,j} / U_i \right)} \le 1,$$

where the minimum is taken over all co-processor subtasks τ_{i,j}, the task set is schedulable in the co-processor under Dual-Core Scheduling, where MNPD is the biggest interval of preemption points in the co-processor.

Proof Because the interval of preemption points in a co-processor is the non-preemption portion of a task, the correctness follows theorems in CUS [24].

Theorem 4 A task τ_i is schedulable under Dual-Core Scheduling if

$$\sum_{\tau_{i,j} \in \text{processor}} \frac{c_{i,j}}{D_i} + \sum_{\tau_{i,j} \in \text{co-processor}} \left( \frac{2CS + c_{i,j}}{U_i} + MNPD \right) \le d_i,$$

where the sums range over all subtasks j of τ_i, CS is the context switch time of each subtask in the co-processor, and MNPD is the biggest interval of preemption points in the co-processor.

Proof With the processor density and CUS size settings, each task τ_i can be regarded as executing alone on a slower processor and co-processor, following the definition in [24], without considering interference. A task is schedulable only when its response time is no more than its relative deadline. The terms

$$\sum_{\tau_{i,j} \in \text{processor}} \frac{c_{i,j}}{D_i} \quad \text{and} \quad \sum_{\tau_{i,j} \in \text{co-processor}} \frac{c_{i,j}}{U_i}$$

are the computation times on the slower processor and co-processor, respectively. Since co-processor subtasks can only be scheduled at preemption points, each co-processor subtask of τ_i might be blocked by a lower-priority task τ_k for MNPD time units. Besides, without resource sharing, there are only 2 context switches for each subtask [18]. Therefore, the total computation time, blocking time, and context switch time of a task τ_i is bounded by

$$\sum_{\tau_{i,j} \in \text{processor}} \frac{c_{i,j}}{D_i} + \sum_{\tau_{i,j} \in \text{co-processor}} \left( \frac{2CS + c_{i,j}}{U_i} + MNPD \right).$$

Notably, the response time could be bounded more tightly by extending the result proposed in [6], but the computation overhead is costly. In this project, we focus on the scheduling framework design.

Theorem 5 A new task τ_i is acceptable under Dual-Core Scheduling with a given task set only if the conditions of Theorems 2, 3, and 4 all hold.
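The three conditions combine into a single admission test. The following is a sketch under assumptions, not the project's implementation: the function name and the task dictionary layout (`D`, `U`, `deadline`, `proc`, `coproc`) are hypothetical, the Theorem 3 condition is taken as the sum of server sizes plus MNPD divided by the smallest c_{i,j} / U_i, and all numbers are illustrative.

```python
def admit(tasks, mnpd, cs):
    """tasks: list of dicts with keys D (processor density), U (CUS server size),
    deadline (relative deadline d_i), proc / coproc (lists of computation times)."""
    # Theorem 2: total processor density must be less than 1.
    if sum(t["D"] for t in tasks) >= 1:
        return False
    # Theorem 3: server sizes plus the blocking term must not exceed 1.
    rel_deadlines = [c / t["U"] for t in tasks for c in t["coproc"]]
    blocking = mnpd / min(rel_deadlines) if rel_deadlines else 0.0
    if sum(t["U"] for t in tasks) + blocking > 1:
        return False
    # Theorem 4: per-task bound on computation, context switch, and blocking time.
    for t in tasks:
        on_proc = sum(c / t["D"] for c in t["proc"])
        on_coproc = sum((2 * cs + c) / t["U"] + mnpd for c in t["coproc"])
        if on_proc + on_coproc > t["deadline"]:
            return False
    return True

tasks = [
    {"D": 0.4, "U": 0.3, "deadline": 40, "proc": [2, 2], "coproc": [3]},
    {"D": 0.4, "U": 0.3, "deadline": 60, "proc": [4, 2], "coproc": [6]},
]
print(admit(tasks, mnpd=1, cs=0.5))
```

With these numbers all three checks pass; tightening the first task's deadline to 20 makes the Theorem 4 bound (about 24.3) fail, so the task set would be rejected.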

4 P-Time Solvable Problems

In this section, we discuss the polynomial solvability of the reduced Multi-layer Bus Minimization (MBM) problem, to provide designers with ideas for redesigning the workload or for searching the suboptimal solution by the use of heuristics. The MBM problem can be relaxed into two subproblems: (1) The ﬁrst subproblem is to determine the number of bus layers that should be allocated, provided that feasible schedules exist. A feasible schedule of a set of bus transaction graphs G exists if all bus transactions on the schedule could meet deadlines without violating precedence constraints. (2) The other subproblem is how to determine if there is a feasible schedule, when the set of bus transaction graphs G and number of bus layers are known.

As the MBM problem is NP-complete in the strong sense, this problem can be explored in two steps by providing additional information (e.g., a specific scheduler or cost budget). In particular, the MBM problem is reduced to the first subproblem when the scheduling algorithm for the system is determined or the scheduling policy of the arbiter is fixed. Similarly, the MBM problem is reduced to the second subproblem when the number of layers is bounded by a cost budget.

We now propose optimal algorithms for these two subproblems under some special cases, and show a pseudo-polynomial time algorithm for the MBM problem without precedence constraints. Designers could extend these algorithms by reconstructing the workload with heuristics to obtain suboptimal solutions using the provided information (e.g., a specific scheduler or a cost budget). The properties of the MBM problem that are useful for developing near-optimal solutions are explored in the next section.

To address the first subproblem, we propose the Maximum overlapped bus transactions algorithm, with time complexity O(n²), to determine the number of bus layers that should be allocated, provided that a feasible schedule exists, where n is the number of bus transactions. The basic idea of this algorithm is that the required number of bus layers is equal to the maximum number of bus transactions that have to be executed simultaneously within the timing constraints. The pseudo code of the Maximum overlapped bus transactions algorithm is shown in Algorithm 1.

Initially, the number of bus layers is set to 1 and the current time is set to 0 (Steps 1–2). At each iteration, we find the ready bus transaction BT_{i,j} that has the earliest start time and no uncompleted predecessors. Then, we set the current time t to s_{BT_{i,j}} and assign BT_{i,j} to a free bus layer BUS_k (Steps 3–17). The number of bus layers is increased when a bus transaction is ready but no free bus layer can be assigned (Steps 7–11).

Corollary 3 When a schedule of bus transactions of a set of bus transaction graphs G is given, Algorithm Maximum overlapped bus transactions finds the optimal solution for the MBM problem.


Algorithm 1 Maximum overlapped bus transactions algorithm

Input: A set of bus transaction graphs G and a start time s_{BT_{i,j}} for each bus transaction BT_{i,j} in BTG_i.
Output: The number of bus layers m and m bus transaction partitions in a multi-layer bus system.

1: Let m ← 1
2: Let t ← 0
3: while G is not empty do
4:     Select the bus transaction BT_{i,j} in G with the earliest start time whose predecessors have all been executed
5:     Let t ← s_{BT_{i,j}}
6:     Find a free bus layer BUS_k (time_{BUS_k} < t)
7:     if there are no free bus layers then
8:         Add a new bus layer BUS_new
9:         m ← m + 1
10:        Assign BT_{i,j} to BUS_new
11:        time_{BUS_new} ← t + e_{BT_{i,j}}
12:    else
13:        Assign BT_{i,j} to BUS_k
14:        time_{BUS_k} ← t + e_{BT_{i,j}}
15:    end if
16:    Remove BT_{i,j} from G
17: end while

Proof. Since the schedule of bus transactions in G is given, the latest start time of each bus transaction is known. Therefore, the minimal number of bus layers in a multi-layer bus system is the maximum number of bus transactions that have to be executed simultaneously within the timing constraints.

From the above discussion, the first subproblem of the MBM problem can be solved in polynomial time, and the solution is optimal whenever a feasible schedule of G is given.
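The steps of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the project: the names `BusTransaction` and `max_overlapped_layers` are hypothetical, the start times are assumed to come from a given schedule that already respects the precedence constraints (so processing in start-time order is enough), and a layer is treated as free when its busy-until time is at most t, which assumes that a transaction occupies the half-open interval [start, start + exec_time).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BusTransaction:
    name: str
    start: int      # start time taken from the given schedule
    exec_time: int  # execution time on a bus layer


def max_overlapped_layers(transactions):
    """Return (m, partitions) for a given schedule of bus transactions."""
    busy_until = [0]   # Steps 1-2: one layer, free from time 0
    partitions = [[]]  # partitions[k] holds the names assigned to layer k
    # Assumption: the given start times respect precedence, so start-time
    # order visits each transaction after all of its predecessors.
    for bt in sorted(transactions, key=lambda b: b.start):
        t = bt.start
        for k, free_at in enumerate(busy_until):
            if free_at <= t:          # layer k is free at time t
                break
        else:                         # no free layer: open a new one
            busy_until.append(0)
            partitions.append([])
            k = len(busy_until) - 1
        partitions[k].append(bt.name)
        busy_until[k] = t + bt.exec_time
    return len(busy_until), partitions


if __name__ == "__main__":
    schedule = [
        BusTransaction("A", 0, 4),
        BusTransaction("B", 1, 2),
        BusTransaction("C", 3, 2),
        BusTransaction("D", 4, 1),
    ]
    print(max_overlapped_layers(schedule))
```

Each transaction scans at most m layers, so the sketch stays within the O(n²) bound stated for the algorithm.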

The schedule feasibility in the second subproblem, when G and the number of bus layers are given, can be determined as follows. Assume that each bus transaction can only be assigned to a single bus layer for data transmission. For a given number of bus layers, an optimal schedule of all bus transactions in G can be derived by applying the level strategy [15] when the execution times of all bus transactions are equal and each bus transaction graph is a tree. The strategy is as follows. First, each bus transaction is assigned a label equal to its level on the bus transaction graph. Specifically, the level of a bus transaction with immediate successors is one plus the maximum level of its immediate successors on the graph, whereas the level of a bus transaction with no immediate successor is one. The label of each bus transaction serves as its priority, where a larger value denotes a higher priority. After every bus transaction is assigned a label, the bus transactions are scheduled by priority, subject to the precedence constraints, whenever there is a free bus layer.

Corollary 4 When all bus transactions have unit execution times and each bus transaction graph in a given set of bus transaction graphs G is a tree, the level strategy [15] finds the optimal schedule of G under the given number of bus layers, if any feasible solution exists.

Proof. Each bus transaction can be viewed as a process. The correctness of the corollary directly follows the properties of the level strategy [15].
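The level strategy described above can be sketched in Python as follows. This is an illustrative sketch only: the function names `levels` and `level_schedule` and the dictionary representation (each bus transaction mapped to the list of its immediate successors) are assumptions made for the example, all execution times are taken to be one time unit, and ties between equal levels are broken by name simply to keep the output deterministic.

```python
def levels(successors):
    """Level of each transaction: 1 for sinks, else 1 + max level of successors."""
    memo = {}

    def level(v):
        if v not in memo:
            memo[v] = 1 + max((level(s) for s in successors[v]), default=0)
        return memo[v]

    return {v: level(v) for v in successors}


def level_schedule(successors, num_layers):
    """Return a list of time slots; each slot lists the transactions run in it."""
    lvl = levels(successors)
    pending = {v: 0 for v in successors}  # number of unfinished predecessors
    for succs in successors.values():
        for s in succs:
            pending[s] += 1
    ready = [v for v in successors if pending[v] == 0]
    slots = []
    while ready:
        # Highest level first; at most one transaction per free bus layer.
        ready.sort(key=lambda v: (-lvl[v], v))
        chosen, ready = ready[:num_layers], ready[num_layers:]
        slots.append(chosen)
        for v in chosen:
            for s in successors[v]:
                pending[s] -= 1
                if pending[s] == 0:
                    ready.append(s)
    return slots


if __name__ == "__main__":
    # Hypothetical tree: a, b, c precede d; d and e precede f.
    tree = {"a": ["d"], "b": ["d"], "c": ["d"], "d": ["f"], "e": ["f"], "f": []}
    print(level_schedule(tree, num_layers=2))
```

Under these assumptions, the given number of bus layers is feasible for a common deadline D when the number of returned slots does not exceed D.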

From the study of this subproblem, we find that the MBM problem is difficult because it is hard to find an optimal schedule when the bus transactions in G have precedence constraints. Thus, let us consider the situation in which there is no precedence constraint and all bus transactions share the same deadline. In real applications, the execution times of bus transactions in G might be the same when these bus transactions request an identical destination processing element. Assume that there are k types of processing elements in the system and, consequently, k different execution times of bus transactions in G. Under this assumption, a pseudo-polynomial time dynamic programming algorithm finds an optimal solution for the MBM problem.

Corollary 5 The MBM problem can be solved in pseudo-polynomial time when the bus transactions have no precedence constraints, there are k different execution times of bus transactions, and all bus transaction graphs in a given set of bus transaction graphs G share a common deadline.


Proof. Assume that there are n bus transactions without precedence constraints in the given set of bus transaction graphs G, and that there is a common deadline for all bus transaction graphs. By our definition, the bus transactions in G can be represented by a k-tuple $G = (n_1, n_2, \ldots, n_k)$, where $n_j$ is the number of bus transactions with the j-th type of execution time, and $\sum_{j=1}^{k} n_j = n$. G can be partitioned into subsets of the form $(p_1, p_2, \ldots, p_k)$, where $0 \le p_j \le n_j$; the set of all such subsets is denoted by $G'$. For some subsets in $G'$, all bus transactions in the subset can meet the common deadline in one bus layer; these subsets are denoted by Q. A dynamic programming solution with a time complexity of $O(n^{2k})$ to resolve the MBM problem without precedence constraints is as follows:

$$m = \mathrm{Layer}(n_1, n_2, \ldots, n_k)$$

For a given $G = (n_1, n_2, \ldots, n_k)$ with $\sum_{j=1}^{k} n_j = n$, let

$$G' = \{(p_1, p_2, \ldots, p_k) \mid 0 \le p_j \le n_j,\ j = 1, \ldots, k\}$$

$$Q = \{(q_1, q_2, \ldots, q_k) \in G' \mid \mathrm{Layer}(q_1, q_2, \ldots, q_k) = 1\}$$

$$\mathrm{Layer}(p_1, p_2, \ldots, p_k) = 1 + \min_{q \in Q} \mathrm{Layer}(p_1 - q_1, p_2 - q_2, \ldots, p_k - q_k)$$

where m is the minimal number of bus layers.
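The recurrence above can be sketched as a memoized dynamic program in Python. This is a hedged illustration rather than the project's code: the function name `min_layers` is hypothetical, a configuration q is assumed to fit on one layer exactly when the sum of its execution times does not exceed the common deadline (all transactions are assumed to be released at time 0), only configurations q that are componentwise at most p are subtracted, and the base case Layer(0, ..., 0) = 0 is made explicit.

```python
from functools import lru_cache
from itertools import product


def min_layers(counts, exec_times, deadline):
    """Minimal number of bus layers for counts[j] transactions of exec_times[j]."""
    # Q: non-empty configurations whose transactions fit on a single layer.
    Q = [
        q
        for q in product(*(range(n + 1) for n in counts))
        if any(q) and sum(qj * e for qj, e in zip(q, exec_times)) <= deadline
    ]

    @lru_cache(maxsize=None)
    def layer(p):
        if not any(p):                 # base case: nothing left to place
            return 0
        best = float("inf")            # stays infinite if no layer can hold a job
        for q in Q:
            if all(qj <= pj for qj, pj in zip(q, p)):
                rest = tuple(pj - qj for pj, qj in zip(p, q))
                best = min(best, 1 + layer(rest))
        return best

    return layer(tuple(counts))


if __name__ == "__main__":
    # Hypothetical workload: three transactions of length 2, two of length 3.
    print(min_layers((3, 2), (2, 3), deadline=6))
```

The table has at most one entry per tuple in G′ and each entry scans Q, so the running time grows with the product of the two set sizes.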

When a bus transaction is assigned to a layer, its source processing element and destination processing element have to be connected to that layer. Thus, the above algorithms might result in all processing elements being connected to all layers. The connection minimization problem in each layer is also important and difficult. Nevertheless, it is beyond the scope of this project, and the reader is referred to [30] for further information. Moreover, to enable practical implementation on a multi-layer bus system and to maintain backward compatibility, the assignment of processing element connections to layers is primarily decided by the speeds or the functions of the processing elements. The difficulty of the MBM problem under this restriction is explored in the next section.

From our exploration of the subproblems, we can conclude that the MBM problem can be solved optimally in polynomial time only under the precondition that there is no precedence constraint or that the execution times of all bus transactions are equal. In particular, to derive a feasible solution for MBM, a designer could extend the aforementioned algorithms by removing the precedence constraints among the bus transactions of applications with heuristics, or by re-partitioning the data size such that the execution times of all bus transactions are equal. For example, consider the transaction terminal for EFTPOS payment described in Section 2. If the bus arbiter in this system (such as a Time Division Multiple Access bus arbiter) is inherited from a prior system (e.g., a transaction terminal for credit-card payment), then the required number of bus layers can be determined by applying Algorithm 1.

Conversely, the number of bus layers in this system might be determined by the cost budget or by an existing multi-layer bus development board, such as Versatile [4] or AT91CAP [5]. In that case, scheduler selection can be resolved by reconstructing the bus transaction graphs (e.g., repartitioning functions into different hardware components), according to Corollary 3 and Corollary 4.

5 NP-complete Problems

In this section, we show that other subproblems of the Multi-layer Bus Minimization (MBM) problem remain NP-complete in the strong sense under practical implementation constraints. Based on these results, we further explore the complexity and properties of the MBM problem. We then present a simulated annealing (SA) algorithm to solve the MBM problem and compare it with heuristics-based algorithms in later sections.

5.1 Problem Analysis

In the design flow of an embedded system, processing elements (PEs) may be assigned to a dedicated bus layer for a specific application based on prior experiments or system design. For example, in the transaction terminal described in Section 2, the LCD controller is typically assigned to the second bus layer, as shown in Figure 1. This saves gate count or area, because the number of components such as multiplexers and decoders in the interconnect matrix is reduced [3]. However, the MBM problem remains complex even when each bus transaction has been partitioned into a dedicated bus layer.

[Figure 6, panel (a): One instance of the MBMDB problem with deadline D = mB + m, consisting of 3m BTGs from the 3-PARTITION instance and one newly added BTG (BTG_{3m+1}). Panel (b): One possible schedule of the MBMDB problem on Layer 1 and Layer 2.]

Figure 6. NP-complete Proof of the Multi-layer Bus Minimization with Dedicated Bus (MBMDB) Problem

Corollary 6 Let each bus transaction in a given set of bus transaction graphs G be partitioned into a dedicated bus layer before scheduling. The Multi-layer Bus Minimization with Dedicated Bus (MBMDB) problem is NP-complete in the strong sense.

Proof. This corollary is proved by reducing any instance of the 3-PARTITION problem to an instance of the MBMDB problem. For the multiset S that forms an instance of the 3-PARTITION problem, we construct a set of bus transaction graphs G for the instance of the MBMDB problem as follows: each integer in S forms a bus transaction BT in G, and the execution time of that bus transaction is equal to the value of the integer. Consequently, there are 3m bus transaction graphs in G with independent bus transactions, and the execution time of each bus transaction is strictly between B/4 and B/2, as required by 3-PARTITION. Then, we add to G a bus transaction graph BTG_{3m+1} with 2m bus transactions under precedence constraints, so that BTG_{3m+1} forms a chain, as shown in Figure 6(a). The execution time of each bus transaction BT_{3m+1,j} in the chain with j ∈ {1, 3, 5, . . . , 2m − 1} is B, and the execution time of the other bus transactions in the chain is 1.

Suppose that there is an algorithm that can solve any instance of the MBMDB problem. Given G with the common deadline D = mB + m constructed above, we assume that the system has a 2-layer bus, that all bus transactions with execution time B are partitioned into Layer 1, and that the other bus transactions are partitioned into Layer 2. Under these assumptions, the algorithm finds a schedule such that all graphs meet the common deadline.

A feasible schedule for G exists only if the corresponding instance of the 3-PARTITION problem with S and B has a solution; one possible schedule is shown in Figure 6(b). In any feasible schedule, each bus transaction in BTG_{3m+1} with execution time B must be scheduled immediately after its predecessor with execution time 1, because the chain has a total execution time of mB + m, which equals the common deadline of G. Therefore, the other 3m bus transaction graphs can only be scheduled in the m slots of duration B in Layer 2. In other words, if there is a feasible schedule, the 3m bus transaction graphs can be partitioned into m subsets such that the sum of the execution times in each subset is B. Thus, the corresponding instance of the 3-PARTITION problem with S and B is solved. The 3-PARTITION problem is NP-complete in the strong sense [13], and hence the MBMDB problem is also NP-complete in the strong sense.
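The construction used in the reduction can be made concrete with a short Python sketch. It is only an illustration of the instance described in the proof: the function name `build_mbmdb_instance`, the dictionary layout, and the transaction name strings are hypothetical, and the input check uses the standard 3-PARTITION condition that every integer lies strictly between B/4 and B/2.

```python
def build_mbmdb_instance(S):
    """Build the MBMDB instance of the reduction from a 3-PARTITION multiset S."""
    if len(S) == 0 or len(S) % 3 != 0:
        raise ValueError("S must contain 3m integers with m >= 1")
    m = len(S) // 3
    total = sum(S)
    if total % m != 0:
        raise ValueError("sum(S) must equal m * B for an integer B")
    B = total // m
    if not all(B < 4 * a and 2 * a < B for a in S):
        raise ValueError("each integer must lie strictly between B/4 and B/2")

    # 3m independent transactions, one per integer, dedicated to Layer 2.
    independent = [
        {"name": f"BT{i},1", "exec_time": a, "layer": 2}
        for i, a in enumerate(S, start=1)
    ]
    # Chain BTG_{3m+1}: odd positions take B on Layer 1,
    # even positions take 1 on Layer 2.
    chain = [
        {
            "name": f"BT{3 * m + 1},{j}",
            "exec_time": B if j % 2 == 1 else 1,
            "layer": 1 if j % 2 == 1 else 2,
        }
        for j in range(1, 2 * m + 1)
    ]
    return {
        "m": m,
        "B": B,
        "deadline": m * B + m,
        "independent": independent,
        "chain": chain,
    }


if __name__ == "__main__":
    instance = build_mbmdb_instance([10, 11, 12, 10, 11, 12])
    print(instance["B"], instance["deadline"])
    print([bt["exec_time"] for bt in instance["chain"]])
```

In the sample instance the chain alone takes mB + m time units, so it leaves no slack before the deadline, which is what forces the 3m independent transactions into the m gaps of length B on Layer 2.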

Now, consider another practical implementation issue in an event-driven system. In such a system, a task is triggered by a specific event. Therefore, the release times of the bus transaction graphs might not be the same. When the release times and deadlines are not common to all bus transaction graphs, the MBM problem is still NP-complete.

Corollary 7 Let each bus transaction graph in a given set of bus transaction graphs G have a respective release time and a deadline. The Multi-layer Bus Minimization with Respective Release Times and Deadlines (MBMRTD) problem is NP-complete in the strong sense even when the number of bus layers is 1.

Proof. This corollary can be proven by a transformation from the Sequencing with Release Times and Deadlines (SRTD) problem. The SRTD problem is defined as follows: Given a set T of tasks and, for each task t ∈ T, a length l(t) ∈ Z⁺, a release time r(t) ∈ Z₀⁺, and a deadline d(t) ∈ Z⁺, is there a one-processor schedule for T that satisfies the release time constraints and meets all the deadlines?
