
Industrial Technology Research Institute, Information and Communications Research Laboratories: Academic Subcontract / Industry-Academia Collaboration Research Report

*************************************************

* *

* Mobile Operating System Virtualization Technology *

* Energy-Efficient System Virtualization *

* for Mobile and Embedded Systems *

* *

*************************************************

Contract Number:

Project Period: January 1, 2013 to December 31, 2013

Principal Investigator: 吳真貞 (Jan-Jan Wu)  Signature:

□ Interim Report  ▓ Final Report

Executing Institution: Academia Sinica  Seal:

Date: December 9, 2013


Abstract

As their functionality keeps growing, mobile devices such as smartphones have become one of the important tools in our daily lives. Most smartphones run only a single operating system, but in some situations a user may need a different one, for example, to run an iOS application or game on a Windows Phone. Mobile device virtualization is a technology that allows multiple operating systems, or virtual machines, to run simultaneously on a phone or wireless device.

Virtualization can be achieved in two ways: with a hypervisor or with a microkernel. A microkernel is a highly modular architecture that retains only the minimum functionality needed to run an operating system. Because of this minimality, its footprint is extremely small, which makes it well suited to smartphones and embedded systems. Compared with a hypervisor, a microkernel offers higher security and stability. The key difference between the two is that a microkernel schedules the applications inside each VM as independent processes, whereas a hypervisor schedules whole VMs as its unit. Moreover, since modern processors allow each core to be switched on or off individually and to run at its own voltage, a microkernel can assign processes to suitable cores more effectively.

The goal of this study is to propose a microkernel-based energy-saving solution for embedded systems and mobile devices. To achieve this goal, three parts of the microkernel may need to be modified: (1) the process scheduling algorithm; (2) the way system resources are allocated; and (3) data transfer between running processes. The focus of this phase of the project is the design and implementation of an energy-efficient process scheduling algorithm.

Keywords: microkernel, energy efficiency, scheduling, virtualization, embedded systems, mobile devices


ABSTRACT

As their functionality keeps growing, mobile devices such as smartphones have become an important part of our daily life. Most smartphones run only one OS, but there are occasions that call for different OSes on a smartphone; for example, a user may want to play an Apple (iOS) game on a Windows Phone. Mobile virtualization is a technology that enables multiple operating systems, or VMs, to run simultaneously on a mobile phone or connected wireless device.

Virtualization can be implemented with two alternative approaches: a hypervisor or a microkernel. A microkernel is the near-minimum amount of software that can provide the mechanisms needed to implement an OS. Its small size makes it a better choice for smartphones and embedded systems. Compared with a hypervisor, a microkernel offers higher security and stability. The main difference between the two is the scheduling unit: a microkernel schedules each process in every VM, while a hypervisor takes a whole VM as its scheduling unit.

Since new CPUs can turn individual cores off or adjust their voltage independently, each core may have a different computing power. A microkernel can therefore make better scheduling decisions according to the characteristics of each process and the computing ability of each core.

In this proposed project, we aim to provide an energy-efficient microkernel solution for mobile devices. We propose three main research thrusts to address the energy efficiency issue: (1) energy-aware process scheduling and consolidation, (2) resource allocation optimization, and (3) inter-process communication (IPC) optimization.

Keywords: Microkernel, Energy-efficient, Scheduling, Virtualization, Embedded System, Mobile device


Table of Contents

1. Background, Objectives, and Significance of the Project
2. Problems to Be Solved
3. Research Methods and Procedures (solution)
4. Conclusions (including project execution results for months 1 to 12)


1. Background, Objectives, and Significance of the Project

With the development of wireless communication technology, smartphones have become essential personal devices. As phone capabilities keep improving, many tasks that could once be done only on a computer can now be completed on a phone, thanks to the growing number and speed of processor cores in mobile devices. Driven by this rapid growth, many vendors have poured resources into developing their own mobile operating systems, such as Google's Android, Apple's iOS, and Microsoft's Windows Phone. This multitude of operating systems, however, also brings problems and inconvenience: developers must write separate applications for each system, complicating development, and users must buy different phones to run applications or games tied to a particular system. Mobile OS virtualization addresses exactly this problem.

Figure 1 Comparison of a conventional OS architecture and a microkernel

Microkernels and hypervisors are the two common OS virtualization technologies today. A hypervisor works by encapsulating the guest OS and emulating the hardware, so an unmodified OS can run directly in the hypervisor environment. The hypervisor itself is essentially a complete operating system, including a file system, a scheduler, virtual memory management, inter-process communication (IPC), device drivers, and so on. A microkernel, in contrast, is a highly modular architecture that retains only the minimum functionality needed to run an OS, such as thread management, address-space management, and IPC. Other services, such as device drivers, run in separate user space; if one of them fails, restarting that service is enough to recover. Compared with a hypervisor, which runs all components in privileged mode, a microkernel offers higher security and stability. The clearest example is the change from Windows XP to Windows 7: the former used a hypervisor-like architecture that ran all OS components in privileged mode, so a fault in any component (most often a device driver) would bring the whole system down, while the latter adopts a microkernel-style architecture that moves the failure-prone device drivers out into user space, greatly improving system stability.

Having evolved through three generations, microkernels are now quite stable and mature. Their modular architecture greatly reduces the memory footprint required to run an operating system, which makes them better suited than hypervisors to embedded environments.


2. Problems to Be Solved

Higher performance implies higher energy demand, but a phone is a device with limited energy. As performance keeps improving, how to save power and extend device battery life has become an important research topic.

The main goal of this project is to exploit the differences between microkernels and traditional hypervisors to build a system virtualization technology whose primary aim is energy saving, allowing users to run multiple virtual machines on a smartphone or embedded system through virtualization.

As technology advances, the processors in phones are becoming ever more capable. Besides containing multiple cores, new processors give each core an independent voltage domain, so each core's voltage and clock can be adjusted or switched on and off individually to reduce power consumption. We want to combine the microkernel's ability to directly schedule the individual processes inside each virtual machine with this new per-core voltage and frequency capability of mobile processors, and develop an energy-efficient system virtualization technology that uses system resources effectively, including a core scheduling algorithm designed for power saving.

3. Research Methods and Procedures

Figure 2 System architecture

Figure 2 above shows the overall system architecture of the project. Our system is built on a multi-core ARM development board, on which we deploy the L4 microkernel so that several different operating systems, such as Linux, Android, and Ubuntu, can run simultaneously. On top of the L4 microkernel we add an energy management module, which consists of: (1) a core frequency tuner; (2) an energy-efficient scheduler; and (3) a profiler.

The project proceeds along two tracks: systems and algorithms. The systems track provides the information needed by the upper-level algorithms and the ability to adjust processor power consumption. It consists of: (1) building the microkernel platform; and (2) implementing a processor core frequency tuning mechanism.


(1) Microkernel platform: we build on the Fiasco microkernel (uKernel) developed at TU Dresden and port it to an ARM development board for our experiments.

(2) Processor core frequency tuning: dynamic voltage and frequency scaling (DVFS) is a mechanism for dynamically trading processor performance against energy consumption. Initially it could only adjust the whole processor, i.e., the performance of all cores was scaled together. In 2012 the Exynos 4 Quad, a quad-core processor for mobile devices, extended the mechanism to individual cores, which can also be switched on and off dynamically. With this mechanism the cores in a system are no longer identical: some can be high-performance and others low-power, so existing scheduling algorithms must be adapted to the new hardware architecture to improve energy efficiency. This task adds the new processor's per-core DVFS capability to the microkernel, so that the scheduling algorithm can dynamically adjust or switch individual cores on demand.

On the algorithms track, achieving energy savings requires appropriate modifications to the microkernel. These modifications may involve: (1) the process scheduling algorithm; (2) the way system resources are allocated; and (3) data transfer between running processes. This project focuses mainly on (1): energy-efficient scheduling and execution of virtual machines and processes. The method and steps for energy-efficient scheduling are described below.

To better understand the power consumption of smartphones and embedded systems, and thus make appropriate modifications to the microkernel, we must first build an energy model of the device. Take the processor as an example: the energy consumed per second differs for each core voltage V, and when each core of a multi-core processor runs at a different voltage, the overall energy consumption becomes even more complex. Besides energy, the voltage also affects the core clock, commonly called the frequency, which is the number of cycles the core can process per second and is directly related to its performance.

As for the scheduling algorithm, the Completely Fair Scheduler (CFS) used by the current Linux kernel does not take energy consumption into account. This project proposes a new scheduling algorithm to replace CFS. Unlike scheduling on a hypervisor, where the unit is a whole VM, in a microkernel every application running in any VM is treated as an individual process waiting to be scheduled onto a core. Furthermore, unlike traditional single-core and multi-core scheduling, each core of a modern multi-core processor can adjust its voltage independently or be switched on and off, so cores can have different computing capabilities. Scheduling therefore has to consider not only the type of each process but also the capability of each core.

After the implementation is complete, we will run simulations and also measure the real system, compare the two sets of data, and identify and bridge the gaps between theory and practice.


4. Conclusions

Currently, we have successfully ported the Fiasco microkernel and L4Linux to a Samsung Exynos 4 Cortex-A9 quad-core ARM board. Since, contrary to the claim on the Samsung official web site, the ARM board does not support per-core frequency tuning, we also ported the Fiasco microkernel and L4Linux to a quad-core x86 PC, which does support per-core DVS/DFS, as our tentative experiment platform. We have also developed a prototype of an energy-efficient scheduling framework in the microkernel.

We address the energy-efficient scheduling issues in two contexts: off-line scheduling for computation jobs that arrive in batches, and on-line scheduling for environments in which tasks may arrive at different times and may have different workload characteristics.

We propose mathematical models to formulate these scheduling problems, provide theoretical analysis and new theoretical findings, design optimal algorithms for the off-line scheduling problem on both single-core and multi-core systems, and design an effective approximation algorithm for the on-line scheduling problem. We also report our simulation results and experimental results on the quad-core PC.

4.1 System Design and Implementation

In this section, we first give an overview of the system architecture and the interaction between the system components. We then describe the implementation details of the system.

4.1.1 System Architecture

Figure 1 System Architecture

Figure 1 gives an overview of our system architecture. There are three software layers: the bottom layer is the microkernel, which executes in kernel space (privileged mode); the middle layer consists of the components we developed; and the top layer includes the user applications and virtual machines that execute on top of the microkernel. To achieve energy efficiency on a mobile platform, we add three components to the microkernel: the profiler, the energy-efficient scheduler, and the core frequency tuner.

The profiler collects application runtime performance information, identifies workload characteristics, such as interactive tasks, computation tasks, and background tasks,


analyzes application requirements, such as the tolerable response times for short and long requests respectively, and collects the workload of computing tasks, such as the number of instructions to execute in each scheduling interval.

The energy-efficient scheduler decides the new scheduling policy based on the system status and current task workload collected by the profiler. The policy includes CPU core assignment for each task and task execution arrangement within each core (i.e., the execution order of the tasks on a core).

The core frequency tuner is a device driver which contains a table of supported frequencies and can change the frequency of CPU cores based on the policy generated by the scheduler.

The workflow of these components is as follows. First, the profiler collects application workload related data, and then the scheduler uses these data to decide a new scheduling policy, including task assignment and CPU frequency for executing each task.

If CPU core frequency scaling is needed, the scheduler will request the frequency tuner to scale up/down the core frequency accordingly.
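The profiler, scheduler, and tuner workflow described above can be sketched as a simple control loop. All component names, the frequency table, and the classification rule below are hypothetical placeholders for illustration, not the actual module interfaces:

```python
# Hypothetical sketch of the profiler -> scheduler -> tuner workflow.
# Names and the frequency table are illustrative, not the real L4 API.

FREQ_TABLE = [0.2, 0.5, 1.0, 1.4, 1.7]  # GHz, assumed per-core frequencies

def profile(tasks):
    """Profiler: classify each task and report its per-interval workload."""
    return [{"task": t, "kind": ("interactive" if t["short"] else "compute"),
             "instructions": t["instructions"]} for t in tasks]

def schedule(profiles, n_cores):
    """Scheduler: assign tasks to cores and pick a frequency per core."""
    policy = {c: {"tasks": [], "freq": FREQ_TABLE[0]} for c in range(n_cores)}
    for i, p in enumerate(sorted(profiles, key=lambda x: x["instructions"])):
        core = i % n_cores                      # simple round-robin placement
        policy[core]["tasks"].append(p["task"]["name"])
        # interactive work gets a low frequency, compute work a high one
        want = FREQ_TABLE[1] if p["kind"] == "interactive" else FREQ_TABLE[-1]
        policy[core]["freq"] = max(policy[core]["freq"], want)
    return policy

def apply_policy(policy):
    """Frequency tuner: would program the per-core divider registers here."""
    return {core: cfg["freq"] for core, cfg in policy.items()}

tasks = [{"name": "ui", "short": True, "instructions": 10},
         {"name": "codec", "short": False, "instructions": 900}]
freqs = apply_policy(schedule(profile(tasks), n_cores=2))
```

In the real system the tuner step would write the divider registers described in Section 4.1.2.4 rather than return a dictionary.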

4.1.2 System Implementation

We first give a high-level description on the procedure to bring up microkernel on ARM, to bring up L4Linux on the microkernel, and to tune the CPU frequency.

An installation guide is included in this report as an appendix. More detailed guides and scripts will be included in the final report and the CD. It took us more than two months to get the microkernel system working, mainly because no such installation guide existed; we had to build everything from scratch. We believe this experience and guide may be helpful to other research groups.

This section presents our implementation on the Hardkernel ODROID-X2 development board, which contains four Samsung Exynos4412 ARM Cortex-A9 cores and 2GB of memory. Developing kernel-level software on the ARM platform is not easy because ARM doesn't have a standard platform interface like x86. To make the kernel run on the board, we had to implement many very low-level system components, such as the timer, the interrupt controller, and the UART console driver. This implementation process is called 'porting'.

Based on our experience, a system developer needs to become familiar with the following knowledge before porting the microkernel to a target board. First, ARM is a fully memory-mapped architecture in which all system components and peripheral devices are controlled and accessed with ordinary memory instructions. This is very different from the x86 platform, which uses special instructions for system control and device access. Second, the system boot process: where to place the bootloader, the initialization sequence of system components, and so on. Third, the target board's specification manual, which describes the system memory-map layout and the meaning of each bit of each system component control register. The memory layout gives the starting address and size of each control register.

4.1.2.1 System Booting Process

This section describes the boot process of the ARM platform. After power-up, the processor loads its first instruction from internal SROM and checks the boot monitor, which records where to load the next-stage bootloader. Next, the processor loads the first stage of the U-Boot bootloader, which initializes the UART debug console as well as the SD card device and loads the second stage of U-Boot. The second-stage U-Boot then loads the microkernel, and the microkernel


initializes the other CPUs, the L2 cache, and the MMU, starts the system services, and then loads the first guest virtual machine.

Since ARM does not have a standard platform interface, we had to implement the low-level components the microkernel needs to run on the ODROID board. First we define the memory-map layout that maps the physical address of each system component to a kernel-accessible virtual address, and then implement the essential device drivers, such as the timer, UART console, and interrupt controller. The interrupt controller and the timer are the most important system components. Many pieces of hardware send interrupt signals to the CPU. When the CPU notices such a signal, it triggers a hardware interrupt: the CPU performs a context switch, saving some information about what it was doing, then jumps to execute the "interrupt handler" associated with that particular hardware interrupt. For example, when a user types on a keyboard, the keyboard sends a key interrupt; the CPU then executes the code for the key interrupt, which typically displays a character on the screen or performs a task. The most important interrupt for the kernel is the "timer interrupt", which is emitted at regular intervals by a timer chip. The timer interrupt is mainly used by the OS to control the system execution flow, prevent a user process from monopolizing the CPU, and context-switch to other processes.

Context switching is the procedure of storing the state of an active process so that the CPU can start executing another one. For example, suppose process A, with its address space and stack, is currently being executed and a system call requires jumping to a higher-priority process B; the CPU stores the current state of process A so that it can suspend it, begin executing process B, and, when done, return to the previously executing process A. The details of each implemented component are included in the appendix.

4.1.2.2 Building L4 Microkernel

Figure 2 Architecture of L4 Microkernel

Figure 2 shows the architecture of the L4 microkernel stack, which contains the Fiasco microkernel, L4RE, and guest virtual machines. A microkernel is a minimal kernel that contains only the indispensable parts of an OS kernel, such as the timer and interrupt capture, and provides only the mechanisms for scheduling and page-table manipulation; the microkernel itself does not know which task to schedule or which page entry to modify. To make the microkernel work as a normal kernel, policy is required. L4RE (the L4 Runtime Environment) provides the scheduling policy, page-fault resolver, interrupt handlers, and device drivers. The following sections describe the steps to build a runnable microkernel-based virtualization environment. The first


section describes the tools required to build the kernel, and the following sections describe the build process for the L4 Fiasco microkernel, L4RE, and L4Linux.

A. Preparing Required Tools

To speed up the build process, we need a cross-compiler tool-chain to compile the ARM source code on a faster x86 machine. 'Sourcery G++ Lite 2011.03-41 for ARM GNU/Linux' is a pre-built tool-chain that can be downloaded from Mentor Graphics' website. In addition to the tool-chain, we also need the GNU development tools, such as 'make' and 'autoconf', the u-boot tools to build the image loaded onto the ARM board, and the dialog library for displaying the configuration menu.

B. Fiasco Microkernel & L4RE

After implementing all the components needed for the target board, the kernel can be built by setting the correct options in the configuration menu. Among the many options in the Fiasco microkernel menu, the five settings that must be chosen correctly are the CPU type, platform, UART console, timer, and interrupt controller: the CPU type is ARM, the platform is Samsung Exynos, the UART console number is 1, the timer is the multi-core timer, and the interrupt controller is ExtGIC. The L4RE part contains the policy for the microkernel as well as our profiler, energy-efficient scheduler, and core frequency tuner. To build L4RE, set the Platform option to Exynos4, which defines the starting address for the program loader.

Figure 3 Interaction between L4Linux and the L4 microkernel when creating a new process

C. L4Linux

L4Linux is a modified Linux that runs on the L4 microkernel. The privileged part of the original Linux kernel has been removed and replaced by L4 IPC, which allows all of L4Linux to execute in non-privileged mode. Originally privileged operations, such as device access, task scheduling, page-fault handling, and interrupt handling, are mapped to the L4 API. These mappings are transparent to the user applications running in L4Linux. Figure 3 shows the example of creating a new process in L4Linux and the interaction between L4Linux and the L4 microkernel. When an application wants to create a new process, it invokes the Linux system call 'create_new_process()'. This system call is replaced by the L4 IPC call


'L4_task_new()'. The task dispatcher then calls 'L4_thread_schedule()' into our scheduler module. After the scheduler decides a new scheduling policy, it calls 'L4_thread_switch()' to assign the first task in the new schedule to a CPU core and switch to it to start execution.

4.1.2.3 Energy-efficient Scheduler Modules

Figure 1 shows the system architecture. The middle layer contains three modules we developed: profiler, scheduler, frequency tuner.

The profiler module uses the HPM (hardware performance monitor) to collect application execution information, such as the number of instructions per scheduling interval, and classifies the workload into interactive tasks, tasks with periodic deadlines, computation tasks, background tasks, etc. The development of the profiler is at an early stage; currently, most of the classification is done manually. We will put more effort into the profiler in the next few months.

The scheduler module uses the performance information collected by the profiler module, together with internal data structures such as the task queue, core structure, and task structure, and decides a new policy based on the scheduling algorithms we have developed.

Figure 4 depicts the system diagram of the scheduler.

Figure 4 System diagram of the scheduler

If the policy decides to scale the core frequency up or down, the scheduler module calls 'L4_CoreTuner_set' to tune the frequency of the CPU cores.

4.1.2.4 Core Frequency Tuner Module

To save energy, we tune down the frequency of a CPU core when the current workload does not need that much computing resource. In this section, we describe the implementation of the frequency tuner module on both the x86 and ARM platforms.

A. X86 Platform


Intel x86 processors have special-purpose registers, called MSRs (model-specific registers), for controlling the CPU frequency, and two instructions, rdmsr and wrmsr, for reading from and writing to them. MSR 0x198 is used to read the current frequency and MSR 0x199 to set the desired frequency. Table 1 shows the supported frequencies for the Intel i7-950 processor. The formula for the frequency is as follows:

F_OUT = 133 MHz × Multiplier

Multiplier    Frequency (GHz)
12            1.6
13            1.7
22            2.9
23            3.07

Table 1 Supported frequencies for the Intel i7-950 processor.

The procedure to read/write an MSR is as follows: to read, put the MSR number in the ECX register and issue 'rdmsr'; the result is returned in the EAX register. To write, put the MSR number in ECX, put the desired value in EAX, and issue 'wrmsr'. Figure 5 shows an example of reading and writing an MSR.

Figure 5 Example of tuning frequency by MSR on Intel processor

B. ARM platform

Tuning the core frequency on ARM is more complicated: it requires setting up both the clock generator and the core frequency divider. The APLL is a high-frequency clock generator available on the ARM platform, and CPUDIV is the register that sets the core frequency divider. The equation for the output frequency is:

F_OUT = MDIV × F_IN / (PDIV × 2^SDIV)

Table 2 shows the PDIV/MDIV/SDIV values for each core frequency. The procedure to tune the core frequency is as follows.

For the CPU clock divider (CPUDIV):
1. Write the clock divider value into the CLK_DIV_CPU register
2. Wait until the clock is stable (CLK_DIV_STAT_CPU = 0x00000000)

For the APLL clock generator (APLL_CON0):
1. Set the CPU clock source to another clock, such as the MPLL (CLK_SRC_CPU->MUX_CORE_SEL[16] = 1)
2. Wait until the clock source has changed (CLK_MUX_STAT_CPU[18:16] = 0x2)
3. Set the PDIV/MDIV/SDIV divider values in the APLL
4. Wait until the clock is stable (APLL_CON0->LOCKED[29] = 1)
5. Set the CPU clock source back to the APLL (CLK_SRC_CPU->MUX_CORE_SEL[16] = 0)

Table 2 PDIV/MDIV/SDIV values for each core frequency
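The APLL output-frequency formula above can be checked with a short sketch. The 24 MHz input clock and the divider values below are assumptions for illustration; the real per-frequency settings come from the board's Table 2:

```python
# Sketch of the APLL output-frequency formula given above:
# F_OUT = MDIV * F_IN / (PDIV * 2**SDIV).
# F_IN and the divider values are assumed, not taken from the board manual.

F_IN = 24_000_000  # assumed 24 MHz reference clock input

def apll_fout(mdiv: int, pdiv: int, sdiv: int, fin: int = F_IN) -> int:
    """Output frequency of the APLL for one PDIV/MDIV/SDIV setting."""
    return mdiv * fin // (pdiv * 2 ** sdiv)

# e.g. MDIV=200, PDIV=3, SDIV=1 -> 200 * 24 MHz / (3 * 2) = 800 MHz
print(apll_fout(200, 3, 1))
```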

Scheduling for Multi-cores with per-core DVFS

4.2 Energy-efficient Off-line Task Scheduling

This section presents our approach to energy-efficient off-line task scheduling. First, we introduce our models of tasks, CPU processing rates, and energy consumption. Second, we give a formal definition of the energy-efficient off-line task scheduling problem, present our theoretical findings for this class of scheduling problems, and present our proposed algorithms for task scheduling and CPU frequency tuning in both single-core and multi-core environments. Finally, we report our simulation results as well as the actual experiment results.

4.2.1 Models

A. Task

A task consists of a sequence of instructions to be executed by a processor.

Mathematically, we model a task Tk∈ T as a tuple with the following parameters:

Tk = (Bk, Ak, Dk)

Bk is the number of instructions to be executed, and Ak is the arrival time of Tk. Since we consider only off-line task scheduling in this section, Ak is set to 0 for all Tk ∈ T. Dk is the deadline of Tk: if Tk has a deadline requirement, then Dk > Ak ≥ 0; otherwise Dk ≤ 0, which indicates that Tk has no deadline constraint.

(15)

11

We make two assumptions about tasks. First, tasks are non-preemptive: a running task cannot be interrupted by other tasks. This assumption reduces the overhead caused by context switches and task migration. Second, tasks can be scheduled in any order. Since we focus on off-line scheduling in this section, the arrival time Ak of each task is 0; in other words, the scheduler has the information on all the tasks to be scheduled and can therefore decide their execution order according to its scheduling policy.

B. Processing Rate

Since current processors support DVS/DFS, a processor can run at different processing rates, or frequencies. We model the processing rate pi of task Ti as the number of instructions executed per second by a processor. The value of pi is discrete, and different processor models provide different selections. For example, on an x86 machine with an Intel i7-950, pi ∈ P = {1.6, 1.73, …, 3.06} (GHz); on an ARM board with an Exynos4412, pi ∈ P = {0.2, 0.3, …, 1.7} (GHz). In this work, we assume that the processing rate remains unchanged while a task executes; a rate/frequency change occurs only when a processor core starts to execute a task.

C. Energy Consumption

For a task Tk, let εk be the expected energy consumption of Tk in Joules. We model the energy consumption εk and execution time tk as follows.

εk = Bk × E(pk)
tk = Bk × Q(pk)

where Bk is the number of instructions to be executed in task Tk, pk is the processing rate used for Tk, E(pk) is the amount of energy to process an instruction using processing rate pk, and Q(pk) is the time to process an instruction using processing rate pk. We assume that if the processing rate is fixed (i.e. it does not change during execution of a task), both energy consumption (εk) and execution time (tk) are proportional to the number of instructions (Bk) to be executed.
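The model can be made concrete with a small sketch. The E and Q tables below are invented values for illustration, not measurements:

```python
# Minimal sketch of the task model: eps_k = B_k * E(p_k), t_k = B_k * Q(p_k).
# The E/Q tables are illustrative placeholders, not measured values.

Q = {1.0: 1.0, 2.0: 0.5}   # time to process one instruction at rate p
E = {1.0: 1.0, 2.0: 4.0}   # energy to process one instruction at rate p

def exec_time(B: float, p: float) -> float:
    """t_k = B_k * Q(p_k)"""
    return B * Q[p]

def energy(B: float, p: float) -> float:
    """eps_k = B_k * E(p_k)"""
    return B * E[p]

# Doubling the rate halves the time but quadruples the energy per instruction:
B = 10
assert exec_time(B, 2.0) == exec_time(B, 1.0) / 2
assert energy(B, 2.0) == 4 * energy(B, 1.0)
```

This asymmetry (time falls linearly, energy rises superlinearly) is what makes the rate selection in the following sections a non-trivial trade-off.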

4.2.2 Scheduling strategies

For off-line scheduling, we consider two kinds of tasks, with and without deadlines, and two environments, single-core and multi-core. The combinations of tasks and environments give four scenarios. In the following sections, we first define the problem in each scenario and then provide theoretical analysis and algorithm design.

A. Tasks with deadline in single core environment

Given a set of tasks each with a deadline, the problem is to decide the execution order and the processing rate of each task, such that every task can be finished before its deadline and the overall energy consumption is minimized. In the following, we show that this problem is NP-Complete.

(16)

12

      Q     E
pL    2s    1J
pH    1s    4J

Table 3 Example

Assume there are only two processing rates, pH and pL, with pH = 2pL, Q(pL) = 2Q(pH) = 2s, and E(pH) = 4E(pL) = 4J, as shown in Table 3. (Theoretically, E is proportional to the square of 1/Q.) Given n+1 tasks, let the number of instructions of the (n+1)-th task equal the sum over the first n tasks, denoted Bsum. The given time and energy constraints are 2.5Bsum (sec) and 6.5Bsum (J), respectively.

In this case, if the (n+1)-th task uses pL, it takes 2Bsum (sec), and the remaining n tasks can never finish in the remaining 0.5Bsum (sec). Thus the (n+1)-th task must use the higher processing rate pH, which reduces its execution time to Bsum (sec) and consumes 4Bsum (J) of energy. The constraints for the remaining n tasks then become 1.5Bsum (sec) in time and 2.5Bsum (J) in energy. In other words, we must select a subset of the tasks whose instruction counts sum to exactly Bsum/2 to use pH as their processing rate, while the others use pL. This is equivalent to the Partition problem, which is NP-Complete.

We employ dynamic programming to find an optimal solution. For simplicity of presentation and without loss of generality, we assume there are only two processing rates, pH and pL. Let G(i, t) be the minimum energy consumption of finishing the first i tasks before time t. It can be formulated as the following recurrence, which we solve recursively to find the minimum energy consumption.

G(i, t) = min( G(i-1, t - Bi×Q(pL)) + Bi×E(pL),  G(i-1, t - Bi×Q(pH)) + Bi×E(pH) )
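The recurrence can be implemented directly with memoization. The Q/E values below reuse the illustrative numbers from the NP-completeness example (Q(pL)=2, Q(pH)=1, E(pL)=1, E(pH)=4 per instruction); this is a sketch, not the project's actual scheduler:

```python
# Sketch of the dynamic program G(i, t): minimum energy needed to finish the
# first i tasks within an integer time budget t, choosing rate pL or pH per
# task. Q/E values are the illustrative ones from the example above.

import functools

Q = {"L": 2, "H": 1}   # time per instruction at each rate
E = {"L": 1, "H": 4}   # energy per instruction at each rate

def min_energy(B, deadline):
    """B: tuple of instruction counts; deadline: common time budget."""
    @functools.lru_cache(maxsize=None)
    def G(i, t):
        if t < 0:
            return float("inf")   # this branch overruns the budget
        if i == 0:
            return 0              # no tasks left, no energy needed
        b = B[i - 1]
        return min(G(i - 1, t - b * Q[r]) + b * E[r] for r in ("L", "H"))
    return G(len(B), deadline)

# Three 1-instruction tasks with budget 5: two can run at pL (2 s, 1 J each)
# and one must run at pH (1 s, 4 J), so the minimum energy is 6 J.
print(min_energy((1, 1, 1), 5))
```

With a looser budget of 6, all three tasks fit at pL and the answer drops to 3 J, matching the intuition that slack time converts directly into energy savings.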

B. Tasks with deadline in multi-core environment

According to the previous section (4.2.2.A), deciding the processing rate of each task under time and energy constraints on a single core is NP-Complete. Extending from single-core to multi-core increases the complexity of the problem, so the problem remains NP-Complete. To solve it, we use an Earliest-Deadline-First Round-Robin heuristic to dispatch tasks to cores, and then apply dynamic programming to recursively find the minimum energy consumption on each core.

C. Tasks without deadline in single core environment

The results presented in Sections C and D for scheduling tasks without deadlines are useful for energy/performance testing and evaluation by mobile device manufacturers, as well as by users of a new mobile device who want to know its energy/performance characteristics. Typically, a batch of jobs is loaded into the system and executed in batch mode. In Sections C and D, we show that an optimal solution (one that minimizes energy consumption without sacrificing performance) exists for both single-core and multi-core environments.

Given a set of tasks without deadlines, the problem is to find an execution order of the tasks and the processing rate for each task such that the overall cost is minimized. Note that we use cost instead of energy consumption, since one can always run every task with the lowest processing rate in order to minimize energy consumption. However, the performance of tasks will greatly suffer using such approach. The cost is a metric that takes both minimizing energy consumption and reducing the waiting time of the tasks into


consideration. Since energy and time cannot be added directly, we introduce a coefficient H in our cost function.

We define the cost function as in Equation 1, where the execution order is T1, T2, T3, ..., Tn:

Ci = H × Σ(j=1..i-1) tj + εi = H × Σ(j=1..i-1) Q(pj)×Bj + E(pi)×Bi        (1)

The cost Ci of task Ti depends on how long Ti has to wait before being processed and on the energy consumed by the task. tj is the execution time of task Tj; according to our model, tj equals Q(pj) times Bj, where Bj is the number of instructions to be executed, pj is the processing rate used for task Tj, and Q(pj) is the time to process an instruction at rate pj. εi is the energy consumption of Ti, which equals Bi times E(pi). H is the cost ratio between time and energy, and it must be non-negative.

For example, given a set of tasks with B = {1, 2, 3, 4, 4, 6}, assume there are only two processing rates, P = {1, 2}, with Q(p) = 1/p and E(p) = p². Figure 6 shows three different execution sequences with different costs.

In Figure 6, every task in the first sequence uses processing rate 1. Tasks T2 and T3 in the second sequence use a higher processing rate (i.e., p2 = p3 = 2). In the third sequence, T1, T2, and T3 use the higher rate. We can observe that sequence 3 has the lowest cost among these three examples. Next, we present our theoretical results on energy-efficient off-line task scheduling for a single core.

Figure 6 Example of execution sequence
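The three sequences of Figure 6 can be recomputed from Equation 1. The source does not state the time/energy ratio H, so H = 2.5 below is an assumed value chosen only so that the comparison comes out as in the figure:

```python
# Sketch reproducing the Figure 6 comparison: B = {1,2,3,4,4,6}, P = {1,2},
# Q(p) = 1/p, E(p) = p^2, cost per Equation 1. H = 2.5 is an assumed value;
# the report does not specify it.

H = 2.5  # assumed cost ratio between time and energy

def total_cost(B, p):
    cost, elapsed = 0.0, 0.0
    for b, rate in zip(B, p):
        cost += H * elapsed + b * rate ** 2   # waiting cost + energy B*E(p)
        elapsed += b / rate                   # t_j = B_j * Q(p_j)
    return cost

B = [1, 2, 3, 4, 4, 6]
seq1 = total_cost(B, [1, 1, 1, 1, 1, 1])   # all tasks at rate 1
seq2 = total_cost(B, [1, 2, 2, 1, 1, 1])   # T2, T3 at the higher rate
seq3 = total_cost(B, [2, 2, 2, 1, 1, 1])   # T1, T2, T3 at the higher rate
assert seq3 < seq2 < seq1                  # sequence 3 is cheapest
```

Speeding up the early tasks pays off because every later task's waiting cost shrinks, which is exactly the position-dependence that Lemma 2 formalizes.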

Lemma 1

If the execution order of the n tasks is T1, T2, T3, ..., Tn, then the total cost can be computed as in Equation 2.

F(T) = Σ(k=0..n-1) [ H × k × Q(pn-k)×Bn-k + E(pn-k)×Bn-k ]        (2)

Proof of Lemma 1

The right-hand term of Equation 2 represents energy consumption (the same as in Equation 1). The left-hand term represents waiting time. In Equation 1, the waiting time of task Ti is the sum of the execution times of the i-1 tasks before Ti; that is, it is the delay imposed on Ti by waiting for the i-1 tasks before it. Here, we formulate the waiting time from a different viewpoint: we define the waiting time


caused by task Tk as the delay it imposes on the n-k tasks after it. Let Tn-k be the (k+1)-th task from the end; there are k tasks after Tn-k. If the execution time of Tn-k is tn-k, then the total waiting time of those k tasks contributed by Tn-k is k × tn-k, as shown in the left-hand term of Equation 2.
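The two viewpoints can be checked numerically: summing per-task costs as in Equation 1 and summing per-task contributions as in Equation 2 give the same total. The Q, E, and H values below are illustrative:

```python
# Numerical check of Lemma 1: the per-task sum of Equation 1 equals the
# per-contribution sum of Equation 2. Q(p)=1/p, E(p)=p^2, H=2.5 are
# illustrative choices, not values from the report.

H = 2.5
Q = lambda p: 1.0 / p
E = lambda p: p * p

B = [3, 1, 4, 2]
p = [1, 2, 1, 2]
t = [b * Q(r) for b, r in zip(B, p)]
n = len(B)

# Equation 1 view: each task pays H * (execution time of the tasks before it).
eq1 = sum(H * sum(t[:i]) + B[i] * E(p[i]) for i in range(n))

# Equation 2 view: the task at position n-k delays the k tasks after it.
eq2 = sum(H * k * t[n - 1 - k] + B[n - 1 - k] * E(p[n - 1 - k])
          for k in range(n))

assert abs(eq1 - eq2) < 1e-9
```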

Lemma 2

The processing rate that minimizes the total cost for a task depends only on the number of tasks waiting after that task.

Proof of Lemma 2

Fn-k(T) = H × k × Q(pn-k) × Bn-k + E(pn-k) × Bn-k
        = ( H × k × Q(pn-k) + E(pn-k) ) × Bn-k
        = C(k, pn-k) × Bn-k

It is obvious that the value of C(k, pn-k) depends only on k (the number of tasks waiting after task Tn-k) and pn-k (the processing rate for Tn-k), and is independent of Bn-k.

Lemma 3

For any four non-negative real numbers a, b, x, and y with a ≥ b and x ≥ y, we have ay + bx ≤ ax + by.

Proof of Lemma 3

(a - b)(x - y) ≥ 0
ax - ay - bx + by ≥ 0
ax + by ≥ ay + bx

Theorem 1

There exists an optimal solution with the minimum cost in which the tasks are in non- decreasing order of the number of instructions.

Proof of optimality

Let C(k) = min over p of C(k, p). Based on Lemma 1 and Lemma 2, when the task execution order is T1, T2, T3, ..., Tn, the minimum total cost is:

F(T) = Σ(k=0..n-1) C(k) × Bn-k

We have C(k) ≤ C(k+1), since we can use the best processing rate of C(k+1) (say p) as the processing rate for C(k). Then

C(k+1) - C(k, p) = H × Q(p) ≥ 0

and since C(k) ≤ C(k, p), it follows that C(k) ≤ C(k+1).

Assume that in the globally optimal solution there are two tasks i and j with Bi > Bj, and task i is placed before task j (that is, task i is executed before task j). According to Lemma 3, swapping these two tasks does not increase the cost. By repeating such swaps until no more task pairs can be swapped, we obtain a task sequence in which the tasks are in non-decreasing order of the number of instructions.


D. Tasks without deadline in multi-core environment

Given a set of tasks without deadlines, deploy these tasks to a set of homogeneous cores and find the execution sequence on each core that minimizes the overall cost. As in the previous section, we minimize total cost rather than energy consumption alone.

Theorem 2

There exists an optimal solution with the minimum cost in which the tasks are assigned in non-decreasing order of the number of instructions to the cores in a round-robin fashion.

Proof of optimality

Based on Theorem 1, a task m with larger Bm should be placed at a position k with smaller C(k) to obtain the optimal solution. Since the cores are homogeneous, the corresponding position k on each core has the same C(k). Thus we can deploy the tasks in a round-robin fashion and achieve the minimum cost.

As in Longest Task Last, we sort the tasks by Bi and deploy them to the cores in a round-robin fashion, starting from the largest task. After all tasks are deployed, the tasks on each core are in descending order, so we reverse each sequence to make them ascending. The final step is to set each task's processing rate to the pm that minimizes C(k). The output is the scheduling plan of each core with the minimum overall cost.
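The deployment and reversal steps can be sketched as executable code. This is a minimal illustration of the round-robin placement only; the final rate-selection step (argmin over C(k, p)) is left abstract, and the function name is ours.

```python
def wbrr(tasks, x):
    """Workload Based Round Robin sketch for x homogeneous cores.
    'tasks' are instruction counts B_i.  Returns, for each core, the
    execution order (ascending B_i) paired with each slot's waiter
    count k, from which the rate argmin_p C(k, p) would be chosen."""
    cores = [[] for _ in range(x)]
    for i, b in enumerate(sorted(tasks, reverse=True)):
        cores[i % x].append(b)          # round-robin, largest first
    for core in cores:
        core.reverse()                  # per-core order becomes ascending
    # slot j of a core holding m tasks has k = m - 1 - j waiters
    return [[(b, len(core) - 1 - j) for j, b in enumerate(core)]
            for core in cores]

print(wbrr([9, 2, 7, 4, 1, 5], 2))
# → [[(2, 2), (5, 1), (9, 0)], [(1, 2), (4, 1), (7, 0)]]
```

Each core ends up with its tasks in ascending order of workload, matching the sequence the theorem requires.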

Algorithm: Workload Based Round Robin
Input: n tasks, x cores
Output: execution sequence of each core, and processing rates of tasks
1. Sort the tasks by Bi in descending order.
2. From the first (largest) task to the last (smallest) do
3.   Assign the task to a core in a round-robin fashion
4. end do
5. Reverse the execution order of the tasks in each core
6. For each task in each core do
     Set its processing rate according to its position in the sequence.
7. end do

Algorithm: Longest Task Last
Input: n tasks
Output: execution sequence and processing rates of the n tasks
1. Sort the tasks by Bi in non-decreasing order.
2. For each task Tn-k do
3.   Set its processing rate pn-k to argmin(C(k,p)), p∈P.
4. end do

4.2.3 Simulation and Experimental Results

A. Environment

Since we could not find an ARM platform that supports individual core frequency tuning, we conducted our experiments on a quad-core x86 PC that does.

The environment setting is as follows. We set up the Fiasco microkernel on a quad-core x86 machine. The frequency of each core can be adjusted separately, with 12 frequency choices ranging from 1.6GHz to 3.06GHz; only some of them are used in our experiments.

The benchmark we use is MiBench. Since the execution time of each application in MiBench is short (less than 1 second), we synthesize 16 workloads from these applications, each looping for a different number of iterations. The execution times of these workloads range from 1 second to 17 minutes.

The power consumption is measured by a power meter, DW-6091. The energy consumption is the integral of the power readings over the execution period. Since other components also consume energy, we first measure the power consumption of an idle machine and subtract the idle power from our experimental results.
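The idle-corrected integration can be sketched as a discrete sum over the meter's samples. The sample values below are made up for illustration; they are not the DW-6091 readings from the experiments.

```python
def energy_joules(power_samples, idle_watts, dt=1.0):
    """Approximate the energy integral from power readings taken every
    dt seconds, subtracting the idle baseline as described above."""
    return sum(p - idle_watts for p in power_samples) * dt

print(energy_joules([50, 62, 55, 48], idle_watts=40))  # → 55.0
```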

B. Simulation

We ran some simulations before conducting the experiments: given a set of tasks, use the proposed Workload Based Round Robin to find the optimal schedule with the minimum cost. In the simulation, only two processing rates, 1.6GHz and 3.0GHz, are used.

C. Experiment 1

In this experiment, we implement the scheduling algorithm WBRR in our Energy-Efficient Scheduler and measure the actual energy consumption on an x86 machine.

Table 4 is the cost breakdown of the WBRR schedule plan. The discrepancy in overall cost is 4.75%, which means that our model and simulation are quite close to the real environment. The discrepancy in energy cost is 7.22%, while that in time cost is 2.79%. The larger discrepancy in energy cost is likely caused by the power meter: its readings are integers, which introduces rounding error.

            Cost       Cost (Energy)   Cost (Waiting Time)
Simulation  57842473   25527848        32314630
Experiment  55096215   23684531        31411684
Discrepancy 4.75%      7.22%           2.79%

Table 4 Cost decomposition of WBRR

Figure 7 shows the execution results of WBRR. The x-axis is time in seconds. The y-axes are power in watts and processing rate in GHz, respectively. The cyan-blue line is the power-meter reading, while the other four lines are the processing rates of the four cores.

The figure shows that reducing a core's processing rate changes the power more significantly than finishing tasks earlier and letting the cores idle.


Figure 7 Execution results of WBRR

D. Experiment 2

The second experiment compares the cost of WBRR with the baseline scheduling in the Fiasco microkernel. The baseline schedules incoming tasks to the cores in a FIFO fashion, with every core always at the highest frequency; it does not involve dynamic frequency scaling.

                Cost       Cost (Energy)   Cost (Waiting Time)
WBRR (optimal)  55096215   23684531        31411684
Baseline        74955130   45858410        29096720
Improvement     26.49%     48.35%          -7.96%

Table 5 Cost comparison between WBRR and Baseline

Table 5 demonstrates the cost of WBRR and the baseline. The last row shows the improvement of changing from the baseline to WBRR. The improvement in overall cost is about 26%, mostly from the energy cost. The result shows that even though WBRR results in a slightly longer waiting time than the baseline, it saves a significant amount of energy.

E. Experiment 3

In this experiment, we extend the number of processing rates a task can choose from. We use the following five frequencies: 1.6GHz, 2.0GHz, 2.4GHz, 2.8GHz, and 3.0GHz. First we use the simulation to generate a schedule plan, then run the experiment and measure the cost.


               Cost       Cost (Energy)   Cost (Waiting Time)
2 frequencies  55096215   23684531        31411684
5 frequencies  53371136   23680186        29690950
Improvement    3.13%      0.018%          5.48%

Table 6 Cost comparison between WBRR using 2 and 5 frequencies

Table 6 compares the costs using 2 and 5 processing rates. The last row shows the improvement of using 5 frequencies instead of 2. Most of the improvement comes from the waiting-time cost, while the improvement in energy cost is only 0.018%.

Figure 8 shows the execution results of WBRR using 5 frequencies. The x-axis is time in seconds. The y-axes are power in mW and processing rate in GHz, respectively. The cyan-blue line is the power-meter reading, while the other four lines are the processing rates of the four cores. Even though five frequencies are available, only three of them are used: the number of tasks is not large enough to justify a higher frequency, as implied by Lemma 2.

Figure 8 Execution results of WBRR with 5 frequencies

F. Overall Comparison

Figure 9 shows the power readings of the baseline, WBRR with two frequencies, and WBRR with five frequencies. The energy consumption of each scheduling method is the integral of its power reading. Clearly, even though the baseline finishes the tasks earlier, its energy consumption is significantly larger than WBRR's. The two WBRR variants have similar finishing times and energy consumption.


Figure 9 Execution results of baseline, WBRR with 2 frequencies, and WBRR with 5 frequencies

Table 7 shows the actual energy consumption of the baseline, WBRR with two frequencies, and WBRR with five frequencies. The energy consumption is the integral of the power consumption over the execution time. The improvements of the two WBRR variants over the baseline are about 48%, consistent with the results in Table 5.

          Energy Consumption   Ratio against Baseline
WBRR(2)   31591.53             0.5167
WBRR(5)   31573.58             0.5164
Baseline  61144.55             1.0

Table 7 Actual energy consumption (Joules)

4.3 Energy-efficient On-line Task Scheduling

In this section, we focus on the online task scheduling. First we introduce our task model. Second, we give a formal definition of the energy-efficient online task scheduling problem, and present our solution. The last section gives some preliminary results.

4.3.1 Task Model

Tasks running on a mobile device can be roughly divided into three categories: interactive tasks, computation tasks, and background tasks. Interactive tasks are triggered by the user, and response time is crucial to them. For example, the switching between screens must be smooth while the user slides a finger on the screen. We assume that the workload of an interactive task is small and must be finished in a short time; that is, each interactive task has a deadline.

Computation tasks are generated by applications, for example, a game A.I. Computation tasks also have deadlines, but their time constraints are not as critical as those of interactive tasks, and their amount of computation is larger. Computation tasks can be further categorized into CPU-bound and memory-bound, but we treat them equally for now.


Unlike the two kinds of tasks above, which have arrival times and deadlines, background tasks exist as long as the system is up. Background tasks, such as checking for updates and new messages, are less urgent than the other tasks. We assume that a background task only requires a fixed amount of cycles to "survive".

Based on the characteristics of each kind of task, we model a task as:

Tk = (Bk, Ak, Dk)

where Bk is the number of instructions to be executed, Ak is the arrival time of Tk, and Dk is the deadline of Tk.

              Bk      Ak      Dk
Interactive   ∈ ℕ     ∈ ℕ     ∈ ℕ, Dk < U
Computation   ∈ ℕ     ∈ ℕ     ∈ ℕ, Dk ≥ U
Background    θ       0       -1

Table 8 Task parameters

Table 8 shows the constraints on each kind of task, where U is the length of a fixed time interval. For interactive and computation tasks, the number of instructions, the arrival time, and the deadline are positive integers. An interactive task must finish within one time interval, while a computation task can take more than one.

Background tasks do not have arrival time and deadline. They only require θ cycles in each time interval U.

We further make some assumptions about tasks. First, we know the Bk of each task: by profiling and other techniques, we can estimate the number of instructions needed to finish a task. Second, all tasks are sequential. Third, there are few interactive tasks in the system, since a mobile device is used by a single user. Also, the number of instructions in a task is within the computation capacity of a core during one interval U. Last but not least, tasks can migrate between cores, but the cost of migration is high due to the expensive IPC communication in a microkernel.

4.3.2 Scheduling strategies

A. Objective

For every time interval U, we need to decide the processing rate of each core and make a scheduling decision such that the power consumption is minimized. The schedule must also satisfy the following objectives: every interactive task finishes before its deadline; every computation task executes at least x instructions so that it can finish before its deadline; every background task executes θ instructions.

B. Current solution

We divide time into intervals. For each interval, we are aware of all the running tasks, since we use a microkernel. According to the current tasks and their


characteristics, we make a scheduling decision for that interval. The overhead of the solution must be small, since it is computed every time interval.

Our current solution, called "Iterative Earliest Deadline First + Best Fit" (IEDF+BF), consists of three steps: first, estimate the number of high-frequency cores; second, schedule tasks to cores with a Best-Fit heuristic; third, further adjust core frequencies.

The first step estimates the number of high-frequency cores, with the goal of using as few of them as possible. We use the deadlines of the interactive tasks to determine this number, as in the following algorithm:

This algorithm iteratively finds the least number of high-frequency cores. One thing worth mentioning: if there is no interactive task, or the interactive tasks can meet their deadlines with all cores at the lowest frequency, we set i to 0.

The second step schedules tasks to cores. Before scheduling a task, we must make sure the current processing-rate setting provides enough computation resources. In step one we have already decided on some high-frequency cores and scheduled the interactive tasks according to their deadlines. The remaining tasks are computation and background tasks, each of which requires a number of instructions to be executed within the scheduling interval.

Figure 10 illustrates the computation resources available to computation and background tasks. In this example, the number of high-frequency cores is two. We schedule interactive tasks 1~3 first, since they have deadlines. The rest of the computation resources are reserved for computation and background tasks. If the reserved resources are less than the total requirement of these two kinds of tasks, we increase the processing rates to provide more resources in this interval.

Even with enough resources, deploying tasks to cores is still an NP-Complete problem, so we apply a heuristic, Best-Fit. First, we sort all the computation tasks by their number of instructions in descending order. Starting from the task with the largest Bk, we deploy each task to the core with the minimum remaining resources that can still accommodate it. After deploying all computation tasks, we apply the same method to the background tasks. Figure 11 shows an example after scheduling.
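The Best-Fit placement described above can be sketched as follows. The function name, the return format, and the error policy when nothing fits are our own choices, not the report's implementation.

```python
def best_fit(demands, capacity):
    """Best-Fit deployment sketch: 'demands' are per-task instruction
    requirements for this interval, 'capacity' is the remaining number
    of instructions each core can still execute.  Tasks are taken
    largest first and placed on the fitting core with the least
    remaining room; returns (demand, core) pairs."""
    remaining = list(capacity)
    placement = []
    for d in sorted(demands, reverse=True):
        fits = [c for c, r in enumerate(remaining) if r >= d]
        if not fits:
            raise ValueError("not enough resources; raise a core's rate")
        c = min(fits, key=lambda c: remaining[c])    # tightest fit
        remaining[c] -= d
        placement.append((d, c))
    return placement

print(best_fit([5, 3, 2], [6, 6]))  # → [(5, 0), (3, 1), (2, 1)]
```

Raising `ValueError` corresponds to the case in the text where the reserved resources are insufficient and a core's processing rate must be increased before retrying.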

Algorithm: Iterative Earliest Deadline First
Input: interactive tasks, n cores
Output: the number of high frequency cores
1. Set all the cores to the lowest frequency
2. Sort the interactive tasks by Di in ascending order.
3. For i = 1, n do
4.   Set i cores to a high frequency.
5.   Deploy interactive tasks to these cores using EDF.
6.   if (every interactive task meets its deadline)
7.     break;
8. end do
9. return i


Figure 10 Example of step 1

Figure 11 Example after scheduling

The third step further adjusts core frequencies. For each core, if its tasks can be finished at a lower processing rate, we reduce its frequency; conversely, if the core cannot finish its assigned tasks in time, we increase its frequency.

4.3.3 Simulation

A. Environment

We run some simulations and compare the power consumption of our solution and the baseline. Our simulated environment is a quad-core CPU whose core frequencies can be adjusted separately, with six choices ranging from 200MHz to 1.2GHz. The time interval U is set to 0.1 second.


B. Benchmark

We execute two popular games from Google Play, Candy Crush and Monkey, and collect their workload traces as our input. We also collect the workload trace of an app from a student project. The sampling period is one second, and the frequency of each core is 1.2GHz. Figure 12, Figure 13, and Figure 14 show the results.

Figure 12 Workload changing in Candy Crush

Figure 13 Workload changing in Monkey


Figure 14 Workload changing in VR-project

In these figures, the x-axis is time in seconds and the y-axis is the load. The number of instructions executed each second can be calculated by multiplying the load by the core frequency. In Figure 12, the first 30 seconds are the stage-selection menu, the following 160 seconds are game play, and the last 30 seconds show some statistics. In Figure 14, the game crashes at the 18th second and creates a new thread at the 26th second.

As we can observe, in each app there is only one thread that accounts for most of the workload. These are the "computation threads" we expected. However, we cannot find the "interactive threads" in these figures. The reason might be that user inputs, such as touch events, are handled by the computation thread rather than by a separate interactive thread. More specifically, the computation thread takes the x and y coordinates of the touch as input and computes the corresponding reaction.

Since these computation threads also include the behavior of interactive threads, we duplicate each such thread and assign one copy to be the computation thread and the other to be an interactive thread with a periodic deadline of 0.1 second.

Each computation thread is further divided into "computation tasks". The number of instructions of each computation task equals the number of instructions executed in 1 second, and the deadline of each computation task is set to 1 second.

The interactive thread is likewise divided into "interactive tasks". However, the deadline of each interactive task is 0.1 second instead of 1 second, and its number of instructions is the number executed in 0.1 second.

We also pick another six threads from Candy Crush, three threads from Monkey, and one thread from the VR project, and divide these threads into background tasks.

C. Simulation Method

Starting from time 0, we invoke an "app", which consists of different kinds of threads, each activated at a specific time. In each interval U, we decide a frequency setting for each core that satisfies the deadline requirements of the tasks from each thread. The overall energy consumption is calculated at the end.


Figure 15 Workload changing of each thread in the simulation

Figure 15 demonstrates the workload changing of each thread in our simulation. The app Candy Crush starts at time 0; Monkey starts one minute later. The app VR is executed twice, at time 15 and time 210. Threads with postfix "-I" are interactive threads, those with "-C" are computation threads, and those with "-BO" are background threads.

D. Simulation Results

Table 9 shows the simulation results. We assume that the energy consumption of a 200MHz core in one interval (0.1 second) equals 1 s, where s is the unit of energy consumption in our cost model. The Baseline method always uses the highest frequency during execution. The Adaptive method has two thresholds, 30% and 80%: for each core, if its load in the previous interval exceeds 80%, the frequency is raised one level in this interval (for example, an 800MHz core with 85% load switches to 1.0GHz in the next interval); if the load is below 30%, the frequency is lowered one level in the next interval.
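The Adaptive baseline's frequency rule can be sketched directly. The six-level ladder matches the simulated environment (200MHz to 1.2GHz); the function name and the exact step values are our own reading of the setup.

```python
FREQS = [200, 400, 600, 800, 1000, 1200]  # MHz, the six simulated levels

def adaptive_next(freq, load, low=0.30, high=0.80):
    """After an interval with load above 80% the core steps one
    frequency level up; below 30% it steps one level down; otherwise
    it keeps its current frequency."""
    i = FREQS.index(freq)
    if load > high and i < len(FREQS) - 1:
        return FREQS[i + 1]
    if load < low and i > 0:
        return FREQS[i - 1]
    return freq

print(adaptive_next(800, 0.85))   # → 1000
print(adaptive_next(800, 0.20))   # → 600
print(adaptive_next(1200, 0.95))  # → 1200 (already at the top level)
```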

          Energy     After Normalization
Baseline  2194560s   1.0
Adaptive  1374015s   0.626
IEDF+BF   739080s    0.337

Table 9 Simulation results

As can be seen from Table 9, our method, IEDF+BF, significantly reduces the energy consumption. However, both Adaptive and IEDF+BF have the same drawback: in each interval, some instructions may not be executed within that interval. In this case, our method produces about six times as many such instructions as Adaptive. However, these instructions can be executed in later intervals while still meeting the deadline of each task.

4.3.4 Discussion

In our approach, we assume that the time interval is fixed. However, interactive tasks do not arrive in such a regular way, and the deadline of an interactive task is short; it may be too late to schedule and process it in the next interval. We therefore propose an alternative: when a new interactive task arrives, stop the current interval and start a new one, ensuring that every interactive task is scheduled immediately. As for computation and background tasks, the instructions not executed in the interrupted interval can be compensated for by requesting more resources in the new interval.

Scheduling tasks to cores is an NP-Complete problem. However, this assumes that tasks cannot be preempted by other tasks and cannot migrate between cores. If we allow tasks to be preempted and to migrate to another core at most once per interval, there might be a simple solution for scheduling them. This will be one of our research directions in the next five months.

4.4 Scheduling for big-LITTLE core Architectures

4.4.1 Background

big-LITTLE core [1] is a heterogeneous computing architecture developed by ARM in 2011. A big-LITTLE core system contains two kinds of processors: one relatively slow but with low power consumption, the other with higher processing power but also higher power demand. The intention is to create a multi-core environment that balances performance and power efficiency. For example, CPU-intensive tasks such as gaming or web-page rendering can run on big cores for performance, while tasks such as texting and email run on the low-power LITTLE cores.

The two kinds of processors in a big-LITTLE core architecture must be architecturally compatible, since instructions must be migrated transparently from one to the other. The design proposed by ARM in 2011 used Cortex-A7 for the LITTLE cores and Cortex-A15 for the big cores. One year later, ARM announced another pair, Cortex-A53 and Cortex-A57, both ARMv8 cores, on a big-LITTLE core chip.

Currently there are three different models for arranging processor cores in a big-LITTLE core architecture design: cluster migration, CPU migration, and heterogeneous multi-processing. These three arrangements affect the design of kernel scheduling.

Cluster migration is the simplest design. Cores are divided into two clusters, one containing only big cores and the other only LITTLE cores. The scheduler sees only one of the two clusters at a time. When the load of the LITTLE cluster reaches a threshold, the system activates the big cluster, transfers all relevant data through the L2 cache, resumes execution of all tasks on the big cluster, and shuts down the LITTLE cluster. The reverse transition happens when the load of the big cluster falls below a certain threshold. Figure 16 demonstrates the idea of cluster migration.
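The threshold-driven switch described above can be sketched as a tiny policy function. The 0.75/0.25 thresholds and the function name are placeholders of our own, not ARM's actual values.

```python
def next_cluster(active, load, up=0.75, down=0.25):
    """Cluster-migration policy sketch: the whole workload runs on one
    cluster at a time and switches when load crosses a threshold."""
    if active == "LITTLE" and load >= up:
        return "big"       # power up big cluster, migrate state via L2
    if active == "big" and load <= down:
        return "LITTLE"    # big cluster drains, fall back to LITTLE
    return active

print(next_cluster("LITTLE", 0.9))  # → big
print(next_cluster("big", 0.1))     # → LITTLE
print(next_cluster("big", 0.5))     # → big
```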


Figure 16 Cluster migration

CPU migration pairs one big core with one LITTLE core; each pair is treated as a virtual core. Only one of the two cores in a pair is powered and processing tasks at a time. The in-kernel switcher (IKS) is responsible for switching tasks between the two cores: the big core is used when the demand, or load, is high; otherwise the LITTLE core is used. The scheduler sees only the virtual core instead of two cores. Figure 17 shows the idea of CPU migration.

Figure 17 CPU migration

Figure 18 Heterogeneous multi-processing

The third model of big-LITTLE core architecture is heterogeneous multi-processing (HMP). In this model, all cores are visible to the scheduler; in other words, all the big and LITTLE cores can be used at the same time. However, the scheduler needs to be aware of the cores' different processing powers while scheduling. For example, threads with higher
