
Institute of Computer Science and Engineering

Enabling OpenCL Support for GPGPU in Kernel-based Virtual Machine

Student: Tsan-Rong Tian

Advisor: Prof. Yi-Ping You


Enabling OpenCL Support for GPGPU in Kernel-based Virtual Machine

Student: Tsan-Rong Tian

Advisor: Dr. Yi-Ping You

National Chiao Tung University

Institute of Computer Science and Engineering

Master's Thesis

A Thesis

Submitted to the Institute of Computer Science and Engineering

College of Computer Science

National Chiao Tung University

in Partial Fulfillment of the Requirements

for the Degree of Master in Computer Science

September 2011

Hsinchu, Taiwan, Republic of China


Enabling OpenCL Support for GPGPU in Kernel-based Virtual Machine

Student: Tsan-Rong Tian

Advisor: Dr. Yi-Ping You

Institute of Computer Science and Engineering, National Chiao Tung University

Chinese Abstract

In today's high-performance computing field, parallel computing on heterogeneous multi-core systems has become a major trend, and properly exploiting the computational strengths of different kinds of cores can greatly improve performance. OpenCL is a programming model proposed for the increasingly popular heterogeneous multi-core computing environment; it helps developers write efficient, portable heterogeneous multi-core programs and improve computing performance. However, OpenCL is currently not supported in system virtualization environments, so system virtualization cannot be used to achieve better management of OpenCL computing resources. In this thesis we propose an OpenCL virtualization framework based on the KVM virtual machine and use API Remoting to multiplex OpenCL computing resources. The proposed OpenCL virtualization framework consists of (1) an OpenCL library for the guest virtual machine environment, responsible for packing OpenCL API requests and unpacking responses; (2) Virtio-CL, a virtual device responsible for data transfer between guest virtual machines and the hypervisor; and (3) a new thread, called the CL thread, which performs the actual invocation of OpenCL API functions. Owing to the nature of API Remoting, the amount of data transferred between the OpenCL host and devices directly affects the virtualization overhead. Our experiments show that the device-intensive OpenCL benchmarks we selected incur only a small virtualization overhead, 6.4% on average, and that the overhead grows only slightly as the number of guest virtual machines increases, which indicates that our virtualization framework enables effective management of OpenCL computing resources.

Keywords: OpenCL, system virtualization, KVM, GPU virtualization, API Remoting, Virtio


Student: Tsan-Rong Tian

Advisor: Dr. Yi-Ping You

Institute of Computer Science and Engineering

National Chiao Tung University

ABSTRACT

Heterogeneous multi-core programming has become more and more important, and OpenCL, an open industry standard for parallel programming, provides a uniform programming model for programmers to write efficient, portable code for heterogeneous compute devices. However, OpenCL is not supported in system virtualization environments, which offer opportunities for better resource utilization. In this thesis we propose an OpenCL virtualization framework based on the Kernel-based Virtual Machine (KVM) that uses API Remoting to multiplex multiple guest virtual machines (guest VMs) over the underlying OpenCL resources. The framework comprises three major components: an OpenCL library implementation in guest VMs for packing/unpacking OpenCL requests/responses; a virtual device, called Virtio-CL, which is responsible for the communication between guest VMs and the hypervisor; and a new thread, called the CL thread, which is dedicated to OpenCL API invocation. Although the overhead of the proposed framework is directly affected by the amount of data transferred between the OpenCL host and devices because of the nature of API Remoting, the experiments demonstrate that the framework incurs only a small virtualization overhead (6.4% on average) for common device-intensive OpenCL programs and scales well as the number of guest VMs in the system increases, which indicates effective utilization of the underlying OpenCL devices.

Keywords: OpenCL, system virtualization, KVM, GPU virtualization, API Remoting, Virtio


Acknowledgements

First and foremost, I sincerely thank my advisor, Prof. Yi-Ping You, for his guidance. His rigorous approach to research, careful instruction, and inspiration helped me grow in both academic research and professional ability, and taught me a meticulous attitude; in both scholarship and conduct he is a model I deeply respect. I thank Prof. Wuu Yang, Prof. Wei-Chung Hsu, and Prof. Jyh-Jiun Shann of our research group for their guidance and suggestions, which pointed out aspects of the research I had overlooked and allowed this work to be completed smoothly. I also thank my oral defense committee member, Prof. Yeh-Ching Chung, for his comments on the thesis, which offered different perspectives and made this work more complete. The professors' spirit of pursuing every question to its root is an example I will keep learning from. I thank the senior lab members 世融 and 柏瑲 for looking after me during these two years, my classmates 羽軒 and 深弘 for their mutual encouragement and help, and the junior members 翰融, 聖偉, 思捷, and 睦昂 for bringing laughter and energy to the lab; it has been a pleasure working with all of you, and you made my graduate life much richer. Most importantly, I thank my parents for their care and encouragement; with your support I was able to concentrate on my research without worry and complete my master's studies. Finally, I thank all the teachers and friends who helped and encouraged me along the way.

Tsan-Rong, at National Chiao Tung University, Hsinchu, late summer 2011


Contents

Chinese Abstract
Abstract
Acknowledgements
Contents
List of Figures
List of Tables

1 Introduction
  1.1 Motivation
  1.2 Thesis Overview

2 Background
  2.1 Introduction to OpenCL
    2.1.1 OpenCL Hierarchy Models
    2.1.2 Platform Model
    2.1.3 Execution Model
    2.1.4 Memory Model
    2.1.5 Programming Model
  2.2 System Virtual Machine
  2.3 CPU Virtualization
  2.4 I/O Virtualization
    2.4.1 Virtualizing Devices
    2.4.2 Virtualizing I/O Activities
  2.5 GPU Virtualization

3 System Design and Implementation
  3.1 KVM Introduction
    3.1.1 Basic Concepts of KVM
    3.1.2 Virtio Framework
  3.2 OpenCL API Remoting in KVM
    3.2.1 Software Architecture
    3.2.2 Execution Flow
    3.2.3 Implementation Details
  3.3 Related Issues of Implementation
    3.3.1 Size of Virtqueues
    3.3.2 Signal Handling
    3.3.3 Data Structures Related to Runtime Implementation
    3.3.4 OpenCL Memory Objects
    3.3.5 Enhancement of the Guest OpenCL Library

4 Experimental Results
  4.1 Environment
    4.1.1 Testbed
    4.1.2 Benchmarks
  4.2 Evaluation
    4.2.1 Virtualization Overhead

5 Related Work

6 Conclusion and Future Work
  6.1 Summary
  6.2 Future Work


List of Figures

2.1 OpenCL platform model (adapted from [18])
2.2 An example of NDRange index space (adapted from [18])
2.3 Conceptual OpenCL device architecture with processing elements, compute units and devices (adapted from [18])
2.4 Native and hosted VM systems (adapted from [23])
2.5 Types of instructions and their relationship with respect to CPU virtualization
2.6 The relationship among guest OSes, the hypervisor, and hardware-assisted virtualization, using Intel VT-x as an example (adapted from [3])
2.7 Intel EPT translation details (adapted from [3])
2.8 Major interfaces in performing an I/O action (adapted from [23])
3.1 KVM overview
3.2 KVM process model (adapted from [20])
3.3 KVM execution flow in vCPU view (adapted from [20])
3.4 Virtio architecture in KVM
3.5 High-level architecture of virtio (adapted from [17])
3.6 Object hierarchy of the virtio front-end (adapted from [17])
3.7 Architecture of OpenCL API Remoting in KVM
3.8 Execution flow of OpenCL API Remoting
3.9 1st synchronization point (CL thread)
3.10 1st synchronization point (VM exit handler)
3.11 2nd synchronization point (read system call)
3.12 The data structures related to the OpenCL runtime implementation
3.13 Shadow mapping mechanism
3.14 Prototype of clCreateBuffer()
3.15 Memory coherence problem in clCreateBuffer()
4.1 Normalized execution time for the Native and 1VM configurations
4.2 The ratio of execution time of OpenCL APIs and OpenCL host code
4.3 The ratio of virtualization overhead of OpenCL APIs and OpenCL host code
4.4 The breakdown of the virtualization overhead
4.5 Profiling information of data transfer and VM exits
4.6 OpenCL virtualization overhead of BS with variable input size
4.7 Profile information of BS with variable input size
4.8 The breakdown of OpenCL virtualization overhead of variable input size for BlackScholes [BS]
4.9 Normalized execution time for the Native, 1VM, 2VM, and 3VM configurations
4.10 A scenario of multiple VMs accessing the vendor-specific OpenCL runtime
4.11 Comparison of virtualization overhead among different virtualization frameworks


List of Tables

2.1 Memory region: allocation and memory access capabilities (adapted from [18])
2.2 Comparisons between API Remoting and I/O pass-through based on the four criteria
3.1 List of supported cl_mem_flags values (adapted from [18])
4.1 Statistics of benchmark patterns
4.2 Execution time of six OpenCL benchmarks for both the Native and 1VM configurations
4.3 The detailed OpenCL virtualization overhead breakdown
4.4 Profiling information of data transfer and VM exits
4.5 OpenCL virtualization overhead and profile information of BlackScholes [BS] with variable input size
4.6 Configurations of input data size in this work and the related works
5.1 A comparison of API Remoting-based virtualization frameworks

Chapter 1

Introduction

1.1 Motivation

In recent years, heterogeneous multi-core programming has become more and more important. Programmers can leverage the computing power of different heterogeneous devices and make good use of the specific computation strengths of each device to get better performance. Several programming models have been proposed to provide a unified layer for heterogeneous multi-core programming that hides hardware-related details such as different memory organizations and synchronization between the host and devices. Two well-known programming models, CUDA [2] and OpenCL [6], are both designed around a host-device model. CUDA is proposed by NVIDIA and uses CPUs as the host and GPUs as devices. OpenCL is proposed by the Khronos Group and is supported by the industry as the standard for heterogeneous multi-core programming. OpenCL also uses CPUs as the host, but it supports many different architectures, such as CPUs, GPUs, and DSPs, as devices. These programming models help programmers focus on high-level design and implementation.

A unified heterogeneous programming model helps programmers simplify the development process, but resource management issues still remain, since a user may not always fully occupy the resources of the heterogeneous devices. To obtain better resource utilization, system virtualization is a good solution. System virtualization provides an environment that supports multiple operating systems (OSes) executing simultaneously and performs hardware resource management to share physical resources among the OSes for better resource utilization. The performance of CPU and memory virtualization has improved dramatically in the past few years thanks to hardware-assisted approaches, but input/output (I/O) capability and performance are still weak points in system virtualization. Virtualizing GPU devices is difficult because of their closed and rapidly changing architectures, and thus guest OSes have limited access to GPU resources, including general-purpose computing on graphics processing units (GPGPU). So far there is no standard solution for using CUDA/OpenCL in system virtualization, and there has been little research [15] [28] on enabling CUDA support.

Because of the quickly growing demand for heterogeneous multi-core programming and the need for better resource utilization of heterogeneous devices, we believe it is useful to combine the OpenCL programming model with system virtualization. The benefit of enabling OpenCL support in system virtualization is to provide automatic resource management in the hypervisor, which not only frees programmers from worrying about resource utilization issues but also ensures fair resource allocation. Such an approach also brings other benefits, such as cost reduction and easier migration of execution environments, owing to the management scheme provided by the virtual machine. Hence, we build OpenCL support into a system virtual machine (system VM) and prepare an environment for further studies on how to share the hardware resources of heterogeneous devices fairly and efficiently.

1.2 Thesis Overview

In this thesis we present our methodology for enabling OpenCL support in a system virtual machine. To provide the ability to run OpenCL programs in a virtualized environment, we develop an OpenCL virtualization framework in the hypervisor and build a VM-specific OpenCL runtime. We present Virtio-CL, an OpenCL virtualization implementation in the Kernel-based Virtual Machine (KVM) [21], and evaluate the semantic correctness and effectiveness of our approach by comparing it with native execution environments.

The remainder of this thesis is organized as follows. Chapter 2 introduces the basics of OpenCL and system virtualization, and Chapter 3 describes the system design and implementation of OpenCL support in KVM. The performance evaluation is presented in Chapter 4. Chapter 5 discusses related work. Conclusions and future work are presented in Chapter 6.


Chapter 2

Background

In this chapter, we explain the background material needed to understand this thesis, including the fundamentals of OpenCL, an overview of system virtualization, I/O virtualization, and the underlying framework of this work, the Kernel-based Virtual Machine (KVM). For I/O virtualization, we focus on current mechanisms for virtualizing GPU functionalities. Related work is discussed in Chapter 5.

2.1 Introduction to OpenCL

OpenCL (Open Computing Language) is an open industry standard for general-purpose parallel programming of heterogeneous systems. OpenCL is a framework that includes a language, an API, libraries, and a runtime system to provide a unified programming environment for software developers to leverage the computing power of heterogeneous processing devices such as CPUs, GPUs, DSPs, and Cell/B.E. processors. Using OpenCL, programmers can write portable and efficient code, with the hardware-related details exposed through the OpenCL runtime environment. The Khronos Group released the OpenCL 1.0 specification in December 2008. The current version at the time of writing, OpenCL 1.1 [18], was announced in June 2010.

2.1.1 OpenCL Hierarchy Models

The architecture of OpenCL is divided into four hierarchical models: the platform model, memory model, execution model, and programming model. In this section, we briefly introduce each model and the relations between them. Detailed information can be found in the OpenCL Specification [18].

2.1.2 Platform Model

Figure 2.1 defines the platform model of OpenCL. The model includes a host connected to one or more OpenCL devices. An OpenCL device consists of one or more compute units (CUs), which are further composed of one or more processing elements (PEs). The processing elements are the smallest units of computation on a device.

An OpenCL application is designed around this host-device model. The application dispatches jobs (the workloads that will be processed by devices) from the host to the devices, and the jobs are executed by the processing elements within a device. The computation results are transferred back to the host after the execution completes. The processing elements within a compute unit execute a single stream of instructions in a single-instruction, multiple-data (SIMD) or single-program, multiple-data (SPMD) manner. SIMD and SPMD, which are related to the OpenCL programming model, are discussed in Section 2.1.5.

2.1.3 Execution Model

Execution of an OpenCL program is composed of two parts: the host part that executes on the host and the kernels that execute on one or more OpenCL devices. The host defines a context for the execution of kernels and creates command-queues to operate the execution. When the host assigns a kernel to a specific device, an index space is defined to help the kernel locate the resources of the device. We will introduce these terms in the following paragraphs.


Figure 2.1: OpenCL platform model (adapted from [18])

Context

The host defines a context for the execution of kernels. The context includes resources such as devices, kernels, program objects, and memory objects. Devices are the collection of OpenCL devices used by the host. Kernels are the OpenCL functions that run on OpenCL devices. Program objects reference the kernel source code or executables. Memory objects are visible to both the host and the OpenCL devices, and data manipulation by the host and the devices is performed through memory objects. The context is created and manipulated by the host using functions from the OpenCL API.

Command Queue

The host creates command-queues to operate the execution of the kernels. Types of commands include kernel execution commands, memory commands, and synchronization commands. The host inserts commands into a command-queue, and the commands are then scheduled by the OpenCL runtime. Commands relative to each other execute in either in-order or out-of-order mode. It is possible to define multiple command-queues within a single context. These queues can execute commands concurrently, so programmers should use synchronization commands to ensure the correctness of concurrent execution of multiple kernels.
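As a concrete illustration of how context, command-queue, and kernel enqueue fit together, the following minimal host-side sketch uses the OpenCL 1.1 C API; error handling is omitted, and the kernel name my_kernel and the source string are assumptions for the example rather than anything defined in this thesis.

    #include <CL/cl.h>

    /* Minimal sketch: build one kernel and run it over a 1-D NDRange. */
    void run_kernel_once(const char *source, size_t global_size)
    {
        cl_platform_id platform;
        cl_device_id   device;
        clGetPlatformIDs(1, &platform, NULL);
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

        /* The context holds devices, program/memory objects, and kernels. */
        cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);

        /* Commands (kernel execution, memory, synchronization) are submitted
         * to a command-queue and scheduled by the OpenCL runtime. */
        cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, NULL);

        cl_program prog = clCreateProgramWithSource(ctx, 1, &source, NULL, NULL);
        clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
        cl_kernel kernel = clCreateKernel(prog, "my_kernel", NULL);

        /* Enqueue the kernel over a one-dimensional index space. */
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, NULL, 0, NULL, NULL);
        clFinish(queue);   /* synchronization: wait until all commands complete */

        clReleaseKernel(kernel);
        clReleaseProgram(prog);
        clReleaseCommandQueue(queue);
        clReleaseContext(ctx);
    }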

Index Space

The index space supported in OpenCL is divided into a three-level hierarchy: NDRange, work-group, and work-item. An NDRange is an N-dimensional index space, where N is one, two, or three. An NDRange is composed of work-groups, and each work-group contains several work-items, which are the most fundamental execution elements of a kernel. The work-items in a given work-group execute concurrently on the processing elements of a single compute unit.

A work-item is identified by a unique global identifier (ID). Each work-group is assigned a unique work-group ID, and each work-item is assigned a unique local ID within its work-group. Work-groups are assigned IDs using an approach similar to that used for work-item global IDs. With these identifiers, a work-item can identify itself by its global ID or by the combination of its local ID and work-group ID.

An example of the NDRange index space relationships, adapted from the OpenCL Specification, is shown in Figure 2.2. This is a two-dimensional index space in which we define the size of the NDRange (G_x, G_y), the size of each work-group (S_x, S_y), and the global ID offset (F_x, F_y). The total number of work-items is the product of G_x and G_y, and the size of each work-group is the product of S_x and S_y. The global ID (g_x, g_y) is defined as the combination of the work-group ID (w_x, w_y), the local ID (s_x, s_y), and the global ID offset (F_x, F_y):

    (g_x, g_y) = (w_x × S_x + s_x + F_x, w_y × S_y + s_y + F_y)

The number of work-groups can be computed as:

    (W_x, W_y) = (G_x / S_x, G_y / S_y)

The work-group ID can be computed from a global ID and the work-group size as:

    (w_x, w_y) = ((g_x - s_x - F_x) / S_x, (g_y - s_y - F_y) / S_y)


Figure 2.2: An example of NDRange index space (adapted from [18]).
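To make the ID relationships concrete, the following hedged OpenCL C kernel sketch (a hypothetical one-dimensional element-wise addition, not taken from this thesis) shows how a work-item obtains its global, local, and work-group IDs and how they relate through the formula above when the global ID offset is zero.

    __kernel void vec_add(__global const float *a,
                          __global const float *b,
                          __global float *c)
    {
        size_t gid = get_global_id(0);    /* g_x */
        size_t lid = get_local_id(0);     /* s_x */
        size_t wid = get_group_id(0);     /* w_x */

        /* With a zero global ID offset, g_x = w_x * S_x + s_x: */
        size_t same = wid * get_local_size(0) + lid;   /* equals gid */

        c[same] = a[gid] + b[gid];
    }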

A wide range of programming models can be mapped onto this execution model. OpenCL explicitly supports data- and task-parallel programming models.

2.1.4 Memory Model

There are four distinct memory regions: global, constant, local, and private memory. Global memory can be used by all work-items; constant memory is a region of global memory that remains constant during kernel execution; local memory can be shared by all work-items in a work-group; and private memory can only be accessed by a single work-item. Table 2.1 describes the allocation and access capabilities for the host and kernels.

The host uses OpenCL APIs to create memory objects in global memory and to enqueue memory commands that manipulate these memory objects. Data transfers between the host and devices are done either by explicitly copying data or by mapping and unmapping regions of a memory object. The relationship between the memory regions and the platform model is illustrated in Figure 2.3.
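The two transfer styles mentioned above can be sketched as follows; this is an illustrative fragment (not Virtio-CL code) that assumes ctx and queue were created as in the earlier sketch and that host_buf points to N floats, with error handling omitted.

    #include <CL/cl.h>

    static void transfer_examples(cl_context ctx, cl_command_queue queue,
                                  float *host_buf, size_t N)
    {
        cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE,
                                    N * sizeof(float), NULL, NULL);

        /* Style 1: explicit copy from host memory into the memory object. */
        clEnqueueWriteBuffer(queue, buf, CL_TRUE, 0, N * sizeof(float),
                             host_buf, 0, NULL, NULL);

        /* Style 2: map a region of the memory object into host address space,
         * modify it directly, then unmap to hand it back to the device. */
        float *mapped = (float *)clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_WRITE,
                                                    0, N * sizeof(float),
                                                    0, NULL, NULL, NULL);
        mapped[0] = 1.0f;
        clEnqueueUnmapMemObject(queue, buf, mapped, 0, NULL, NULL);

        clReleaseMemObject(buf);
    }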

OpenCL uses a relaxed-consistency memory model. There are no guarantees of memory consistency between different work-groups. Consistency for memory objects shared between enqueued commands is guaranteed at synchronization points.

Table 2.1: Memory region: allocation and memory access capabilities (adapted from [18]).

              Global                  Constant                Local                   Private
  Host        Dynamic allocation,     Dynamic allocation,     Dynamic allocation,     No allocation,
              Read/Write access       Read/Write access       No access               No access
  Kernel      No allocation,          Static allocation,      Static allocation,      Static allocation,
              Read/Write access       Read-only access        Read/Write access       Read/Write access

2.1.5 Programming Model

The OpenCL execution model supports data-parallel and task-parallel programming models. In the data-parallel programming model, a sequence of instructions is applied to multiple elements of data. The index space defined in the OpenCL execution model tells a work-item where to fetch the data for its computation. Programmers can specify either the total number of work-items together with the number of work-items that form a work-group, or only the total number of work-items, to control how data are accessed by each work-item.

The OpenCL task-parallel programming model defines a model in which a single instance of a kernel is executed independently of any index space. Users can exploit parallelism via three methods: using vector data types implemented by the device, enqueueing multiple tasks, or enqueueing native kernels developed with a programming model orthogonal to OpenCL.

Synchronization occurs in OpenCL in two situations. For work-items in a single work-group, a work-group barrier enforces consistency. For commands in the same context but enqueued in different command-queues, programmers can use command-queue barriers and/or events to perform synchronization.
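A work-group barrier can be illustrated with the following hedged OpenCL C sketch (a hypothetical kernel, not part of this thesis): each work-group stages its portion of the input in local memory and uses barrier() so that every work-item sees the staged values before reading them back.

    __kernel void reverse_in_group(__global const float *in,
                                   __global float *out,
                                   __local float *tile)
    {
        size_t lid = get_local_id(0);
        size_t lsz = get_local_size(0);
        size_t gid = get_global_id(0);

        tile[lid] = in[gid];
        barrier(CLK_LOCAL_MEM_FENCE);       /* work-group barrier */
        out[gid] = tile[lsz - 1 - lid];     /* safe: the whole tile is now visible */
    }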


Figure 2.3: Conceptual OpenCL device architecture with processing elements, compute units and devices (adapted from [18]).

2.2 System Virtual Machine

System virtualization supports multiple operating systems (OSes) executing on a single hardware platform simultaneously and shares the hardware resources among the OSes. System VMs provide benefits such as workload isolation, server consolidation, OS debugging, and dynamic load balancing. Thanks to the evolution of multi-core CPUs, system virtualization has become more and more useful.

In order to support multiple OSes (called guest OSes) in a system virtual machine (system VM) running on a single hardware platform, a hypervisor (also called a virtual machine monitor, VMM) is responsible for managing and allocating the underlying hardware resources among the guest OSes and ensures that the guest OSes do not affect one another. Resource sharing in a system VM is done in a time-sharing manner similar to the time-sharing mechanisms in an OS. When control switches from one guest to another, the hypervisor has to save the current guest system state and restore the system state of the incoming guest. The guest system state contains the program counter (PC), general-purpose registers, control registers, etc. In this work we focus on system virtualization on Intel x86 (IA-32) architectures.


Figure 2.4: Native and hosted VM systems (adapted from [23]).

Native and Hosted Virtual Machines

Traditionally, system virtual machines can be divided into three categories: native VMs, user-mode hosted VMs, and dual-mode hosted VMs, as shown in Figure 2.4 [23]. In a native VM, only the hypervisor executes in the highest privilege level defined by the system architecture. In a user-mode hosted VM, the hypervisor is constructed upon a host platform running an existing OS, called the host OS. The hypervisor can take advantage of the functionalities provided by the host OS, such as device drivers and memory management, and thus the implementation is simplified. The combination of native and hosted VMs can achieve better performance than hosted VMs while still using the features provided by the existing OS; this can often be achieved by extending the host OS with extra kernel modules or device drivers. Such a system is called a dual-mode hosted VM.

2.3 CPU Virtualization

To virtualize a CPU and share the resources of the processor, the hypervisor needs to intercept and handle the execution of special instructions of guest OSes. The types of instructions in a hardware architecture are divided into innocuous instructions and sensitive instructions [26]. Sensitive instructions are those that should be intercepted and handled by the hypervisor when they are executed by a guest OS, while innocuous instructions are all other instructions. Sensitive instructions can be further divided into control-sensitive and behavior-sensitive instructions. Control-sensitive instructions are those that provide control of resources, while behavior-sensitive instructions are those whose behavior or results depend on the configuration of resources. Control should be transferred to the hypervisor when a guest system executes sensitive instructions, to prevent it from directly accessing the resources or changing the system configuration of other guests. This mechanism is called trap and emulate.

In 1974, Popek and Goldberg defined a set of conditions sufficient for a computer architecture to support system virtualization [26]. They introduced privileged instructions, which are defined as those that trap if the machine is in user mode and do not trap if the machine is in privileged mode. In Popek and Goldberg's theorem, an effective system VM can be constructed if the set of sensitive instructions is a subset of the set of privileged instructions of a specific hardware architecture. If a hardware architecture meets this condition, the architecture can be fully virtualized. The relationship between privileged and sensitive instructions is illustrated in Figure 2.5. In the x86 architecture, there is a set of instructions, called critical instructions, that are sensitive but not privileged. For example, the POPF instruction pops the flag register from a stack held in memory, but the interrupt-enable flag is not affected because it can only be modified in privileged mode. Such instructions cannot be trapped and emulated efficiently in a system virtualization environment.

There are three methods to handle critical instructions in x86 architectures: software emulation, para-virtualization and hardware-assisted virtualization. The three methods will be discussed as follows.


Figure 2.5: Types of instructions and their relationship with respect to CPU virtualization.

Software Emulation

With software emulation, the hypervisor emulates the execution of all instructions, so it can handle the execution of critical instructions. The guest VM can run an unmodified OS, but this mechanism suffers significant performance degradation because the emulation process has high overhead. To reduce the performance impact, dynamic binary translation (DBT) is introduced to speed up hot paths and decrease the cost of emulation. QEMU [11] is an example of CPU emulation.

Para-virtualization

For efficiency, para-virtualization requires the critical instructions in guest OSes to be replaced by hypercalls, which generate a trap so that the hypervisor can receive the notification and perform suitable actions, while innocuous instructions execute as in the native environment. Para-virtualization can achieve significant performance improvements, but the requirement of modifying guest OSes is its major disadvantage. Xen [10] is an example that uses para-virtualization.

Hardware-assisted virtualization

Hardware-assisted virtualization is a hardware extension that enables efficient full virtualization with the help of hardware capabilities and allows the hypervisor to execute unmodified OSes in complete isolation. Intel and AMD proposed their x86 hardware-assisted virtualization implementations (Intel VT-x and AMD-V) in 2006. Multiple system VMs, such as KVM and the Xen hardware virtual machine (Xen HVM), have added hardware-assisted support for better performance.

Figure 2.6: The relationship among guest OSes, the hypervisor, and hardware-assisted virtualization, using Intel VT-x as an example (adapted from [3]).

The virtualization support provided by Intel and that provided by AMD are conceptually similar. Intel VT-x introduced two new execution modes, VMX root mode and VMX non-root mode, which are orthogonal to the existing x86 privileged modes; they are also known as root mode and guest mode, respectively. The hypervisor runs in root mode while guest OSes execute in guest mode, and thus guest OSes do not need to be modified. When a CPU executing in guest mode encounters a critical instruction, the CPU switches to root mode, which is called a VM exit, and passes execution to a pre-registered routine of the hypervisor, i.e., trap and emulation by hardware extension. After the emulation is processed by the hypervisor, control is switched back to the specific guest OS, which is called a VM entry. A new structure called the virtual machine control structure (VMCS), maintained by the hypervisor, records the system configuration of a specific guest OS. The relationship among guest OSes, the hypervisor, and the hardware-assisted virtualization support is shown in Figure 2.6.

Figure 2.7: Intel EPT translation details (adapted from [3]).

Intel and AMD also proposed virtualization support for the memory management unit (MMU) to accelerate the address translation from guest virtual addresses to physical addresses with much less overhead than maintaining a shadow page table in the hypervisor. Intel's MMU virtualization technique is named the extended page table (Intel EPT), and AMD's is named the nested page table (AMD NPT). When a guest OS tries to maintain its page table by accessing the x86 CR3 register, the hypervisor intercepts this action and substitutes the page table entry with the extended/nested page table. The Intel EPT translation scheme is shown in Figure 2.7.

2.4 I/O Virtualization

Virtualization of I/O devices is more difficult than virtualization of processors or memory subsystems in a system VM. The difficulty is that there are many kinds of I/O devices whose characteristics differ widely. There are two key aspects of virtualizing an I/O device: building a virtual version of the device and virtualizing the I/O activities of the device. We briefly describe these two issues below.


Figure 2.8: Major interfaces in performing an I/O action (adapted from [23]).

2.4.1 Virtualizing Devices

There are different virtualization strategies for different kinds of I/O devices. Some I/O devices, such as keyboards, mice, and speakers, must be dedicated to a specific guest VM or be switched between guest VMs over long periods; such devices are called dedicated devices. For devices such as disks, it is suitable to partition the resources among multiple guest VMs; these are called partitioned devices. Some devices, such as a network interface card (NIC), can be shared among guest VMs; such devices are called shared devices.

For these different types of devices, the hypervisor has not only to maintain the virtual state of each virtual device but also to intercept the interactions between physical and virtual devices. Requests from different guest VMs should be dispatched by the hypervisor in a fair-sharing manner. Results from I/O devices should be routed by the hypervisor, and interrupts from physical devices should first be handled by the hypervisor directly and then routed to the destination guest VM.

2.4.2 Virtualizing I/O Activities

The actions of I/O processes are divided into three levels: I/O operation level, device driver level, and system call level, which are illustrated in Figure 2.8.


Virtualizing at the I/O Operation Level

The x86 architecture provides both memory-mapped I/O (MMIO) and port-mapped I/O (PMIO) for signaling the device controller or transferring data. The hypervisor can intercept such I/O operations due to the nature of the x86 privilege levels. When a guest VM executes a PMIO instruction or accesses an MMIO space, the operation traps into the hypervisor, which then performs the corresponding emulation. Since a high-level I/O action may take several I/O operations, it is extremely difficult for the hypervisor to "reverse engineer" the individual I/O operations to infer the complete I/O action. On the other hand, too many trap-and-emulate cycles for I/O actions cause dramatic performance degradation.

Virtualizing at the Device Driver Level

System calls such as read() or write() are converted by the OS into corresponding device driver calls. If the hypervisor can intercept the invocation of these driver calls, it can directly obtain the information of the high-level I/O action of a virtual device and redirect the calls to the corresponding physical device. This scheme requires the guest VMs to execute a modified version of a device driver that is designed for a specific hypervisor and OS, and the virtual device driver actively delivers the I/O actions to the hypervisor. Although the modification of the device driver makes the guest OS aware that it is running in a virtualized environment, it can greatly reduce the overhead of virtualizing I/O actions. This approach can be regarded as an I/O para-virtualization scheme at the device driver level.

Virtualizing at the System Call Level

Virtualizing at the system call level means the hypervisor handles the entire system call requests of guest VMs. To accomplish this, however, the guest OSes must be modified to add a mechanism that transfers the requests of guest VMs, or the emulation results produced by the hypervisor, typically by adding new routines at the application binary interface (ABI) level. Compared with virtualizing at the device driver level, this scheme requires more knowledge about the internals of different guest OS kernels and is much more difficult to implement.

2.5 GPU Virtualization

To support the functionalities of OpenCL in virtual machine environments, it is important to virtualize graphics processing units (GPUs). Virtualizing a GPU poses unique challenges for several reasons. First, GPUs are extremely complicated devices. Second, the hardware specifications of GPUs are closed. Third, GPU architectures change rapidly and dramatically across generations. Because of these challenges, it is nearly intractable to virtualize a GPU that corresponds to a real modern design. Generally, there are three main approaches to virtualizing GPUs: software emulation, API Remoting, and I/O pass-through.

Software Emulation

One way to virtualize a GPU is to emulate the functionalities of GPUs and to provide a virtual device and driver for guest OSes, which is used as the interface between guest OSes and the hypervisor. The architecture of the virtual GPU can remain unchanged, and the hypervisor synthesizes host graphics operations in response to the requests from virtual GPUs. VMware has proposed VMware SVGA II, a para-virtualized solution for emulating a GPU on the hosted I/O architecture [13]. VMware SVGA II defines its own common graphics stack and provides 2D and 3D rendering with OpenGL support.

Since OpenCL supports multiple different kinds of heterogeneous devices, it is extremely complicated to emulate such devices with different hardware architectures and to provide a unified architecture for OpenCL virtualization. Thus, software emulation is not appropriate for this work.


API Remoting

Graphics APIs such as OpenGL and Direct3D, as well as heterogeneous programming APIs such as CUDA and OpenCL, are standard, common interfaces, so it is appropriate to place the virtualization layer at the API level. The API call requests from guest VMs are forwarded to the hypervisor, which performs the actual invocation. This mechanism acts like a guest VM invoking a remote procedure call (RPC) to the hypervisor.

I/O Pass-through

As described in Section 2.4.2, a GPU has its own I/O ports and MMIO spaces. I/O pass-through assigns a GPU to a dedicated guest VM so that the guest VM can access the GPU as in the native environment. The hypervisor has to handle the address mapping of PMIO or MMIO spaces between virtual and physical devices, which can be done either by software mapping or by hardware-assisted mechanisms such as Intel Virtualization Technology for Directed I/O (Intel VT-d) [16]. With hardware support, the performance of I/O pass-through has improved rapidly.

Comparisons Between API Remoting and I/O Pass-through

According to VMware's technical report [13], there are four primary criteria for assessing GPU virtualization approaches: performance, fidelity, multiplexing, and interposition. Fidelity implies consistency between the virtualized and native environments, while multiplexing and interposition imply the ability to share GPU resources. Table 2.2 summarizes the comparison between API Remoting and I/O pass-through based on these four criteria. I/O pass-through has better performance and fidelity than API Remoting because it can access the physical GPU directly and can thus adopt the device driver and runtime library used in the native environment. On the other hand, API Remoting requires a modified version of the device driver and runtime library for transferring the API call requests/responses, and the data volume of the API calls is the key source of virtualization overhead. Although I/O pass-through is considered better in the first two criteria, its fatal weakness is that it cannot share GPU resources across guest VMs, and resource sharing is an essential characteristic of system virtualization. API Remoting, however, can share GPU resources based on the concurrency available at the API level.

Table 2.2: Comparisons between API Remoting and I/O pass-through based on the four criteria.

                   API Remoting    I/O pass-through
  Performance      ∆               ∨
  Fidelity         ∆               ∨
  Multiplexing     ∨               ×
  Interposition    ∨               ×

  (∨: best supported ; ∆: supported ; ×: not supported)

In this work, we choose API Remoting for our OpenCL virtualization solution according to the comparisons summarized in Table 2.2, which not only enables OpenCL support in a virtual machine environment but also ensures resource sharing across guest VMs. The design and implementation issues are discussed in Chapter 3.

Chapter 3

System Design and Implementation

In this chapter, the software architecture and the details of the design for enabling OpenCL support in KVM are described. The virtualization framework adopted in this work is described in Section 3.1. The details of the design and implementation issues are introduced in Sections 3.2 and 3.3.

3.1 KVM Introduction

The Kernel-based Virtual Machine (KVM) is a full virtualization framework for Linux on the x86 platform that relies on hardware-assisted virtualization. The key concept of KVM is "Linux as a hypervisor": Linux is turned into a hypervisor by adding the KVM kernel module. Comprehensive system-VM functionality can be adopted from the Linux kernel, such as the scheduler, memory management, and I/O subsystems. KVM leverages hardware-assisted virtualization to ensure a pure trap-and-emulate scheme of system virtualization on x86 architectures, not only allowing the execution of unmodified OSes but also increasing the performance of CPU and MMU virtualization. The KVM kernel component has been included in mainline Linux since version 2.6.20 and has become the main virtualization solution in the Linux kernel.


Figure 3.1: KVM overview.

3.1.1 Basic Concepts of KVM

KVM is divided into two components: a KVM kernel module, which provides an abstract interface (/dev/kvm) as an entry point for accessing the functionalities of Intel VT-x or AMD-V, and a process, called the hypervisor process, that executes a guest OS and emulates I/O actions with QEMU. The hypervisor process is regarded as a normal process from the point of view of the host Linux kernel. An overview of KVM is shown in Figure 3.1.
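For reference, the following minimal user-space sketch shows how a hypervisor process typically drives /dev/kvm through the standard KVM ioctl flow. Guest memory and register setup are omitted, and the code is an illustration under these assumptions rather than KVM's or QEMU's actual source.

    #include <fcntl.h>
    #include <sys/ioctl.h>
    #include <sys/mman.h>
    #include <linux/kvm.h>

    int main(void)
    {
        int kvm  = open("/dev/kvm", O_RDWR);         /* entry point exported by the KVM module */
        int vm   = ioctl(kvm, KVM_CREATE_VM, 0);     /* one VM per hypervisor process          */
        int vcpu = ioctl(vm, KVM_CREATE_VCPU, 0);    /* one fd per vCPU (backed by a thread)   */

        /* The mmap'ed kvm_run structure reports why a vCPU left guest mode. */
        int size = ioctl(kvm, KVM_GET_VCPU_MMAP_SIZE, 0);
        struct kvm_run *run = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, vcpu, 0);

        /* A real hypervisor sets up guest memory/registers and loops here. */
        ioctl(vcpu, KVM_RUN, 0);                     /* enter guest mode until a VM exit       */
        if (run->exit_reason == KVM_EXIT_IO ||       /* PMIO access trapped to user mode       */
            run->exit_reason == KVM_EXIT_MMIO) {     /* MMIO access trapped to user mode       */
            /* QEMU-style device emulation would be performed here before re-entry. */
        }
        return 0;
    }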

Process Model

The KVM process model is illustrated in Figure 3.2. In KVM, a guest VM is executed within the hypervisor process, which provides the necessary resource virtualization for a guest OS, such as CPUs, memory spaces, and device modules. The hypervisor process contains N threads (N ≥ 1) for virtualizing CPUs, known as vCPU threads, and a dedicated thread for emulating asynchronous I/O actions, known as the I/O thread. The physical memory space of a guest OS is part of the virtual memory space of the hypervisor process.

Execution Flow in vCPU View

The execution flow of vCPU threads is illustrated in Figure 3.3 and is divided into three execution modes: guest mode, kernel mode, and user mode. With Intel VT, guest mode is mapped to VMX non-root mode, and both kernel mode and user mode are mapped to VMX root mode.

Figure 3.2: KVM process model (adapted from [20]).

A vCPU thread in guest mode executes guest instructions as in a native environment until it encounters a privileged instruction. When the vCPU thread executes a privileged instruction, control transfers to the KVM kernel module, which first maintains the VMCS of the guest VM and then decides how to handle the instruction. Only a small set of actions is processed by the kernel module, including virtual MMU management and in-kernel I/O emulation. In other cases, control transfers further to user mode, where the vCPU thread performs the corresponding I/O emulation or signal handling with QEMU. After the emulated operation completes, the context held in user space is updated, and the vCPU thread switches back to guest mode.

A control transfer from guest mode to kernel mode is called a light-weight VM exit, and one from guest mode to user mode is called a heavy-weight VM exit. The performance of I/O emulation is highly related to the number of heavy-weight VM exits, since a heavy-weight VM exit costs much more than a light-weight one.


Figure 3.3: KVM execution flow in vCPU view (adapted from [20]).

I/O Virtualization

Each virtual I/O device in KVM has a set of virtual components such as I/O ports, MMIO spaces, and device memory. Emulating I/O actions means maintaining the accesses to, and events of, these virtual components. Each virtual device registers its I/O ports and MMIO-space handler routines when the device starts. When PMIO or MMIO actions are executed in the guest OS, the vCPU thread traps from guest mode to user mode and looks up the record of allocated I/O ports or MMIO spaces to choose the corresponding I/O emulation routine. Asynchronous I/O actions, such as arriving network packets or keyboard signals, are processed with the help of the I/O thread. The I/O thread blocks waiting for new incoming I/O events and handles them by sending virtual interrupts to the target guest OS or by emulating direct memory access (DMA) between virtual devices and the main memory space of the guest OS.

3.1.2 Virtio Framework

The virtio framework [27] is an abstraction layer over para-virtualized I/O devices in a hypervisor. Virtio was developed by Rusty Russell to support his own hypervisor, called lguest, and it allows hypervisor developers to implement new para-virtualized devices by extending the common abstraction layer. The virtio framework has been included in the Linux kernel since version 2.6.30.

Figure 3.4: Virtio architecture in KVM.

Virtio conceptually abstracts an I/O device as front-end drivers, back-end drivers, and one or more virtqueues, as shown in Figure 3.4. Front-end drivers are implemented as device drivers of virtual I/O devices and use virtqueues to communicate with the hypervisor. Virtqueues can be regarded as shared memory spaces between guest OSes and the hypervisor. There is a set of functions for operating virtqueues, including adding/retrieving data to/from a virtqueue (add_buf/get_buf), generating a trap to switch control to the back-end driver (kick), and enabling/disabling call-back functions (enable_cb/disable_cb), which are the interrupt handling routines of the virtio device. The back-end driver in the hypervisor retrieves the data from the virtqueues and then performs the corresponding I/O emulation based on the data from the guest OSes.
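A simplified sketch of how a guest front-end driver might hand one request buffer to the host through these operations is shown below; it uses the 2.6.3x-era vq_ops interface listed in Figure 3.6, and the names vq and req are illustrative rather than part of this thesis's implementation.

    #include <linux/errno.h>
    #include <linux/scatterlist.h>
    #include <linux/virtio.h>

    /* Hedged sketch: enqueue one outgoing buffer and kick the back-end. */
    static int send_request(struct virtqueue *vq, void *req, unsigned int len)
    {
        struct scatterlist sg;

        sg_init_one(&sg, req, len);                      /* describe the buffer     */
        if (vq->vq_ops->add_buf(vq, &sg, 1, 0, req) < 0) /* 1 out, 0 in buffers     */
            return -ENOSPC;                              /* virtqueue is full       */
        vq->vq_ops->kick(vq);                            /* trap into the back-end  */
        return 0;
    }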

The high-level architecture of virtio in the Linux kernel is illustrated in Figure 3.5. The virtqueue and its transport are implemented in virtio.c and virtio_ring.c, and there is a series of virtio devices such as virtio-blk, virtio-net, virtio-pci, etc. The object hierarchy of the virtio front-end is shown in Figure 3.6, which illustrates the fields and methods of each virtio object and the relationships between them. A virtqueue object contains a description of the available operations, a pointer to the call-back function, and a pointer to the virtio device that owns this virtqueue. A virtio device object contains the fields used to describe its features and a pointer to a virtio_config_ops object, which describes the operations that configure the device. In the device initialization phase, the virtio driver invokes the probe method to set up and instantiate a new virtio device.

Figure 3.5: High-level architecture of virtio (adapted from [17]).

struct virtio_driver {
    struct device_driver driver;
    const struct virtio_device_id *id_table;
    const unsigned int *feature_table;
    unsigned int feature_table_size;
    int (*probe)(struct virtio_device *dev);
    void (*remove)(struct virtio_device *dev);
    void (*config_changed)(struct virtio_device *dev);
};

struct virtio_device {
    int index;
    struct device dev;
    struct virtio_device_id id;
    struct virtio_config_ops *config;
    unsigned long features[1];
    void *priv;
};

struct virtqueue {
    void (*callback)(struct virtqueue *vq);
    struct virtio_device *vdev;
    struct virtqueue_ops *vq_ops;
    void *priv;
};

struct virtqueue_ops {
    int (*add_buf)(struct virtqueue *vq, struct scatterlist sg[],
                   unsigned int out_num, unsigned int in_num, void *data);
    void (*kick)(struct virtqueue *vq);
    void *(*get_buf)(struct virtqueue *vq, unsigned int *len);
    void (*disable_cb)(struct virtqueue *vq);
    void (*enable_cb)(struct virtqueue *vq);
};

struct virtio_config_ops {
    void (*get)(struct virtio_device *vdev, unsigned offset, void *buf, unsigned len);
    void (*set)(struct virtio_device *vdev, unsigned offset, void *buf, unsigned len);
    u8 (*get_status)(struct virtio_device *vdev);
    void (*set_status)(struct virtio_device *vdev, u8 status);
    void (*reset)(struct virtio_device *vdev);
    struct virtqueue *(*find_vq)(struct virtio_device *vdev, unsigned index,
                                 void (*callback)(struct virtqueue *vq));
    void (*del_vq)(struct virtio_device *vdev);
    u32 (*get_features)(struct virtio_device *vdev);
    void (*finalize_features)(struct virtio_device *vdev);
};

Figure 3.6: Object hierarchy of the virtio front-end (adapted from [17]).

In this work, we implement our API Remoting mechanism on top of the virtio framework to perform the data communication for OpenCL virtualization. The design and implementation details are discussed in the following sections.

3.2 OpenCL API Remoting in KVM

In this section, the framework of OpenCL API Remoting in KVM is described, including the software architecture, the execution flow, and the relationships among the software components.

3.2.1 Software Architecture

Figure 3.7 presents the architecture of OpenCL API Remoting in this work, which includes an OpenCL library specific to guest OSes, a virtual device called Virtio-CL, and a thread called the CL thread. The functionalities of each component are described as follows:

• Guest OpenCL library

The guest OpenCL library is responsible for packing the OpenCL requests of user applications in the guest and unpacking the results from the hypervisor. In our current implementation, the guest OpenCL library is designed as a wrapper library and performs basic verifications according to the OpenCL specification, such as null-pointer or integer-value-range checking.

• Virtio-CL device

The Virtio-CL virtual device is responsible for data communication between the guest OS and the hypervisor. The main components of Virtio-CL are two virtqueues: one for data transmission from the guest OS to the hypervisor and the other for the opposite direction. The Virtio-CL device can be further divided into a front-end (residing in the guest OS) and a back-end (residing in the hypervisor). The guest OS accesses the Virtio-CL device through the front-end driver and writes/reads OpenCL API requests/responses via the virtqueues using the corresponding driver calls. The Virtio-CL back-end driver accepts the requests from the guest OS and passes them to the CL thread, which performs the actual invocation of the OpenCL API calls. The virtqueues can be regarded as shared memory spaces, which can be modeled as device memory of Virtio-CL from the point of view of the guest OS.

Figure 3.7: Architecture of OpenCL API Remoting in KVM.

• CL thread

The CL thread is dedicated to accessing vendor-specific OpenCL runtimes in user mode. The CL thread reconstructs the requests, performs the actual invocation of the OpenCL API calls, and then passes the results back to the guest OS via the virtqueue used for response transmission. Since the processing time of each OpenCL API call is different, it is appropriate to handle OpenCL requests in an individual thread instead of extending the functionalities of the existing I/O thread in the hypervisor process, so that the execution of OpenCL APIs is independent of the functionalities of the I/O thread.

An alternative to creating a CL thread to process OpenCL requests is to implement the actual invocation of OpenCL API calls in the handler of the vCPU thread and to configure multiple vCPUs for each guest VM. However, both approaches require a buffer (CLRequestQueue, as shown in Figure 3.8) to store segmented OpenCL requests in case the size of an OpenCL request is larger than the size of the virtqueue for requests (VQ_REQ in Figure 3.8). In addition, the latter approach has to handle the synchronization of the virtqueue for responses (VQ_RESP, as shown in Figure 3.8) between different vCPU threads, while the former approach handles the synchronization of CLRequestQueue between the vCPU thread and the CL thread.

As shown in Figure 3.7, the architecture of our OpenCL virtualization framework can be modeled as multiple processes accessing the OpenCL resources in the native environment. The behavior of the execution depends on the implementation of the vendor-specific OpenCL runtime.

3.2.2 Execution Flow

Figure 3.8 illustrates the execution flow of OpenCL API calls, and the processing steps are as follows:

1. A process running in a guest OS invokes an OpenCL API function.

2. The guest OpenCL library (libOpenCL.a) first performs basic verification of the parameters according to the OpenCL API specification. If the verification fails, it returns the corresponding error code to the user-space process.

3. After the parameter verification, the guest OpenCL library sets up the data to be transferred and executes the fsync() system call.

4. The fsync() system call adds the data to the virtqueue for requests (VQ_REQ) and then invokes the kick() method of VQ_REQ to generate a VM exit. Control of the current vCPU thread is transferred to the corresponding VM-exit handler routine in user mode.


Figure 3.8: Execution flow of OpenCL API Remoting.

5. In user mode, the handler routine copies the OpenCL API request to CLRequestQueue, which is a shared memory space between the vCPU threads and the CL thread. After the copy completes, control is transferred back to the guest OpenCL library. If the data size of the request is larger than the size of VQ_REQ, the request is divided into segments, and execution jumps back to step 3, repeating steps 3-5 until all the segments are transferred. Once all data segments of the OpenCL request have been transferred, the handler signals the CL thread that there is an incoming OpenCL API request, and the CL thread starts processing it (see step 7).

6. After the request is passed to the hypervisor, the guest OpenCL library invokes the read() system call, which blocks waiting for the result data and returns after all the result data have been transferred.

7. The CL thread waits on a blocking queue until receiving the signal from the handler routine of VM exit in step 5. The CL thread then unpacks the request and performs the actual invocation.

8. After the completion of the OpenCL API invocation, the CL thread packs the result data, copies them to the virtqueue for responses (VQ_RESP), and then notifies the guest OS by sending a virtual interrupt from the Virtio-CL device.

9. Once the guest OS receives the virtual interrupt from the Virtio-CL device, the corresponding interrupt service routine (ISR) wakes up the process waiting for the response data of the OpenCL API call, and the result data are copied from VQ_RESP to user-space memory. Steps 8 and 9 repeat until all of the result data are transferred.

10. Once the read() system call returns, the guest OpenCL library can unpack and rebuild the return value and/or the side effects on parameters. The execution of the OpenCL API function is then complete.
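To make steps 1-6 and 10 concrete, the following is a hedged sketch of how a guest-side wrapper for one OpenCL API function might look. It is not the actual Virtio-CL source: cl_fd (the descriptor of the opened Virtio-CL device), pack_request(), unpack_response(), and the API identifier are illustrative assumptions provided by the guest library.

    #include <unistd.h>
    #include <CL/cl.h>

    extern int    cl_fd;
    extern void   pack_request(int api_id, const void *args, size_t args_len);
    extern cl_int unpack_response(const void *resp, void *out_params);

    cl_int clGetDeviceIDs(cl_platform_id platform, cl_device_type type,
                          cl_uint num_entries, cl_device_id *devices,
                          cl_uint *num_devices)
    {
        if (num_entries == 0 && devices != NULL)      /* step 2: basic checks    */
            return CL_INVALID_VALUE;

        struct { cl_platform_id p; cl_device_type t; cl_uint n; }
            args = { platform, type, num_entries };
        pack_request(/* hypothetical API id */ 1, &args, sizeof(args));

        fsync(cl_fd);                                 /* steps 3-5: kick VQ_REQ  */

        char resp[4096];
        read(cl_fd, resp, sizeof(resp));              /* step 6: blocking read   */

        struct { cl_device_id *devs; cl_uint *num; } outs = { devices, num_devices };
        return unpack_response(resp, &outs);          /* step 10: unpack results */
    }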

3.2.3 Implementation Details

In this section, we present the implementation details of the proposed virtualization framework, including the guest OpenCL library, the device driver, data transmissions, and synchronization points.

Guest OpenCL Library and Device Driver

The device driver of Virtio-CL implements the open(), close(), mmap(), fsync(), and read() system calls. mmap() and fsync() are used for transferring OpenCL requests, and the read() system call is used for retrieving the response data. The guest OpenCL library uses these system calls to communicate with the hypervisor.
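A hedged sketch of how these entry points are typically wired together in a Linux character-device driver is shown below. The handler names are illustrative (cldev_read also appears in Figure 3.11), and the prototypes follow the 2.6.32-era kernel interface rather than anything documented by the thesis.

    #include <linux/fs.h>
    #include <linux/module.h>

    /* Forward declarations of the driver entry points (illustrative names). */
    static int cldev_open(struct inode *inode, struct file *filp);
    static int cldev_close(struct inode *inode, struct file *filp);
    static int cldev_mmap(struct file *filp, struct vm_area_struct *vma);
    static int cldev_fsync(struct file *filp, struct dentry *dentry, int datasync);
    static ssize_t cldev_read(struct file *filp, char __user *buf,
                              size_t count, loff_t *f_pos);

    static const struct file_operations cldev_fops = {
        .owner   = THIS_MODULE,
        .open    = cldev_open,
        .release = cldev_close,
        .mmap    = cldev_mmap,
        .fsync   = cldev_fsync,
        .read    = cldev_read,
    };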

Before a user process starts using OpenCL resources, the process has to explicitly invoke a specific function, clEnvInit(), to perform resource initialization such as opening the Virtio-CL device and preparing the memory storage for data transmissions. Another function, clEnvExit(), has to be invoked to release these resources.


pthread_mutex_t clReqMutex = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t  clReqCond  = PTHREAD_COND_INITIALIZER;
...
pthread_mutex_lock( &clReqMutex );
while( /* OpenCL request is not ready */ ) {
    /* block until the VM-exit handler signals that a request is complete */
    pthread_cond_wait( &clReqCond, &clReqMutex );
}
/* pop a request node from CLRequestQueue */
...
pthread_mutex_unlock( &clReqMutex );

Figure 3.9: 1st synchronization point (CL thread).

pthread_mutex_lock( &clReqMutex );
/* Copy the request data segment to CLRequestQueue */
if( /* OpenCL request is ready */ ) {
    pthread_cond_signal( &clReqCond );
}
pthread_mutex_unlock( &clReqMutex );
...

Figure 3.10: 1stsynchronization point (VM exit handler).

As a result, OpenCL programs to be executed in the virtualized platform have to be revised with minor changes: adding clEnvInit() and clEnvExit() at the beginning and the end of an OpenCL program, respectively. Nevertheless, the revisions can be done automatically by a simple OpenCL parser. An alternative approach that avoids such revisions is possible, but it involves a more complex virtualization architecture and results in much more overhead. Further discussion of the alternative implementation is given in Section 3.3.5.
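For example, a revised OpenCL host program looks roughly like the following; the exact prototypes of clEnvInit() and clEnvExit() (taking no arguments here) are an assumption, and the body is ordinary OpenCL host code.

#include <CL/cl.h>

/* provided by the guest OpenCL library of our framework (prototypes assumed) */
extern int clEnvInit(void);
extern int clEnvExit(void);

int main(void)
{
    cl_platform_id platform;
    cl_device_id   device;

    clEnvInit();                       /* open the Virtio-CL device, set up buffers */

    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);
    /* ... create context, build kernels, enqueue commands as usual ... */

    clEnvExit();                       /* release the remoting resources            */
    return 0;
}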

Synchronization Points

There are two synchronization points in the execution flow: one occurs in the CL thread, which waits for a complete OpenCL request, and the other occurs in the implementation of the read() system call in the guest device driver.


ssize_t cldev_read( struct file *filp, char __user *buf,
                    size_t count, loff_t *f_pos )
{
    do {
        /* put the current process into the waiting queue; it stays
           blocked until a response data segment is ready in VQ_RESP */
        schedule();
        /* copy the response data segment to the user-space buffer */
        copy_to_user( buf, kernel_buffer, size_data_seg );
        /* notify the hypervisor that VQ_RESP is available again */
    } while( /* transmission not completed */ );
}

Figure 3.11: 2nd synchronization point (read() system call).

The 1st synchronization point is handled by pthread mutexes and condition variables. When the data segments of an OpenCL request have not been completely transmitted to CLRequestQueue, the CL thread invokes pthread_cond_wait() to wait for the signal from the handler routine of VQ_REQ. The pseudo code of the 1st synchronization point is described in Figures 3.9 and 3.10. The 2nd synchronization point works as follows. The read() system call first puts the calling process into a waiting queue in the kernel space so that it is blocked. The blocked process resumes execution and retrieves the response data segment from VQ_RESP after the virtual interrupt of Virtio-CL is raised. The pseudo code of the 2nd synchronization point is described in Figure 3.11.

The two synchronization mechanisms not only ensure data correctness during the transmissions but also allow the vCPU resources to be used by other processes in the guest OS while waiting.

Data Transmissions

In Figure 3.8, the green and orange arrows represent the paths of data transmissions. The green arrows indicate the data copies among the guest user space, the guest kernel space, and the host user space. Since the data transmission of OpenCL requests is initiated actively by the guest OpenCL library, the data copy from the guest user space to the guest kernel space can be bypassed by preparing a memory-mapped space (indicated by the orange arrow). The data copy cannot be eliminated in the opposite direction because the guest OpenCL library waits passively for results from the hypervisor. Thus, there are two data copies for a request and three copies for a response.
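The memory-mapped space that removes the request-side copy can be provided by the driver's mmap handler. The sketch below assumes a physically contiguous kernel buffer (req_buffer) backing VQ_REQ and uses the standard remap_pfn_range() pattern; the actual implementation may map the buffer differently.

#include <linux/fs.h>
#include <linux/mm.h>
#include <asm/io.h>

#define VQ_REQ_SIZE (256 * 1024)

extern void *req_buffer;   /* kernel buffer backing VQ_REQ (assumed contiguous) */

static int cldev_mmap(struct file *filp, struct vm_area_struct *vma)
{
    unsigned long size = vma->vm_end - vma->vm_start;
    unsigned long pfn  = virt_to_phys(req_buffer) >> PAGE_SHIFT;

    if (size > VQ_REQ_SIZE)
        return -EINVAL;

    /* expose the request buffer to the guest OpenCL library so that a
     * packed request is written once, directly into the shared buffer  */
    return remap_pfn_range(vma, vma->vm_start, pfn, size, vma->vm_page_prot);
}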

3.3 Related Issues of Implementation

In this section, we discuss issues related to the design and implementation of supporting OpenCL in KVM, including the size of the virtqueues, signal handling, runtime data structures, and OpenCL memory objects.

3.3.1 Size of Virtqueues

Virtqueues are shared memory spaces between guest OSes and the hypervisor, and a series of address translation mechanisms allows both the guest OS and the QEMU part to access data in the virtqueues. On the one hand, the size of the virtqueues directly affects the virtualization overhead because larger virtqueues result in fewer of the heavyweight VM exits. On the other hand, since the total virtual memory space of the hypervisor process is limited (4 Gigabytes in a 32-bit host Linux), the size of the virtqueues is also limited.

In our framework, the sizes of VQ_REQ and VQ_RESP are both 256 Kilobytes (64 pages), a choice guided by the configurations of the existing Virtio devices virtio-blk and virtio-net: virtio-blk has one virtqueue of 512 Kilobytes (128 pages), and virtio-net has two virtqueues of 1024 Kilobytes (256 pages) each. A request (or a response) whose data size exceeds the virtqueue size is partitioned into multiple 256-Kilobyte segments and then transferred sequentially.
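Regardless of the direction, transferring a payload larger than a virtqueue therefore reduces to a simple segmentation loop, sketched below in a self-contained form; the notify callback stands in for the kick or virtual-interrupt step.

#include <stddef.h>
#include <string.h>

#define VQ_SIZE (256 * 1024)   /* 64 pages of 4 Kilobytes each */

/* Copy `len` bytes of payload through a virtqueue buffer of VQ_SIZE bytes,
 * calling notify() once per segment to hand it over to the other side.    */
static void transfer_segmented(const char *payload, size_t len,
                               char *vq_buf, void (*notify)(size_t seg_len))
{
    size_t off = 0;
    while (off < len) {
        size_t seg = (len - off < VQ_SIZE) ? len - off : VQ_SIZE;
        memcpy(vq_buf, payload + off, seg);
        notify(seg);           /* e.g., kick() or a virtual interrupt */
        off += seg;
    }
}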


3.3.2 Signal Handling

A user OpenCL process may suffer a segmentation fault while calling OpenCL APIs. In a native environment, such a fault simply terminates the process. In our virtualization framework, however, the situation must be handled carefully by the CL thread, or the whole hypervisor process will crash. The CL thread therefore installs handler routines for signals such as SIGSEGV to recover from the fault and return the corresponding error messages to the guest OpenCL library.
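One common way to implement such recovery, assuming the CL thread funnels every invocation through a wrapper, is to combine sigaction() with sigsetjmp()/siglongjmp(), as sketched below; the wrapper name and error convention are hypothetical and only illustrate the idea.

#include <setjmp.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>

static sigjmp_buf cl_fault_env;

static void cl_sigsegv_handler(int sig)
{
    /* jump back to the invocation site instead of crashing QEMU */
    siglongjmp(cl_fault_env, 1);
}

/* returns 0 on success, -1 if the OpenCL call faulted */
static int cl_safe_invoke(void (*invoke)(void *), void *args)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = cl_sigsegv_handler;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGSEGV, &sa, NULL);

    if (sigsetjmp(cl_fault_env, 1) != 0)
        return -1;              /* the API call faulted; report an error */

    invoke(args);               /* perform the actual OpenCL invocation */
    return 0;
}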

3.3.3 Data Structures Related to Runtime Implementation

Figure 3.12 lists the data structures that are related to the runtime implementation of OpenCL. The data structures are maintained by the vendor-specific OpenCL runtime, and users can access them only through the corresponding OpenCL functions. For example, when a process invokes clGetDeviceIDs() with a pointer to an array of cl_device_id as a parameter, the function fills each entry of the array with a pointer to the object of an OpenCL device. In our framework, the CL thread accesses the actual OpenCL runtime in user mode and consumes the parameters provided by the guest process. However, the CL thread cannot directly access the array since it resides in the virtual address space of the guest process. Thus, we have to construct a "shadow mapping" of the guest array: the CL thread allocates a "shadow array" and maintains a mapping table between the array in the guest and the shadow array. Figure 3.13 illustrates an example of the shadow mapping mechanism. When a guest process invokes clGetDeviceIDs(), the CL thread allocates a "shadow array" and creates a new entry in the mapping table. The entry records the pointer type, the process identifier (PID) of the guest process, and the mapping between the host address and the guest address. The CL thread then performs the actual OpenCL function invocation with the shadow array and, after the invocation completes, transfers all the contents of the shadow array to the guest process to ensure that the array of cl_device_id objects in the guest process contains the same contents as the shadow array.


/* /usr/local/cuda/include/CL/cl.h (NVIDIA OpenCL SDK) */
typedef struct _cl_platform_id *   cl_platform_id;
typedef struct _cl_device_id *     cl_device_id;
typedef struct _cl_context *       cl_context;
typedef struct _cl_command_queue * cl_command_queue;
typedef struct _cl_mem *           cl_mem;
typedef struct _cl_program *       cl_program;
typedef struct _cl_kernel *        cl_kernel;
typedef struct _cl_event *         cl_event;
typedef struct _cl_sampler *       cl_sampler;

Figure 3.12: The data structures related to the OpenCL runtime implementation.

[Figure: inside the KVM host, the QEMU process holds the vendor-specific OpenCL runtime and the mapping table; an array of cl_device_id in guest process P1 (PID = 1680) at guest address 0x1234 is mapped to a shadow array at host address 0x3000, and the mapping table entry records Type = CL_DEVICE_ID_PTR, PID = 1680, Host = 0x3000, Guest = 0x1234.]

Figure 3.13: Shadow mapping mechanism.

When the guest process invokes an OpenCL function that uses a cl_device_id object, the value of the cl_device_id object can be used directly because it is a pointer into the host address space. When the guest process invokes an OpenCL function that uses an array of cl_device_id objects, the CL thread looks up the mapping table to find the address of the shadow array for the actual function invocation. After the guest process invokes clEnvExit(), the CL thread deletes the entries related to the process and releases the corresponding shadow arrays.
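A possible layout for such a mapping table, mirroring the fields shown in Figure 3.13 (pointer type, guest PID, host address, and guest address), is sketched below; the structure and lookup helper are illustrative rather than the actual implementation.

#include <stdint.h>
#include <stddef.h>

typedef enum {
    CL_DEVICE_ID_PTR,
    CL_PLATFORM_ID_PTR,
    /* ... one tag per OpenCL runtime pointer type ... */
} shadow_type_t;

typedef struct shadow_entry {
    shadow_type_t        type;       /* e.g., CL_DEVICE_ID_PTR      */
    int                  guest_pid;  /* PID of the guest process    */
    void                *host_addr;  /* address of the shadow array */
    uint64_t             guest_addr; /* address inside the guest    */
    struct shadow_entry *next;
} shadow_entry_t;

static shadow_entry_t *mapping_table;

/* find the shadow array for a guest pointer used by a given process */
static void *lookup_shadow(int guest_pid, uint64_t guest_addr)
{
    for (shadow_entry_t *e = mapping_table; e != NULL; e = e->next)
        if (e->guest_pid == guest_pid && e->guest_addr == guest_addr)
            return e->host_addr;
    return NULL;
}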


cl_mem clCreateBuffer( cl_context   context,
                       cl_mem_flags flags,
                       size_t       size,
                       void *       host_ptr,
                       cl_int *     errcode_ret );

Figure 3.14: Prototype of clCreateBuffer().

[Figure: a guest process P1 calls cl_mem obj = clCreateBuffer( context, CL_MEM_USE_HOST_PTR, 1000000, host_ptr, &status ); host_ptr lives in the guest while the shadow memory space lives in the QEMU process, and the current implementation cannot handle the resulting memory coherence problem.]

Figure 3.15: Memory coherence problem in clCreateBuffer().

3.3.4 OpenCL Memory Objects

An OpenCL memory object (cl_mem) is an abstraction of global device memory that serves as a data container for computation. A process can use clCreateBuffer() to create a memory object that can be regarded as a one-dimensional buffer. Figure 3.14 shows the prototype of clCreateBuffer(), where context is a valid OpenCL context associated with the memory object, flags specifies allocation and usage information (Table 3.1 describes the possible values for flags), size indicates the size of the memory object in bytes, and host_ptr is a pointer to the buffer data that may already be allocated by the application.
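For reference, the following is an ordinary clCreateBuffer() call that copies the host data into the memory object at creation time with CL_MEM_COPY_HOST_PTR, so that the OpenCL specification requires no ongoing coherence with host_ptr afterwards; the helper function is hypothetical.

#include <CL/cl.h>

cl_mem create_input_buffer(cl_context context, const float *data, size_t n)
{
    cl_int err;
    /* CL_MEM_COPY_HOST_PTR copies `data` into the new memory object,
       so the host pointer is not referenced after creation            */
    cl_mem buf = clCreateBuffer(context,
                                CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                n * sizeof(float), (void *)data, &err);
    return (err == CL_SUCCESS) ? buf : NULL;
}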
