
System Design and Implementation

In this chapter, the software architecture and design details for enabling OpenCL support in KVM are described. The virtualization framework adopted in this work is described in Section 3.1. The details of the design and implementation issues are introduced in Sections 3.2 and 3.3.

3.1 KVM Introduction

Kernel-based Virtual Machine (KVM) is a full virtualization framework for Linux on the x86 platform built upon hardware-assisted virtualization. The key concept of KVM is “Linux as a hypervisor”, that is, Linux is turned into a hypervisor by adding the KVM kernel module. Comprehensive system-VM functionalities can be adapted from the Linux kernel, such as the scheduler, memory management, and the I/O subsystems. KVM leverages hardware-assisted virtualization to ensure a pure trap-and-emulate scheme of system virtualization on the x86 architecture, not only allowing the execution of unmodified OSes but also increasing the performance of virtualizing CPUs and the MMU. The KVM kernel component has been included in mainline Linux since version 2.6.20 and has become the main virtualization solution in the Linux kernel.




Figure 3.1: KVM overview.

3.1.1 Basic Concepts of KVM

KVM is divided into two components: a KVM kernel module, which provides an abstract interface (/dev/kvm) as an entry point for accessing the functionalities of Intel VT-x or AMD-V, and a process called the hypervisor process, which executes a guest OS and emulates I/O actions via QEMU. The hypervisor process is regarded as a normal process from the point of view of the host Linux kernel. An overview of KVM is shown in Figure 3.1.

Process Model

The KVM process model is illustrated in Figure 3.2. In KVM, a guest VM is executed within the hypervisor process, which provides the necessary resource virtualization for the guest OS, such as CPUs, memory spaces and device modules. The hypervisor process contains N threads (N ≥ 1) for virtualizing CPUs, known as vCPU threads, and a dedicated I/O thread for emulating asynchronous I/O actions. The physical memory space of the guest OS is part of the virtual memory space of the hypervisor process.

Execution Flow in vCPU View

The execution flow of vCPU threads is illustrated in Figure 3.3 and is divided into three execution modes: guest mode, kernel mode and user mode.


Figure 3.2: KVM process model (adapted from [20]).

In Intel VT, guest mode is mapped onto VMX non-root mode, while both kernel mode and user mode are mapped onto VMX root mode.

A vCPU thread in guest mode executes guest instructions as in a native environment until it encounters a privileged instruction. When the vCPU thread executes a privileged instruction, control transfers to the KVM kernel module. The KVM kernel module first maintains the VMCS of the guest VM and then decides how to handle the instruction. Only a small number of actions are processed by the kernel module, including virtual MMU management and in-kernel I/O emulation. In other cases, control further transfers to user mode, where the vCPU thread performs the corresponding I/O emulation or signal handling through QEMU. After the emulated operation completes, the context held in user space is updated, and the vCPU thread switches back to guest mode.

The control transfer from guest mode to kernel mode is called a light-weight VM exit, and that from guest mode to user mode is called a heavy-weight VM exit. The performance of I/O emulation is highly related to the number of heavy-weight VM exits, since the cost of a heavy-weight VM exit is much higher than that of a light-weight VM exit.
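To make the notion of a heavy-weight VM exit concrete, the following sketch shows the user-mode side of a vCPU thread built directly on the KVM ioctl interface. It is an illustration rather than QEMU's actual code: guest memory setup, register initialization and error handling are omitted.

#include <linux/kvm.h>
#include <sys/ioctl.h>
#include <sys/mman.h>

/* User-mode side of a vCPU thread (sketch): each return from KVM_RUN is a
 * heavy-weight VM exit that must be handled by the hypervisor process. */
int run_vcpu(int kvm_fd, int vm_fd)
{
    int vcpu_fd = ioctl(vm_fd, KVM_CREATE_VCPU, 0);
    long map_size = ioctl(kvm_fd, KVM_GET_VCPU_MMAP_SIZE, 0);
    struct kvm_run *run = mmap(NULL, map_size, PROT_READ | PROT_WRITE,
                               MAP_SHARED, vcpu_fd, 0);

    for (;;) {
        ioctl(vcpu_fd, KVM_RUN, 0);      /* enter guest mode */
        switch (run->exit_reason) {      /* reason recorded by the KVM module */
        case KVM_EXIT_IO:                /* PMIO access from the guest */
        case KVM_EXIT_MMIO:              /* MMIO access from the guest */
            /* dispatch to the registered device emulation routine ... */
            break;
        default:
            return -1;                   /* exits not handled in this sketch */
        }
    }
}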



Figure 3.3: KVM execution flow in vCPU view (adapted from [20]).

I/O Virtualization

Each virtual I/O device in KVM has a set of virtual components such as I/O ports, MMIO spaces and device memory. Emulating I/O actions means maintaining the accesses to, and events of, these virtual components. Each virtual device registers its I/O-port and MMIO-space handler routines when the device starts. When PMIO or MMIO actions are executed in the guest OS, the vCPU thread traps from guest mode to user mode and looks up the record of allocated I/O ports or MMIO spaces to choose the corresponding I/O emulation routine. Asynchronous I/O actions, such as the arrival of network packets or keyboard events, are processed with the help of the I/O thread. The I/O thread blocks waiting for new incoming I/O events and handles them by sending virtual interrupts to the target guest OS or by emulating direct memory access (DMA) between virtual devices and the main memory space of the guest OS.

3.1.2 Virtio Framework

The virtio framework [27] is an abstraction layer over para-virtualized I/O devices in a hypervisor. Virtio was developed by Rusty Russell to support his own hypervisor, lguest, and was adopted by KVM as its I/O para-virtualization solution.


Figure 3.4: Virtio architecture in KVM.

With virtio, it is easy to implement new para-virtualized devices by extending the common abstraction layer. The virtio framework has been included in the Linux kernel since version 2.6.30.

Virtio conceptually abstracts an I/O device into front-end drivers, back-end drivers and one or more virtqueues, as shown in Figure 3.4. Front-end drivers are implemented as device drivers of the virtual I/O devices and use virtqueues to communicate with the hypervisor.

Virtqueues can be regarded as shared memory spaces between guest OSes and the hypervisor. There is a set of functions for operating virtqueues, including adding/retrieving data to/from a virtqueue (add_buf/get_buf), generating a trap to switch control to the back-end driver (kick), and enabling/disabling call-back functions (enable_cb/disable_cb), which serve as the interrupt handling routines of the virtio device. The back-end driver in the hypervisor retrieves the data from the virtqueues and then performs the corresponding I/O emulation based on the data from the guest OS.
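As an illustration of this interface, the following sketch shows how a front-end driver might exchange one request/response pair over two virtqueues using the operations listed in Figure 3.6. The function and variable names are ours, the virtqueue pointers are assumed to have been obtained during device probing, and the busy-wait stands in for the sleep on the response call-back that a real driver would use.

#include <linux/errno.h>
#include <linux/scatterlist.h>
#include <linux/virtio.h>

/* Sketch: send one request and wait for one response over two virtqueues.
 * vq_req/vq_resp are assumed to come from the probe routine. */
static int virtcl_exchange(struct virtqueue *vq_req, struct virtqueue *vq_resp,
                           void *req, unsigned int req_len,
                           void *resp, unsigned int resp_len)
{
    struct scatterlist sg;
    unsigned int used_len;

    /* post an "in" buffer first so the back-end has somewhere to write */
    sg_init_one(&sg, resp, resp_len);
    if (vq_resp->vq_ops->add_buf(vq_resp, &sg, 0, 1, resp) < 0)
        return -ENOSPC;

    /* post the request as an "out" buffer and trap to the back-end */
    sg_init_one(&sg, req, req_len);
    if (vq_req->vq_ops->add_buf(vq_req, &sg, 1, 0, req) < 0)
        return -ENOSPC;
    vq_req->vq_ops->kick(vq_req);

    /* busy-wait for brevity; the real driver sleeps on the call-back */
    while (vq_resp->vq_ops->get_buf(vq_resp, &used_len) == NULL)
        ;

    return used_len;
}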

The high-level architecture of virtio in the Linux kernel is illustrated in Figure 3.5. The virtqueue and its transport are implemented in virtio.c and virtio_ring.c, and there is a series of virtio devices such as virtio-blk, virtio-net, virtio-pci, etc. The object hierarchy of the virtio front-end is shown in Figure 3.6, which illustrates the fields and methods of each virtio object and the relationships between them.



Figure 3.5: High-level architecture of virtio (adapted from [17]).

struct virtio_driver {
        int (*probe)(struct virtio_device *dev);
        void (*remove)(struct virtio_device *dev);
        void (*config_changed)(struct virtio_device *dev);
};

struct virtio_device {
        /* identity and feature-description fields (see text) ... */
        struct virtio_config_ops *config;
        void *priv;
};

struct virtqueue {
        void (*callback)(struct virtqueue *vq);
        struct virtio_device *vdev;
        struct virtqueue_ops *vq_ops;
        void *priv;
};

struct virtqueue_ops {
        int (*add_buf)(struct virtqueue *vq, struct scatterlist sg[],
                       unsigned int out_num, unsigned int in_num, void *data);
        void (*kick)(struct virtqueue *vq);
        void *(*get_buf)(struct virtqueue *vq, unsigned int *len);
        void (*disable_cb)(struct virtqueue *vq);
        void (*enable_cb)(struct virtqueue *vq);
};

struct virtio_config_ops {
        void (*get)(struct virtio_device *vdev, unsigned offset, void *buf, unsigned len);
        void (*set)(struct virtio_device *vdev, unsigned offset, void *buf, unsigned len);
        u8 (*get_status)(struct virtio_device *vdev);
        void (*set_status)(struct virtio_device *vdev, u8 status);
        void (*reset)(struct virtio_device *vdev);
        struct virtqueue *(*find_vq)(struct virtio_device *vdev, unsigned index,
                                     void (*callback)(struct virtqueue *vq));
        void (*del_vq)(struct virtio_device *vdev);
        u32 (*get_features)(struct virtio_device *vdev);
        void (*finalize_features)(struct virtio_device *vdev);
};

Figure 3.6: Object hierarchy of the virtio front-end (adapted from [17]).

A virtqueue object contains the description of the available operations, a pointer to its call-back function, and a pointer to the virtio device that owns it. A virtio device object contains the fields used to describe its features and a pointer to a virtio_config_ops object, which describes the operations that configure the device. In the device initialization phase, the virtio driver invokes the probe method to set up and instantiate the virtio device.
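The following sketch illustrates such a probe routine for a two-virtqueue device like Virtio-CL. The virtcl_* names are hypothetical; the calls follow the find_vq operation listed in Figure 3.6, and fields such as the driver name and id table are omitted.

#include <linux/err.h>
#include <linux/errno.h>
#include <linux/virtio.h>
#include <linux/virtio_config.h>

static struct virtqueue *vq_req, *vq_resp;

/* call-back raised by the virtual interrupt: wake up the blocked reader */
static void virtcl_resp_cb(struct virtqueue *vq)
{
    /* ... */
}

static int virtcl_probe(struct virtio_device *vdev)
{
    /* one virtqueue per direction, as in the Virtio-CL design */
    vq_req  = vdev->config->find_vq(vdev, 0, NULL);
    vq_resp = vdev->config->find_vq(vdev, 1, virtcl_resp_cb);
    if (IS_ERR(vq_req) || IS_ERR(vq_resp))
        return -ENODEV;
    return 0;
}

static struct virtio_driver virtcl_driver = {
    .probe = virtcl_probe,
    /* .driver.name, .id_table, .remove, ... omitted */
};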

In this work, we implement our API Remoting mechanism on top of the virtio framework to perform the data communication for OpenCL virtualization. The design and implementation details are discussed in the following sections.

3.2 OpenCL API Remoting in KVM

In this section, the framework of OpenCL API Remoting in KVM is described, including the software architecture, the execution flow and the relationships among the software components.

3.2.1 Software Architecture

Figure 3.7 presents the architecture of OpenCL API Remoting in this work, which includes an OpenCL library specific to guest OSes, a virtual device called Virtio-CL and a thread called the CL thread. The functionalities of each component are described as follows:

• Guest OpenCL library

The guest OpenCL library is responsible for packing the OpenCL requests of user applications in the guest and unpacking the results from the hypervisor. In our current implementation, the guest OpenCL library is designed as a wrapper library and performs basic verifications according to the OpenCL specification, such as null-pointer or integer-range checking.

• Virtio-CL device

The Virtio-CL virtual device is responsible for data communication between the guest OS and the hypervisor. The main components of Virtio-CL are two virtqueues: one for data transmission from the guest OS to the hypervisor and the other for the reverse direction.

The Virtio-CL device can be further divided into a front-end (residing in the guest OS) and a back-end (residing in the hypervisor). The guest OS accesses the Virtio-CL device through the front-end driver and writes/reads OpenCL API requests/responses via the virtqueues using the corresponding driver calls. The Virtio-CL back-end driver accepts the requests from the guest OS and passes them to the CL thread, which performs the actual invocation of the OpenCL API calls.


Figure 3.7: Architecture of OpenCL API Remoting in KVM.

The virtqueues can be regarded as shared memory spaces, which can be modeled as device memory of Virtio-CL from the point of view of the guest OS.

• CL thread

The CL thread is dedicated to accessing the vendor-specific OpenCL runtime in user mode. The CL thread reconstructs the requests, performs the actual invocation of the OpenCL API calls, and then passes the results back to the guest OS via the virtqueue used for response transmission (a sketch of this processing loop is given after this list). Since the processing time of each OpenCL API call differs, it is appropriate to handle OpenCL requests in a dedicated thread instead of extending the functionality of the existing I/O thread in the hypervisor process, so that the execution of OpenCL APIs remains independent of the I/O thread.

An alternative to creating a CL thread to process OpenCL requests is to implement the actual invocation of OpenCL API calls in the handler of the vCPU thread and to configure multiple vCPUs for each guest VM. However, both approaches require a buffer (CLRequestQueue, shown in Figure 3.8) to store segmented OpenCL requests in case the size of an OpenCL request is larger than the size of the virtqueue for requests (VQ_REQ in Figure 3.8). In addition, the latter approach has to handle the synchronization of the virtqueue for responses (VQ_RESP in Figure 3.8) among different vCPU threads, while the former only handles the synchronization of CLRequestQueue between the vCPU thread and the CL thread.
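The following sketch outlines the main loop of the CL thread described above. The request record, the CL_API_* identifier and the helpers cl_request_queue_pop()/vq_resp_push() are hypothetical placeholders for the CLRequestQueue and VQ_RESP handling; only the call to the vendor runtime (here clGetPlatformIDs()) is a real OpenCL API.

#include <CL/cl.h>
#include <pthread.h>

/* Hypothetical request record and helpers standing in for CLRequestQueue
 * and the VQ_RESP handling described above. */
typedef struct { int api_id; cl_int ret; /* packed payload ... */ } CLRequest;
enum { CL_API_GET_PLATFORM_IDS /* one identifier per forwarded API ... */ };
extern CLRequest *cl_request_queue_pop(void);    /* blocks; see Figure 3.9 */
extern void       vq_resp_push(CLRequest *req);  /* copy to VQ_RESP + IRQ  */

static void *cl_thread_main(void *arg)
{
    (void)arg;
    for (;;) {
        /* 1st synchronization point: wait for a complete request */
        CLRequest *req = cl_request_queue_pop();

        /* unpack the request and invoke the vendor OpenCL runtime */
        switch (req->api_id) {
        case CL_API_GET_PLATFORM_IDS: {
            cl_uint num = 0;
            req->ret = clGetPlatformIDs(0, NULL, &num);
            /* pack 'num' into the response payload ... */
            break;
        }
        /* ... remaining forwarded OpenCL APIs ... */
        }

        /* steps 8-9: copy the packed response to VQ_RESP and raise the
         * virtual interrupt so the guest's read() can resume */
        vq_resp_push(req);
    }
    return NULL;
}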

As shown in Figure 3.7, the architecture of our OpenCL virtualization framework can be modeled as multiple processes accessing the OpenCL resources in the native environment. The behavior of the execution depends on the implementation of the vendor-specific OpenCL runtime.

3.2.2 Execution Flow

Figure 3.8 illustrates the execution flow of OpenCL API calls, and the processing steps are as follows:

1. A process running in a guest OS invokes an OpenCL API function.

2. The guest OpenCL library (libOpenCL.a) first performs basic verification of the parameters according to the OpenCL API specification. If the verification fails, it returns the corresponding error code to the user-space process.

3. After the parameter verification, the guest OpenCL library sets up the data to be transferred and invokes the fsync() system call.

4. The fsync() system call adds the data to the virtqueue for requests (VQ_REQ) and then invokes the kick() method of VQ_REQ to generate a VM exit. The control of the current vCPU thread is transferred to the corresponding VM-exit handler routine in user mode.


Figure 3.8: Execution flow of OpenCL API Remoting.

5. In user mode, the handler routine copies the OpenCL API request to CLRequestQueue, which is a shared memory space between the vCPU threads and the CL thread. After the copy completes, control is transferred back to the guest OpenCL library. If the data size of the request is larger than the size of VQ_REQ, the request is divided into segments, and steps 3–5 are repeated until all segments are transferred. Once all data segments of the OpenCL request have been transferred, the handler signals the CL thread that there is an incoming OpenCL API request, and the CL thread starts processing it (see step 7).

6. After the request is passed to the hypervisor, the guest OpenCL library invokes the read() system call, which blocks waiting for the result data and returns after the whole result has been transferred.

7. The CL thread waits on a blocking queue until it receives the signal from the VM-exit handler routine in step 5. The CL thread then unpacks the request and performs the actual invocation.

8. After the completion of the OpenCL API invocation, the CL thread packs the result data, copies them to the virtqueue for responses (VQ_RESP), and then notifies the guest OS by sending a virtual interrupt of the Virtio-CL device.

9. Once the guest OS receives the virtual interrupt from the Virtio-CL device, the corresponding interrupt service routine (ISR) wakes up the process waiting for the response data of the OpenCL API call, and the result data are copied from VQ_RESP to the user-space memory. Steps 8 and 9 repeat until all of the result data are transferred.

10. Once the read() system call returns, the guest OpenCL library unpacks and rebuilds the return value and/or the side effects on the parameters. The execution of the OpenCL API function is then complete. (A guest-side sketch of this call path is given below.)
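The following sketch summarizes the guest-library side of this flow for one representative API, clGetDeviceIDs(). The packing helpers, the request/response layout and the CL_API_* identifier are hypothetical; cl_fd and req_buf are assumed to have been prepared by clEnvInit() (see Section 3.2.3).

#include <CL/cl.h>
#include <unistd.h>

/* Hypothetical request/response layout and packing helpers. */
extern int   cl_fd;      /* Virtio-CL device file, opened by clEnvInit()   */
extern char *req_buf;    /* mmap()ed request buffer, set up by clEnvInit() */
typedef struct { cl_int retcode; /* packed output parameters ... */ } cl_response;
enum { CL_API_GET_DEVICE_IDS };
extern void pack_request(char *buf, int api_id, ...);
extern void unpack_response(const cl_response *resp,
                            cl_device_id *devices, cl_uint *num_devices);

cl_int clGetDeviceIDs(cl_platform_id platform, cl_device_type device_type,
                      cl_uint num_entries, cl_device_id *devices,
                      cl_uint *num_devices)
{
    cl_response resp;

    /* step 2: basic verification according to the OpenCL specification */
    if (num_entries == 0 && devices != NULL)
        return CL_INVALID_VALUE;

    /* step 3: pack the parameters into the memory-mapped request buffer */
    pack_request(req_buf, CL_API_GET_DEVICE_IDS, platform, device_type, num_entries);

    fsync(cl_fd);                        /* steps 3-5: push through VQ_REQ */
    read(cl_fd, &resp, sizeof(resp));    /* steps 6-9: block for VQ_RESP   */

    /* step 10: rebuild the return value and the output parameters */
    unpack_response(&resp, devices, num_devices);
    return resp.retcode;
}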

3.2.3 Implementation Details

In this section, we present the implementation details of the proposed virtualization framework, including the guest OpenCL library, the device driver, data transmissions and synchronization points.

Guest OpenCL Library and Device Driver

The device driver of Virtio-CL implements the open(), close(), mmap(), fsync() and read() system calls. mmap() and fsync() are used for transferring OpenCL requests, and the read() system call is used for retrieving the response data. The guest OpenCL library uses these system calls to communicate with the hypervisor.
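A minimal sketch of how these entry points might be registered is shown below. The cldev_* names are hypothetical, and the fsync prototype follows the 2.6.3x-era kernel interface assumed in this work.

#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/module.h>

static int     cldev_open(struct inode *inode, struct file *filp);
static int     cldev_release(struct inode *inode, struct file *filp);
static int     cldev_mmap(struct file *filp, struct vm_area_struct *vma);
static int     cldev_fsync(struct file *filp, struct dentry *dentry, int datasync);
static ssize_t cldev_read(struct file *filp, char __user *buf,
                          size_t count, loff_t *f_pos);

static const struct file_operations cldev_fops = {
    .owner   = THIS_MODULE,
    .open    = cldev_open,     /* set up per-process transmission state        */
    .release = cldev_release,  /* close(): release that state                  */
    .mmap    = cldev_mmap,     /* map the request buffer into user space       */
    .fsync   = cldev_fsync,    /* push the current request through VQ_REQ      */
    .read    = cldev_read,     /* block until response data arrive (Fig. 3.11) */
};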

Before a user process starts using OpenCL resources, it has to explicitly invoke a specific function, clEnvInit(), to perform resource initialization such as opening the Virtio-CL device and preparing the memory storage for data transmissions. Another function, clEnvExit(), has to be invoked to release the OpenCL resources before the process finishes execution.


pthread_mutex_t clReqMutex = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t  clReqCond  = PTHREAD_COND_INITIALIZER;
...
pthread_mutex_lock( &clReqMutex );
while( /* OpenCL request is not ready */ ) {
        /* 'while' guards against spurious wake-ups */
        pthread_cond_wait( &clReqCond, &clReqMutex );
}
/* pop a request node from CLRequestQueue */
pthread_mutex_unlock( &clReqMutex );
...

Figure 3.9: 1st synchronization point (CL thread).

pthread_mutex_lock( &clReqMutex );
/* copy the request data segment to CLRequestQueue */
if( /* OpenCL request is ready */ ) {
        pthread_cond_signal( &clReqCond );
}
pthread_mutex_unlock( &clReqMutex );
...

Figure 3.10: 1st synchronization point (VM exit handler).

Therefore, OpenCL programs to be executed in the virtualized platform have to be revised with minor changes: adding clEnvInit() and clEnvExit() at the beginning and the end of an OpenCL program, respectively. Nevertheless, the revision can be done automatically by a simple OpenCL parser. An alternative approach that avoids such revisions is possible, but it involves a more complex virtualization architecture and results in much more overhead. Further discussion of the alternative implementation is given in Section 3.3.5.
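The required revision is therefore limited to two extra calls, as in the following minimal example of a revised OpenCL host program (the clEnvInit()/clEnvExit() prototypes shown here are assumed; they are provided by the guest OpenCL library).

#include <CL/cl.h>

extern int clEnvInit(void);   /* provided by the guest OpenCL library */
extern int clEnvExit(void);

int main(void)
{
    cl_platform_id platform;
    cl_uint num_platforms;

    clEnvInit();                      /* open Virtio-CL, prepare buffers */

    clGetPlatformIDs(1, &platform, &num_platforms);
    /* ... create context, build kernels, enqueue work as usual ... */

    clEnvExit();                      /* release Virtio-CL resources */
    return 0;
}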

Synchronization Points

There are two synchronization points in the execution flow: one occurs in the CL thread, which waits for a complete OpenCL request, and the other occurs in the implementation of the read() system call, which waits for the result data processed by the CL thread.

ssize_t cldev_read( struct file *filp, char __user *buf,
                    size_t count, loff_t *f_pos )
{
        do {
                /* put the current process into the waiting queue; it is
                 * blocked until a response data segment is ready */
                schedule();
                copy_to_user( buf, kernel_buffer, size_data_seg );
                /* notify the hypervisor that VQ_RESP is available again */
        } while( /* transmission not completed */ );
}

Figure 3.11: 2nd synchronization point (read system call).

The 1st synchronization point is handled by pthread mutexes and condition variables. When the data segments of an OpenCL request have not been completely transmitted to CLRequestQueue, the CL thread invokes pthread_cond_wait() to wait for the signal from the handler routine of VQ_REQ. The pseudo code of the 1st synchronization point is shown in Figures 3.9 and 3.10. The 2nd synchronization point works as follows. The read() system call first puts the calling process into the waiting queue in kernel space, where it is blocked. The blocked process resumes execution and retrieves the response data segment from VQ_RESP after the virtual interrupt of Virtio-CL is raised. The pseudo code of the 2nd synchronization point is shown in Figure 3.11.

The two synchronization mechanisms not only ensure data correctness during the transmission process but also allow the vCPU resources to be used by other processes in the guest OS.

Data Transmissions

In Figure 3.8, the green and orange arrows represent the paths of data transmission. The green arrows indicate the data copies among the guest user space, the guest kernel space and the host user space. Since the data transmission of OpenCL requests is initiated actively by the guest OpenCL library, the data copy from the guest user space to the guest


kernel space can be bypassed by preparing a memory-mapped space (indicated by the orange arrow). The data copy cannot be eliminated in the opposite direction because the guest OpenCL library waits passively for the results from the hypervisor. Thus, there are two data copies for a request and three for a response.
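A possible implementation of this memory-mapped request path is sketched below: the driver exposes a physically contiguous request buffer directly to the guest OpenCL library, so the library builds each request in place and the guest user-to-kernel copy disappears. The buffer name, its size and the use of remap_pfn_range() are our assumptions, not the only way to implement the mapping.

#include <asm/io.h>
#include <linux/errno.h>
#include <linux/fs.h>
#include <linux/mm.h>

extern void *req_buffer;                  /* kmalloc()ed elsewhere; name assumed      */
#define REQ_BUFFER_SIZE (64 * PAGE_SIZE)  /* assumed to match VQ_REQ (Section 3.3.1)  */

static int cldev_mmap(struct file *filp, struct vm_area_struct *vma)
{
    unsigned long size = vma->vm_end - vma->vm_start;
    unsigned long pfn  = virt_to_phys(req_buffer) >> PAGE_SHIFT;

    if (size > REQ_BUFFER_SIZE)
        return -EINVAL;
    /* map the kernel request buffer into the calling process's address space */
    return remap_pfn_range(vma, vma->vm_start, pfn, size, vma->vm_page_prot);
}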

3.3 Related Issues of Implementation

In this section, related issues in the design and implementation of OpenCL support in KVM are discussed, including the size of the virtqueues, data coherence between guest OSes and the hypervisor, etc.

3.3.1 Size of Virtqueues

Virtqueues are shared memory spaces between guest OSes and the hypervisor, and a series of address translation mechanisms allows both the guest OS and the QEMU part to access the data in the virtqueues. On one hand, the size of the virtqueues directly affects the virtualization overhead, because a larger virtqueue results in fewer heavy-weight VM exits. On the other hand, since the total virtual memory space of the hypervisor process is limited (4 Gigabytes in 32-bit host Linux), the size of the virtqueues is also limited.

In our framework, the sizes of VQ_REQ and VQ_RESP are both 256 Kilobytes (64 pages), chosen according to the configurations of the existing virtio devices virtio-blk and virtio-net: virtio-blk has one virtqueue of 512 Kilobytes (128 pages), and virtio-net has

