OpenCL, short for Open Computing Language, is an open standard for heterogeneous computing that is widely adopted in academia and industry. The standard has two parts. The first is the host-side (i.e. CPU) C-like API [15, 17] used to interact with the device (e.g. GPU). The second is the device-side programming language specification, which is based on the C99 standard. OpenCL was initially developed by Apple Inc. and is now maintained by the Khronos Group.
The first few updates of OpenCL (OpenCL 1.1 and 1.2) added functionality to make it more flexible, including new data types and built-in functions, command handling from multiple host threads, and IEEE-754 compliance for single-precision floating-point math. It was not until the release of OpenCL 2.0 that the full picture of heterogeneous computing emerged. OpenCL 2.0 introduced new features such as Shared Virtual Memory (SVM), Dynamic Parallelism, and Platform Atomics. Compared to OpenCL 1.2, these features give the GPU more control over data and execution and make CPU-GPU cooperation more flexible. They are explored in detail in the following sections.
2.3.1 Shared Virtual Memory
The traditional GPU programming model treats CPU memory and GPU memory as two separate address spaces. The only way to make CPU-side data accessible to the GPU is through explicit memory copies (i.e. clEnqueueReadBuffer and clEnqueueWriteBuffer in OpenCL). Copying has two costs: the contents are duplicated in both CPU and GPU memory, which wastes memory, and the data transfer adds overhead, which can become a bottleneck when the input is large. In addition, the disjoint memory spaces make pointer-based data structures such as linked lists and trees very difficult to use, because a pointer stored in CPU memory is not a valid address on the GPU.
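The pointer problem described above can be illustrated with a minimal plain-C analogy. This is a sketch under stated assumptions, not OpenCL code: two ordinary arrays stand in for CPU and GPU memory, and the hypothetical helper copy_like_buffer_transfer stands in for an explicit buffer copy. All names in the block are illustrative and do not come from the OpenCL API.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Pointer-linked node: `next` stores an address that is only meaningful
 * in the address space where the list was built. */
struct ptr_node {
    int value;
    struct ptr_node *next;
};

/* Index-linked node: `next` is a position in the array, -1 ends the list.
 * This is a common workaround when no shared address space exists. */
struct idx_node {
    int value;
    int next;
};

/* Hypothetical stand-in for an explicit host-to-device buffer copy:
 * it moves raw bytes and knows nothing about pointers inside them. */
void copy_like_buffer_transfer(void *dst, const void *src, size_t bytes) {
    memcpy(dst, src, bytes);
}

/* Build the chain 0 -> 1 -> ... -> n-1 using pointer links. */
void build_ptr_list(struct ptr_node *nodes, size_t n) {
    assert(n > 0);
    for (size_t i = 0; i < n; ++i) {
        nodes[i].value = (int)i;
        nodes[i].next = (i + 1 < n) ? &nodes[i + 1] : NULL;
    }
}

/* Build the same chain using index links. */
void build_idx_list(struct idx_node *nodes, size_t n) {
    assert(n > 0);
    for (size_t i = 0; i < n; ++i) {
        nodes[i].value = (int)i;
        nodes[i].next = (i + 1 < n) ? (int)(i + 1) : -1;
    }
}

/* Returns 1 if every link points at the following element of `nodes`
 * itself, 0 if any link points somewhere else. */
int links_are_local(const struct ptr_node *nodes, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        const struct ptr_node *expected = (i + 1 < n) ? &nodes[i + 1] : NULL;
        if (nodes[i].next != expected) return 0;
    }
    return 1;
}

/* Walk the index-linked list from element 0 and sum the values. */
int sum_idx_list(const struct idx_node *nodes) {
    int sum = 0;
    for (int i = 0; i != -1; i = nodes[i].next) sum += nodes[i].value;
    return sum;
}
```

After a pointer-linked list is copied with copy_like_buffer_transfer, links_are_local returns 0 for the copy, because its links still refer to the original array. The index-linked list survives the copy, but only because the programmer rewrote the data structure to avoid pointers.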
OpenCL 2.0 addresses these problems with a new feature, Shared Virtual Memory (SVM). SVM guarantees that the CPU and GPU share a common virtual address region, as long as that region is a memory buffer allocated through the OpenCL-provided API (clSVMAlloc). For parallel computing, an important question is how updates to an SVM buffer, as a shared resource, become visible. The OpenCL specification defines two types of SVM implementation [8], distinguished by the synchronization granularity of the SVM buffer:
1. Coarse-grained: Synchronization is provided when an SVM buffer is mapped (clEnqueueSVMMap) or unmapped (clEnqueueSVMUnmap), as well as at kernel launch and completion. A coarse-grained SVM buffer has the same fixed virtual address on all devices.
2. Fine-grained: Synchronization is provided at all the points defined for coarse-grained SVM, and additionally at atomic operations.
Here synchronization refers to updating the SVM buffer so that its new contents become visible to all other devices. Since gem5-gpu already supports fully hardware-coherent memory between the CPU and GPU, the SVM implementation will be fine-grained. We believe that fine-grained SVM is the trend for future CPU-GPU systems because, compared to coarse-grained SVM, it is not only easier to use but also makes it possible for the CPU and GPU to work together on shared data.
2.3.2 Dynamic Parallelism
Dynamic Parallelism, or device-side enqueue, in OpenCL 2.0 allows a GPU to launch new kernels from within a running kernel without communicating back to the CPU. This feature enables more flexible algorithm design and lets the GPU handle some irregular or data-dependent applications more efficiently. Here we name two use cases that could benefit from dynamic parallelism:
1. Nested Parallelism: In some irregular applications, because of the properties of the data structure or the algorithm design itself, the amount of work is not known until runtime, which makes it hard to partition tasks equally among threads. For example, in a graph algorithm like BFS, a kernel may map one thread to each node in the graph and have it search that node’s neighbors. Since the number of neighbors of each node is unknown at compile time, programmers normally use a loop in each thread. However, because the loop iteration count differs from thread to thread, there can be serious load imbalance. With dynamic parallelism, programmers can launch new kernels to search a node’s neighbors in parallel and balance the load.
2. Data-dependent Parallelism: Data-dependent parallelism arises when one task depends on another task’s result. In the traditional GPU programming model, if two kernels are data dependent, the programmer must wait for the first kernel to finish before launching the second from the CPU. The overhead of transferring control between the CPU and GPU can drag down performance. With dynamic parallelism, programmers can launch new kernels as soon as part of a kernel is done in order to process the partial output, and the degree of parallelism is determined at runtime, which makes it flexible enough to handle a wide range of data-dependent parallelism.
Although there are many potential benefits from dynamic parallelism, previous work [33] shows that improper use of dynamic parallelism might result in too many kernels being launched, which causes high kernel-launch overhead and a large memory footprint. As a consequence, programmers must be cautious when designing algorithms that use dynamic parallelism.
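The trade-off between load balance and launch count can be made concrete with a small CPU-side model. The following is a minimal sketch in plain C, not device code: it only counts loop iterations and child launches for a given list of node degrees and ignores the actual cost of a launch. The function names and the threshold parameter are assumptions introduced for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Flat mapping: one work-item per node loops over all of that node's
 * neighbors, so the slowest work-item runs max(degree) iterations. */
size_t flat_longest_loop(const size_t *degree, size_t num_nodes) {
    assert(num_nodes > 0);
    size_t longest = 0;
    for (size_t i = 0; i < num_nodes; ++i) {
        if (degree[i] > longest) longest = degree[i];
    }
    return longest;
}

/* Nested mapping (modeled): a node with more than `threshold` neighbors
 * hands them to a child kernel with one work-item per neighbor, so each
 * child work-item handles one neighbor. Smaller nodes keep their loop. */
size_t nested_longest_loop(const size_t *degree, size_t num_nodes,
                           size_t threshold) {
    assert(num_nodes > 0);
    size_t longest = 0;
    for (size_t i = 0; i < num_nodes; ++i) {
        size_t iterations = (degree[i] > threshold) ? 1 : degree[i];
        if (iterations > longest) longest = iterations;
    }
    return longest;
}

/* Number of child kernels the nested mapping would launch. */
size_t nested_child_launches(const size_t *degree, size_t num_nodes,
                             size_t threshold) {
    size_t launches = 0;
    for (size_t i = 0; i < num_nodes; ++i) {
        if (degree[i] > threshold) ++launches;
    }
    return launches;
}
```

For the made-up degree list {1, 2, 1, 64, 3, 1, 2, 1}, the flat mapping has a longest loop of 64 iterations. With a threshold of 8, the model launches one child kernel and the longest remaining loop is 3 iterations. With a threshold of 0, every node launches a child kernel, which is the over-launching situation the paragraph above warns about.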
2.3.3 Platform Atomics and Enhanced Atomic Operations
The SVM feature in OpenCL 2.0 gives the CPU and GPU the opportunity to cooperate on the same data set. But if CPU and GPU threads touch the same memory address at runtime, a synchronization method is needed to guarantee that their updates take effect in the way the programmer intends. The traditional GPU programming model uses atomic operations and barriers to synchronize among GPU threads. OpenCL 2.0 inherits atomic operations from the traditional GPU programming model and adds a new level of atomic operations, the platform atomics. Using platform atomics makes an update atomically visible to all other devices connected to the SVM. The platform atomics feature is designed to be compatible with CPU-side atomic operations, so it doesn’t require any change in CPU code. Programmers can use C11 or C++11 atomic operations on the CPU side and platform atomics on the GPU side to make sure that data shared between the CPU and GPU is updated correctly.
In addition to platform atomics, OpenCL 2.0 replaces the old OpenCL 1.2 atomic functions with new C11-like atomic functions, which provide a more flexible interface and are easier for programmers to use.
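As a minimal sketch of the CPU side of this scheme, the block below uses standard C11 atomics on a counter and an ownership flag that are assumed to live in a fine-grained SVM buffer. The function names and the convention that -1 means "unclaimed" are hypothetical. The device-side call appears only in a comment and is meant as an indication of the C11-like OpenCL C 2.0 interface, not as tested device code.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Host-side increment of a counter assumed to be shared with the device.
 * Returns the value the counter held before the increment.
 *
 * A device-side counterpart in OpenCL C 2.0 would look roughly like:
 *   atomic_fetch_add_explicit(counter, 1, memory_order_relaxed,
 *                             memory_scope_all_svm_devices);
 * where the extra memory-scope argument requests visibility across
 * all devices sharing the SVM buffer. */
int host_count_item(atomic_int *counter) {
    assert(counter != NULL);
    return atomic_fetch_add_explicit(counter, 1, memory_order_relaxed);
}

/* Host-side attempt to claim a shared slot. By the convention assumed
 * here, -1 means the slot is unclaimed. Returns true if this caller
 * stored `my_id`, false if another owner was already recorded. */
bool host_try_claim(atomic_int *owner, int my_id) {
    assert(owner != NULL);
    int expected = -1;
    return atomic_compare_exchange_strong_explicit(
        owner, &expected, my_id,
        memory_order_acq_rel,   /* ordering if the exchange succeeds */
        memory_order_acquire);  /* ordering if it fails */
}
```

The host code is ordinary C11 and contains nothing OpenCL-specific, which reflects the compatibility goal described above: only the device side needs the platform-atomics form of the call.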
2.3.4 Work-Group Built-in Functions
The new work-group built-in functions introduced in OpenCL 2.0 provide programmers with a high-level interface that lets threads perform reduction, broadcasting, and scanning within a work-group. These functions relieve programmers of writing complex code and further increase the programmability of OpenCL. There are three types of such functions:
1. Scan: The scan operation returns, for each thread, the sum, min, or max of the values from all threads with a thread ID less than that of the current thread, optionally including the current thread. Figure 2.2a shows an example of the work_group_scan_inclusive_add operation with eight threads in a work-group. Here "inclusive" indicates that the sum includes the value of the thread calling the function.
2. Reduce: The reduce operation returns the sum, min, or max over all threads in a work-group. Figure 2.2b shows an example of the work_group_reduce_min operation with eight threads in a work-group, which returns the minimum value in the work-group.
3. Broadcast: The broadcast operation returns the data value of a target thread, specified by its thread ID, to every thread in the work-group. Figure 2.2c shows an example of the work_group_broadcast operation with eight threads in a work-group. All threads in the work-group receive the data value from thread 0.
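The semantics of the three operations above can be pinned down with a serial reference model in plain C. This is a sketch, not the OpenCL built-ins themselves: on the device, each thread calls the function with its own value and receives its own result, whereas here element i of an array plays the role of thread i and a loop replaces the collective operation. The ref_ function names are assumptions made for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Models work_group_scan_inclusive_add:
 * out[i] = in[0] + in[1] + ... + in[i]. */
void ref_scan_inclusive_add(const int *in, int *out, size_t n) {
    int running = 0;
    for (size_t i = 0; i < n; ++i) {
        running += in[i];   /* include the calling thread's own value */
        out[i] = running;
    }
}

/* Models work_group_reduce_min: every thread would receive this value. */
int ref_reduce_min(const int *in, size_t n) {
    assert(n > 0);
    int smallest = in[0];
    for (size_t i = 1; i < n; ++i) {
        if (in[i] < smallest) smallest = in[i];
    }
    return smallest;
}

/* Models work_group_broadcast: every thread receives the value held
 * by the thread whose ID is `source`. */
void ref_broadcast(const int *in, int *out, size_t n, size_t source) {
    assert(source < n);
    for (size_t i = 0; i < n; ++i) {
        out[i] = in[source];
    }
}
```

For an eight-thread work-group holding the made-up values {3, 1, 4, 1, 5, 9, 2, 6} (not the values of Figure 2.2), the inclusive scan yields {3, 4, 8, 9, 14, 23, 25, 31}, the min reduction yields 1, and a broadcast from thread 0 gives every thread the value 3.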