4.2 Experimental Results
4.2.1 Validation
We validate our simulator by comparing its normalized running time with that of real hardware. We choose the AMD Kaveri A10-7850K APU as our comparison target.
Figure 4.1 shows the correlation of normalized running time of BF, OF, RT, and SIFT, with
Figure 4.1: Real Hardware and Our Simulator’s Normalized Running Time Correlation. (a) BF (b) OF (c) RT (d) SIFT
perform well on real hardware also perform well in our simulator, and applications that perform poorly on real hardware also perform poorly in our simulator.
4.2.2 Shared Virtual Memory
As described earlier, SVM frees programmers from copying data back and forth between the CPU and GPU. It is the easiest of the new features to adopt: any OpenCL 1.2 application can be rewritten to use SVM by replacing the buffer management API calls in the host code with SVM API calls, and no modification of the kernel code is required.
Figure 4.2: Performance of applications using SVM normalized to applications using memory copy
Figure 4.3: Running time breakdown for applications using SVM
Figure 4.2 shows the speedup of four applications using SVM compared to the ones using the OpenCL 1.2 explicit memory copy API. Among the four applications, only BF’s OpenCL 2.0 kernel code has been modified to use dynamic parallelism; the other three use the same kernel code as their OpenCL 1.2 versions. The speedups for BF, OF, RT, and SIFT are 71.3%, 11.2%, 16.5%, and 4.6%, respectively. Figure 4.3 breaks down the running time of the applications into two parts: memory copy time and kernel execution time. The kernel running time remains almost the same except for BF, whose OpenCL 2.0 version uses different kernel code. The portions of memory copy time for BF, OF, RT, and SIFT are 40.0%, 10.2%, 14.3%, and 4.6%, respectively, which match the speedups shown in figure 4.2. To sum up, for our tested applications, SVM improves the performance of traditional OpenCL 1.2 applications by eliminating their memory copy time, and replacing memory copy with SVM has very little impact on kernel execution.
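The link between the memory copy share and the speedup can be checked with a short calculation. The Python sketch below assumes that each reported copy share is a fraction of the OpenCL 1.2 total running time and that the kernel time is unchanged once the copies are removed; under those assumptions the expected speedup is 1 / (1 - f) - 1 for a copy share f. The function name is illustrative only.

```python
def speedup_from_copy_fraction(copy_fraction):
    """Expected speedup (0.10 means 10%) if the copy share of the run time vanishes.

    Assumes the remaining (kernel) time is unchanged.
    """
    return 1.0 / (1.0 - copy_fraction) - 1.0


# Memory copy shares and measured speedups reported in this section.
copy_share = {"BF": 0.400, "OF": 0.102, "RT": 0.143, "SIFT": 0.046}
measured = {"BF": 0.713, "OF": 0.112, "RT": 0.165, "SIFT": 0.046}

predicted = {app: speedup_from_copy_fraction(f) for app, f in copy_share.items()}

for app in copy_share:
    print(f"{app}: predicted {predicted[app]:.1%}, measured {measured[app]:.1%}")
```

Under these assumptions, the predicted values for OF, RT, and SIFT fall within about half a percentage point of the measured speedups. BF deviates more, which is expected because its OpenCL 2.0 kernel code also differs.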
4.2.3 Dynamic Parallelism
Figure 4.4 shows the performance of applications using dynamic parallelism normalized to the ones not using dynamic parallelism. BF has a 5.2% speedup and PRK has a 19.5% degradation. The different impact on the two applications comes from their different usage of dynamic parallelism.
PRK is a typical case of using dynamic parallelism to handle nested parallelism: each thread spawns a child kernel to process its task in parallel. Wang et al. [33, 32] have done a thorough study on the deficiencies of dynamic parallelism, and PRK has the same problems. First, on average, the number of threads in a child kernel spawned in PRK is only 160. The small child kernels cause SMs to be underutilized and thus degrade the performance. Second, the child kernels of PRK are very memory-intensive: the child kernel code contains six memory operations in only three lines of code. Memory stalls caused by such memory intensity are hard to hide when the number of threads in an SM is small.
BF, on the other hand, uses dynamic parallelism differently, which leads to its performance improvement. The OpenCL 1.2 version of BF is implemented as a producer-consumer program in which the host does not launch the consumer kernel until the producer kernel has finished, as shown in figure 4.5a. In fact, a thread in the consumer kernel depends on only a part of the producer kernel’s results. At the algorithm level, it is preferable for a subset of the consumer kernel’s threads to start once their data is ready rather than wait for all of the data to be ready. The former allows parts of the producer and consumer kernels to run together and increase resource utilization, but the OpenCL 1.2 programming model restricts the implementation to the latter. The OpenCL 2.0 version of BF uses dynamic parallelism to achieve the former. As shown in figure 4.5b, each thread of the producer kernel can launch a child kernel, which performs a part of the job of the OpenCL 1.2 consumer kernel, once the thread has finished its own work. Such an implementation gives the running times of producer and consumer an opportunity to overlap, and thus increases the SM utilization and performance. A key factor in the amount of overlap is the divergence in running time among the producer kernel’s work-groups. If this divergence is high, there are more chances to start a child kernel that runs in parallel with the producer kernel. In addition to this overlapping effect, the number of threads in BF’s child kernel is 1920, which is much larger than PRK’s, so the SM utilization is higher and the capability of hiding memory latency is better.
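The overlapping effect and its dependence on running time divergence can be illustrated with a toy scheduling model. The Python sketch below is not our simulator. It assumes that `slots` stands in for SM capacity, that each producer work-group spawns exactly one child task, and that all durations are made-up illustrative numbers. It compares the total running time when the consumer waits for the whole producer kernel against the total running time when each child becomes ready as soon as its own producer work-group finishes.

```python
import heapq


def makespan_with_barrier(producer_times, consumer_times, slots):
    """OpenCL 1.2 style: consumers start only after every producer has finished."""

    def run_phase(durations, start):
        free = [float(start)] * slots        # time at which each slot becomes free
        heapq.heapify(free)
        for d in durations:
            t = heapq.heappop(free)          # earliest free slot takes the next task
            heapq.heappush(free, t + d)
        return max(free)

    barrier = run_phase(producer_times, 0.0)
    return run_phase(consumer_times, barrier)


def makespan_with_child_launch(producer_times, consumer_times, slots):
    """Dynamic-parallelism style: child i is ready when producer i finishes."""
    free = [0.0] * slots
    heapq.heapify(free)
    # Ready queue entries: (ready_time, unique_seq, duration, child_duration or None).
    ready = [(0.0, i, p, c)
             for i, (p, c) in enumerate(zip(producer_times, consumer_times))]
    heapq.heapify(ready)
    seq = len(ready)
    finish = 0.0
    while ready:
        ready_time, _, duration, child = heapq.heappop(ready)
        slot_free = heapq.heappop(free)
        end = max(slot_free, ready_time) + duration
        heapq.heappush(free, end)
        finish = max(finish, end)
        if child is not None:                # producer done: its child becomes ready
            heapq.heappush(ready, (end, seq, child, None))
            seq += 1
    return finish


# High divergence among producer work-groups: one long, two short.
divergent = ([6, 1, 1], [2, 2, 2])
# No divergence: all producer work-groups take the same time.
uniform = ([2, 2, 2, 2], [2, 2, 2, 2])

for name, (prod, cons) in (("divergent", divergent), ("uniform", uniform)):
    print(name,
          makespan_with_barrier(prod, cons, slots=2),
          makespan_with_child_launch(prod, cons, slots=2))
```

In this toy model the divergent case finishes in 8 time units with child launches instead of 10 with a barrier, while the uniform case takes 8 units either way, which mirrors the observation that higher divergence leaves more room for overlap.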
More recent GPU architectures, e.g. NVIDIA’s Kepler and Maxwell [7, 6], have introduced a new hardware feature: concurrent kernel execution within an SM. When multiple kernels are running on a GPU, this feature enables finer-grained spatial sharing and raises the SM utilization rate. For applications that launch small child kernels, like PRK, this feature can help alleviate the SM underutilization problem. We have not yet implemented this feature in the simulator and leave it as future work.
Figure 4.4: Performance of applications using dynamic parallelism normalized to OpenCL 1.2
4.2.4 Work-Group Built-in Functions
Figure 4.5: Bilateral filter’s overlapping effect using dynamic parallelism. (a) OpenCL 1.2 producer-consumer workflow. (b) OpenCL 2.0 producer-consumer using dynamic parallelism.
In this section, we compare the performance of two applications using OpenCL 2.0 work-group built-in functions against functionally equivalent OpenCL 1.2 implementations. The main difference between the OpenCL 2.0 and 1.2 implementations is that OpenCL 2.0 uses the new PTX warp shuffle instruction while OpenCL 1.2 uses shared memory buffers and barriers to exchange data. Figure 4.6 illustrates the difference in detail by using warp shuffle and shared memory to perform a data exchange, an operation that work-group built-in functions use frequently. Figure 4.6a shows a typical way to perform data exchange in OpenCL 1.2. All threads write to a shared memory buffer and use a barrier instruction to make sure all threads have finished writing. Then each thread reads from that shared memory buffer to retrieve another thread’s data. The whole procedure is inefficient because threads have to synchronize at the barrier and write redundant values to shared memory. That is not the case in OpenCL 2.0. As shown in figure 4.6b, the warp shuffle instruction lets one thread retrieve another thread’s register value. This eliminates the redundant shared memory reads and writes of OpenCL 1.2 and thus improves the performance.
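The two data exchange schemes can be contrasted with a small model. The Python sketch below is only an illustration of the idea for a single 32-lane warp. It is neither PTX nor our implementation, and the function names and the cost counters are hypothetical. Each lane holds one register value and wants the value held by the lane given in `src_lane`.

```python
WARP_SIZE = 32


def exchange_via_shared_memory(registers, src_lane):
    """OpenCL 1.2 style: all lanes write to a shared buffer, barrier, then read."""
    shared = list(registers)                 # one shared-memory write per lane
    cost = {"shared_writes": len(registers), "barriers": 1}
    # After the barrier, every lane reads the slot of the lane it is interested in.
    result = [shared[src_lane[lane]] for lane in range(len(registers))]
    cost["shared_reads"] = len(registers)
    return result, cost


def exchange_via_shuffle(registers, src_lane):
    """Warp-shuffle style: each lane reads another lane's register directly."""
    result = [registers[src_lane[lane]] for lane in range(len(registers))]
    cost = {"shared_writes": 0, "barriers": 0, "shared_reads": 0}
    return result, cost


# Example: every lane fetches the value held by its right-hand neighbour.
registers = [lane * 10 for lane in range(WARP_SIZE)]
src_lane = [(lane + 1) % WARP_SIZE for lane in range(WARP_SIZE)]

shared_result, shared_cost = exchange_via_shared_memory(registers, src_lane)
shuffle_result, shuffle_cost = exchange_via_shuffle(registers, src_lane)
print(shared_result == shuffle_result, shared_cost, shuffle_cost)
```

Both variants deliver the same values, but in this model the shared memory variant pays for 32 writes, 32 reads, and one barrier, while the shuffle variant touches no shared memory at all.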
Figure 4.7 shows the performance of the two applications normalized to the OpenCL 1.2 shared memory implementation. BIS and RMQ improve by 23.1% and 12.8%, respectively. BIS improves more because of the implementation difference behind the work-group built-in functions the two applications use. Figures 4.8a and 4.8b show the implementations behind work-group reduce min and work-group scan add, where the threads in red are running. When performing the operation, work-group reduce min has fewer threads running than work-group scan add. This gives work-group scan add more chances to use warp shuffle to improve performance than work-group reduce min. Because BIS uses work-group scan add, it shows a larger performance improvement than RMQ.
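The difference in running threads can be made concrete by counting active lanes per step. The Python sketch below assumes textbook patterns, a tree reduction for reduce min and a Hillis-Steele scan for scan add, over a single 32-lane group. These patterns are assumptions chosen for illustration and are not confirmed to be the implementations shown in figure 4.8.

```python
def reduce_min(values):
    """Tree reduction; returns the minimum and the active lane count per step."""
    vals = list(values)
    active = []
    stride = len(vals) // 2                  # assumes a power-of-two length
    while stride >= 1:
        for lane in range(stride):           # only lanes 0..stride-1 do work
            vals[lane] = min(vals[lane], vals[lane + stride])
        active.append(stride)
        stride //= 2
    return vals[0], active


def scan_add(values):
    """Hillis-Steele inclusive scan; returns the prefix sums and active lane counts."""
    vals = list(values)
    active = []
    offset = 1
    while offset < len(vals):
        prev = list(vals)                    # all lanes read before any lane writes
        for lane in range(offset, len(vals)):
            vals[lane] = prev[lane] + prev[lane - offset]
        active.append(len(vals) - offset)
        offset *= 2
    return vals, active


minimum, reduce_active = reduce_min(range(32, 0, -1))
prefix, scan_active = scan_add([1] * 32)
print(reduce_active, sum(reduce_active))
print(scan_active, sum(scan_active))
```

Under these assumptions the reduction keeps 16, 8, 4, 2, and 1 lanes busy over its five steps (31 lane operations in total), whereas the scan keeps 31, 30, 28, 24, and 16 lanes busy (129 in total), so the scan performs far more lane-to-lane exchanges that a warp shuffle can serve.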