4.2 Experimental Results
4.2.5 Platform Atomics
Platform atomic functions allow the CPU and GPU to process the same data concurrently, which brings heterogeneous computing one step closer to embracing a broader range of applications. Supporting true data sharing between the CPU and GPU requires a cache coherence mechanism. In previous work, Jason et al. [25] stated that the massive number of thread accesses from the GPU may overwhelm the directory of a traditional directory coherence protocol, making the directory a potential bottleneck for heterogeneous system coherence. Their analysis also shows that the GPU can generate more than one directory access per GPU cycle, which is hard to support in terms of power and area overhead. However, their experiments were conducted on traditional OpenCL-1.2-like applications, in which the CPU launches tasks to the GPU and then waits for the GPU to finish. The coherence needs of such applications arise only at the beginning and end of a kernel, so they are predictable and easy to optimize. In this section, we analyze two applications from AMD’s APP SDK that exhibit CPU-GPU co-working behavior. Their coherence needs arise not only at the beginning and end of a kernel but throughout the kernel’s entire running time.
(a) OpenCL 1.2 using shared memory to exchange data
(b) OpenCL 2.0 using warp shuffle to exchange data
Figure 4.6: Demonstration of performance difference between OpenCL 1.2 and 2.0.
Figure 4.7: Performance of applications using work-group built-in functions normalized to OpenCL 1.2
(a) Work-group reduce min implementation demonstration.
(b) Work-group scan add implementation demonstration.
Figure 4.8: Implementation difference behind different work-group built-in functions.
Figure 4.9 shows the number of directory accesses in every 100 cycles. The running time of both applications follows a similar trend, which we divide into two phases. The initial phase, which is the problem pointed out by previous work, exhibits bursty access behavior and lasts only for a very short time at the beginning of the kernel. The execution phase, which covers the rest of the running time and has a steadier access behavior, is what we focus on. As we can see, directory accesses during the execution phase are not as intensive as during the initial phase. In fact, Figure 4.9 shows that the number of directory accesses in both applications never exceeds one access per cycle, which indicates that coherence requests will not overwhelm the directory. The reason is the significant difference in data processing speed between the CPU and GPU.
The GPU generates many memory accesses in parallel, whereas the CPU can only access data in memory sequentially. As a result, even though there are bursts of directory accesses during the initial phase because of compulsory misses in the GPU L2 cache, during the execution phase most accesses hit in the GPU L2 cache and only the cache lines touched by the CPU miss. These misses are unlikely to cause many directory accesses, as the CPU does not process data fast enough.
(a) FGSC (b) SABTI
Figure 4.9: Number of Directory Accesses Every 100 Cycles
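The two-phase behavior described above can be illustrated with a toy model. The sketch below is not the simulator used in this work; the function name, the working-set size, the GPU access rate, and the CPU write period are all hypothetical values chosen for illustration. It assumes that the working set fits in the GPU L2 cache, that every GPU L2 miss costs one directory access, and that each CPU access invalidates the GPU's copy of one cache line.

```python
def directory_accesses_per_window(num_lines=256, gpu_lines_per_cycle=8,
                                  cpu_write_period=20, total_cycles=1000,
                                  window=100):
    """Toy model: count GPU L2 misses (directory accesses) per window.

    All parameters are illustrative assumptions, not measured values.
    """
    gpu_l2 = set()  # cache lines currently valid in the GPU L2
    num_windows = (total_cycles + window - 1) // window
    counts = [0] * num_windows
    cpu_line = 0    # next line the (sequential) CPU will touch

    for cycle in range(total_cycles):
        # The CPU touches one line every cpu_write_period cycles, which
        # invalidates the GPU's copy of that line.
        if cycle % cpu_write_period == 0:
            gpu_l2.discard(cpu_line)
            cpu_line = (cpu_line + 37) % num_lines  # arbitrary stride

        # The GPU sweeps the working set, several lines per cycle.
        for i in range(gpu_lines_per_cycle):
            line = (cycle * gpu_lines_per_cycle + i) % num_lines
            if line not in gpu_l2:
                # Miss: one directory access, then the line is cached.
                counts[cycle // window] += 1
                gpu_l2.add(line)
    return counts


counts = directory_accesses_per_window()
print("initial window:", counts[0])
print("peak in later windows:", max(counts[1:]))
```

In this model the first window absorbs every compulsory miss, while later windows only see misses on lines the CPU has touched, so their counts are bounded by how often the CPU writes rather than by how fast the GPU issues accesses.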
Currently, all experiments are run with a single-threaded CPU program co-working with a GPU program. According to Figure 4.9, the maximum number of directory accesses during the execution phase is about 40 per 100 cycles. We believe that this number would scale up with a multi-threaded CPU program, so the directory may still become a bottleneck for this type of application. We leave this as future work.
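As a rough back-of-the-envelope check on the remaining headroom, the sketch below reads the peak of about 40 accesses as a count per 100-cycle window, the unit plotted in Figure 4.9 (this reading is an assumption). It further assumes, purely for illustration, that directory traffic grows linearly with the number of CPU threads and that the directory can serve one access per cycle. The function name is hypothetical.

```python
def threads_to_exceed_capacity(peak_per_window, window_cycles=100,
                               capacity_per_cycle=1):
    """Smallest thread count whose scaled traffic exceeds directory capacity.

    Assumes (hypothetically) linear scaling of directory accesses with the
    number of CPU threads; real scaling has not been measured.
    """
    if peak_per_window <= 0:
        raise ValueError("peak_per_window must be positive")
    capacity_per_window = capacity_per_cycle * window_cycles
    # Smallest n such that n * peak_per_window > capacity_per_window.
    return capacity_per_window // peak_per_window + 1


print(threads_to_exceed_capacity(40))
```

Under these assumptions, 40 accesses per 100 cycles corresponds to 0.4 accesses per cycle, so three CPU threads would be the first point at which the scaled traffic exceeds one directory access per cycle. Whether real workloads scale this way is exactly the open question left to future work.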
Chapter 5
Related Work
5.1 Heterogeneous Computing Simulator
In computer architecture research, the following simulators are widely used for GPU and heterogeneous computing studies. gpgpu-sim [10] is a cycle-level, execution-driven GPU simulator that simulates NVIDIA’s Fermi architecture GPU. It supports CUDA 4.0 and OpenCL 1.2 and provides a fake, interceptor-like library. Once a GPU call reaches the library, it is trapped into a gpgpu-sim session, which simulates execution on a real discrete GPU. gpgpu-sim does not simulate CPU-side execution, so it cannot simulate a full-system CPU-GPU environment. Wang et al. [32] modified the source code of gpgpu-sim to simulate some of the new hardware features in NVIDIA’s Kepler architecture, including dynamic parallelism. gem5-gpu [26] is a detailed event-driven heterogeneous CPU-GPU simulator that integrates the CPU simulator gem5 and the GPU simulator gpgpu-sim so that they share the same physical memory and address space. Currently, gem5-gpu only supports CUDA 4.0. Multi2Sim [31] is an execution-driven heterogeneous CPU-GPU simulator that can simulate various GPU architectures, such as NVIDIA’s Fermi and AMD’s Southern Islands, and supports both the CUDA 4.0 and OpenCL 1.2 programming models. MARSSx86-PTL-SIMT-GPU [34] is a trace-driven CPU-GPU heterogeneous simulator; because of its trace-driven nature, its simulation is less accurate than that of gem5-gpu or Multi2Sim, and it is therefore less used in academia. Both Multi2Sim and MARSSx86-PTL-SIMT-GPU only model separate CPU and GPU memory. Among all these simulators, only gem5-gpu connects the CPU and GPU to the same physical memory and shares the same address space between them. Our work is the first simulator that supports OpenCL 2.0.