Experiment - OMP2OCL Translator: A Translator That Automatically Converts OpenMP Programs to OpenCL Programs

The experiment platform has one Intel Core 2 Quad Q6600 as the host and one NVIDIA GeForce 9800 GT as the device. Measured with oclBandwidthTest from the NVIDIA GPU Computing SDK, the bandwidth from the host to the device is around 2330 MB/s, and the bandwidth from the device to the host is about 1900 MB/s.

Our experiment is based on two important kernels (JACOBI and SPMUL) and one NAS OpenMP Parallel Benchmark (EP). Table 4-1 describes these three benchmarks, sorted from top to bottom in increasing order of the number of source lines. The following sections present the experimental results for each benchmark.

Table 4-1 Descriptions of the Benchmarks

Benchmark    Description    # of Source Lines

[...] of a system of linear equations with largest absolute values in each row and column dominated by the diagonal elements [...]

[...] optimizations included in OMP2OCL translator. As a result, three optimizations, Parallel Loop-Swap (PLS), Caching of Frequently Accessed Global Data (Caching), and Memory Transfer Reduction (MTR), are conducted in the experiments.

Table 4-2 Optimizations Implemented by the Related Work and OMP2OCL Translator

Optimization                                           The Related Work   OMP2OCL Translator
Parallel Loop-Swap (PLS)                               √                  √
Loop Collapsing (LC)                                   √
Caching of Frequently Accessed Global Data (Caching)   √                  √
Matrix Transpose of Threadprivate Array (MT)           √
Memory Transfer Reduction (MTR)                        √                  √

4.1 JACOBI

Without any optimization, both the CUDA program output by the translator of the related work and the OpenCL program output by the OMP2OCL translator perform worse than the original OpenMP program because of uncoalesced global memory accesses in the kernel functions.

The uncoalesced global memory accesses result from the fact that each work-item accesses the arrays row-wise. Accordingly, Parallel Loop-Swap in Phase 1 can be used to resolve them.

Figure 4-1 Experiment Results for JACOBI

With Parallel Loop-Swap in Phase 1, the performance of both the CUDA program and the OpenCL program increases substantially: their execution times are around 57.13% and 57.33% of those of the OpenMP program and the OpenCL program with no optimization, respectively.
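The effect of the loop swap on coalescing can be illustrated with a small model. The sketch below is not the translator's output; it assumes a row-major N x N array, and the helper names `flat_index` and `stride_between_work_items` are introduced here for illustration only. It reports how far apart, in elements, the addresses touched by two adjacent work-items are at the same loop step. A stride of 1 means adjacent work-items touch adjacent elements, which is the pattern a GPU can coalesce.

```python
def flat_index(row, col, n):
    """Offset of element (row, col) in a row-major n x n array."""
    return row * n + col


def stride_between_work_items(scheme, n, step=0):
    """Distance in elements between the addresses that work-items 0 and 1
    access at the same inner-loop step.

    scheme == "row":    work-item w walks along row w (before loop swap)
    scheme == "column": work-item w walks down column w (after loop swap)
    """
    if scheme == "row":
        a0 = flat_index(0, step, n)
        a1 = flat_index(1, step, n)
    elif scheme == "column":
        a0 = flat_index(step, 0, n)
        a1 = flat_index(step, 1, n)
    else:
        raise ValueError("scheme must be 'row' or 'column'")
    return a1 - a0


N = 2048  # assumed array dimension, for illustration only
print(stride_between_work_items("row", N))     # stride of N elements
print(stride_between_work_items("column", N))  # stride of 1 element
```

In the row-wise scheme, adjacent work-items are a full row apart at every step, so their accesses cannot be combined; after the swap they are one element apart.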

With Memory Transfer Reduction in Phase 2, the performance of both the CUDA program and the OpenCL program improves only slightly because the amount of memory transferred between the host and the device is small. Before optimization, the memory transfers are around 32 MB in each direction. After optimization, the transfer from the host to the device is reduced to around 16 MB, whereas the transfer from the device to the host remains the same.
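A rough estimate shows why the gain is small. The sketch below divides the JACOBI transfer sizes reported above by the bandwidths measured with oclBandwidthTest. It ignores per-transfer latency and assumes the measured bandwidths apply to these transfers, so the figures are indicative only. The function name `transfer_ms` is introduced here for illustration.

```python
H2D_MB_PER_S = 2330.0  # host-to-device bandwidth measured on the platform
D2H_MB_PER_S = 1900.0  # device-to-host bandwidth measured on the platform


def transfer_ms(size_mb, bandwidth_mb_per_s):
    """Estimated transfer time in milliseconds, ignoring fixed latency."""
    return size_mb / bandwidth_mb_per_s * 1000.0


# JACOBI: about 32 MB each way before MTR; 16 MB host-to-device after.
before = transfer_ms(32, H2D_MB_PER_S) + transfer_ms(32, D2H_MB_PER_S)
after = transfer_ms(16, H2D_MB_PER_S) + transfer_ms(32, D2H_MB_PER_S)

print(f"before MTR: {before:.1f} ms")
print(f"after MTR:  {after:.1f} ms")
print(f"saved:      {before - after:.1f} ms")
```

Under these assumptions the optimization saves roughly 7 ms of transfer time per run, since only the 16 MB removed from the host-to-device direction contributes.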

With both Parallel Loop-Swap in Phase 1 and Memory Transfer Reduction in Phase 2 applied, the combined effect is as expected: the CUDA program and the OpenCL program have the shortest execution times among the programs output by the translator of the related work and the OMP2OCL translator, respectively.

For JACOBI, Caching Frequently Accessed Global Data in Phase 2 has no effect on either the CUDA program or the OpenCL program, since the global memory accesses exhibit no temporal data locality. Accordingly, the execution times after applying this optimization are not shown in Figure 4-1.

4.2 SPMUL

Without any optimization, both the CUDA program output by the translator of the related work and the OpenCL program output by the OMP2OCL translator perform better than the original OpenMP program because there are few uncoalesced global memory accesses and a large amount of computation.

With Caching Frequently Accessed Global Data in Phase 2, some array elements are cached in registers in the CUDA program and in private memory in the OpenCL program. As a consequence, accesses to the global memory region are reduced, and thus the performance of both the CUDA program and the OpenCL program improves slightly.
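The idea behind this optimization can be sketched as follows. The example is a simplified model, not the translator's generated code. It assumes a sparse matrix stored in CSR form (the arrays `row_ptr`, `col_idx`, and `val` and all function names are hypothetical), treats the output vector `y` as global memory, and treats a Python local variable as the register or private-memory copy. The `CountingArray` wrapper exists only to count accesses to `y`.

```python
class CountingArray:
    """List wrapper that counts element reads and writes, standing in for
    an array that lives in global memory."""

    def __init__(self, data):
        self.data = list(data)
        self.accesses = 0

    def __getitem__(self, i):
        self.accesses += 1
        return self.data[i]

    def __setitem__(self, i, value):
        self.accesses += 1
        self.data[i] = value


def spmv_row_uncached(i, row_ptr, col_idx, val, x, y):
    """One work-item's row: accumulate directly into global y[i]."""
    y[i] = 0.0
    for k in range(row_ptr[i], row_ptr[i + 1]):
        y[i] = y[i] + val[k] * x[col_idx[k]]


def spmv_row_cached(i, row_ptr, col_idx, val, x, y):
    """Same row, but the running sum lives in a private variable."""
    acc = 0.0
    for k in range(row_ptr[i], row_ptr[i + 1]):
        acc = acc + val[k] * x[col_idx[k]]
    y[i] = acc  # single write back to global memory


# 2 x 3 matrix [[1, 0, 2], [0, 3, 4]] in CSR form (illustrative data)
row_ptr, col_idx, val = [0, 2, 4], [0, 2, 1, 2], [1.0, 2.0, 3.0, 4.0]
x = [1.0, 1.0, 1.0]

y_plain, y_cached = CountingArray([0.0, 0.0]), CountingArray([0.0, 0.0])
for i in range(2):
    spmv_row_uncached(i, row_ptr, col_idx, val, x, y_plain)
    spmv_row_cached(i, row_ptr, col_idx, val, x, y_cached)

print(y_plain.data, y_plain.accesses)
print(y_cached.data, y_cached.accesses)
```

Both versions compute the same result, but the cached version touches `y` once per row instead of once at initialization plus twice per nonzero element.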

With Memory Transfer Reduction in Phase 2, the performance of both the CUDA program and the OpenCL program improves only slightly because the reduction in memory transfer is small. Before optimization, the memory transfers are around 85.64 MB from the host to the device and about 7.87 MB from the device to the host. After optimization, the transfer from the host to the device is reduced to around 77.77 MB, whereas the transfer from the device to the host remains the same.

Figure 4-2 Experiment Results for SPMUL

With both Caching Frequently Accessed Global Data and Memory Transfer Reduction in Phase 2 applied, the combined effect is as expected: the CUDA program and the OpenCL program have the shortest execution times among the programs output by the translator of the related work and the OMP2OCL translator, respectively.

For SPMUL, Parallel Loop-Swap in Phase 1 has no effect because no suitable loop nest of the kind described in Subsection 3.1.1 exists; as a consequence, the execution times after applying this optimization are not shown in Figure 4-2.

4.3 EP

Figure 4-3 Experiment Results for EP

Without any optimization, both the CUDA program output by the translator of the related work and the OpenCL program output by the OMP2OCL translator perform better than the original OpenMP program because of the characteristics of EP, which has few uncoalesced global memory accesses and a large amount of parallelizable computation.

[...] respectively. As a consequence, accesses to the global memory region are reduced, and thus the performance of both the CUDA program and the OpenCL program improves slightly.

For EP, Parallel Loop-Swap in Phase 1 and Memory Transfer Reduction in Phase 2 have no effect on either the CUDA program or the OpenCL program; in other words, there is no suitable loop nest for Parallel Loop-Swap and no suitable global data for Memory Transfer Reduction. Consequently, the execution times after applying these optimizations are not shown in Figure 4-3.

4.4 Discussion

Parallel Loop-Swap in Phase 1 resolves uncoalesced global memory accesses in a loop nest with regular data accesses. Coalesced global memory accesses matter significantly in both CUDA programs and OpenCL programs when GPUs are the devices, and the device in our experiment platform is a GPU. Therefore, for JACOBI, both the CUDA program and the OpenCL program with this optimization perform well compared to those without it and to the original OpenMP program. However, with devices other than GPUs, such as CPUs, this optimization might decrease performance instead.

On the experiment platform, caching frequently accessed global data in the private memories and local memories of OpenCL benefits the performance of both the CUDA program and the OpenCL program, because the latencies of private memories and local memories are lower than that of global memory. Nevertheless, with devices other than GPUs, such as CPUs, the latency reduction from Caching might not be significant.

Memory Transfer Reduction in Phase 2 removes unnecessary memory transfers. If the memory transfers are located in a loop nest with many iterations, the performance improvement could be substantial. However, this situation is not present in the experiments. Thus, for JACOBI and SPMUL, the performance of both the CUDA program and the OpenCL program improves only slightly with this optimization compared to the programs without it.
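The dependence on the iteration count can be made explicit with a back-of-the-envelope model. Assuming a redundant transfer of size $S$ that would execute $N$ times over a link of bandwidth $B$, and ignoring per-transfer latency, removing it saves roughly:

```latex
\Delta T \approx \frac{N \cdot S}{B},
\qquad
\text{e.g. } S = 16\ \text{MB},\ B = 2330\ \text{MB/s}:\quad
\Delta T \approx
\begin{cases}
6.9\ \text{ms}, & N = 1,\\
6.9\ \text{s},  & N = 1000 \ \text{(hypothetical)}.
\end{cases}
```

Here $S$ and $B$ are the JACOBI host-to-device reduction and the measured host-to-device bandwidth reported earlier in this chapter; $N = 1000$ is a hypothetical iteration count chosen only to show the scaling.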
