Case C - EXPERIMENTS AND DISCUSSION - An LLVM-Based Binary Translator for a Heterogeneous System Simulator

6. EXPERIMENTS AND DISCUSSION

6.3 Case C

This section presents the results for kernels that contain barriers. The benchmarks are the Reduction Sum and the Prefix Sum.

The Reduction Sum sums a vector with a parallel reduction in which each work group adds up a part of the data. Before the summation in a work group can start, the data must have been copied to a buffer, and all the work groups must be synchronized before their partial results are added up. The Reduction Sum kernel therefore contains two barriers.
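The two synchronization points can be illustrated with a minimal sketch. It is not the thesis's HSAIL kernel: it assumes a common tree-shaped reduction, emulates the work items of one work group with Python threads, and uses `threading.Barrier` in place of the kernel barrier. The function name `work_group_reduce` and the buffer name `local` are hypothetical.

```python
import threading

def work_group_reduce(data):
    """Sum `data` with one thread per emulated work item.

    Assumption: len(data) is a power of two, as in a typical tree reduction.
    """
    n = len(data)
    local = [0] * n                  # stands in for the work group's buffer
    barrier = threading.Barrier(n)   # stands in for the kernel barrier

    def work_item(lid):
        local[lid] = data[lid]       # copy this item's element into the buffer
        barrier.wait()               # barrier 1: the whole buffer is filled
        stride = n // 2
        while stride > 0:
            if lid < stride:
                local[lid] += local[lid + stride]
            barrier.wait()           # barrier 2: this reduction step is complete
            stride //= 2

    threads = [threading.Thread(target=work_item, args=(i,)) for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return local[0]

print(work_group_reduce([1, 2, 3, 4, 5, 6, 7, 8]))  # prints 36
```

Every emulated work item reaches both barrier call sites the same number of times, including the items that no longer add anything in the later steps. Without that, the remaining items would wait forever.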

The Prefix Sum adds the data at the current index to the summation result of the previous index. When the summation is performed at index i, the summation at index i-1 must already be complete. The parallel Prefix Sum algorithm likewise requires two barriers.
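As an illustration of where two barriers arise, the sketch below uses the well-known Hillis-Steele inclusive scan. This is an assumption: the text does not say which parallel Prefix Sum algorithm the kernel follows. The work items are again emulated with Python threads, and the name `work_group_prefix_sum` is hypothetical.

```python
import threading

def work_group_prefix_sum(data):
    """Inclusive prefix sum with one thread per emulated work item."""
    n = len(data)
    x = list(data)
    barrier = threading.Barrier(n)   # stands in for the kernel barrier

    def work_item(i):
        offset = 1
        while offset < n:
            # Read the partial sum that lies `offset` positions to the left.
            partial = x[i - offset] if i >= offset else 0
            barrier.wait()           # barrier 1: all reads of this step are done
            x[i] += partial
            barrier.wait()           # barrier 2: all writes of this step are done
            offset *= 2

    threads = [threading.Thread(target=work_item, args=(i,)) for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return x

print(work_group_prefix_sum([1, 2, 3, 4, 5, 6, 7, 8]))
# prints [1, 3, 6, 10, 15, 21, 28, 36]
```

The first barrier keeps a work item from overwriting a value that its right-hand neighbour still has to read; the second keeps the next step from reading a value that has not been written yet.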

Kernel name      Number of operations   Number of helper functions
Reduction Sum    109                    37
Prefix Sum       108                    21

Table 6.3. Number of operations present in the Reduction Sum and Prefix Sum kernels.

The input data sizes of the Reduction Sum and the Prefix Sum are 256KB and 128KB respectively.

Figure 6.5. The Reduction Sum running on a 24-core AMD machine versus a 6-core Intel machine.

Figure 6.6. The Prefix Sum running on a 24-core AMD machine versus a 6-core Intel machine.

The Reduction Sum benchmark evaluates the work-group barrier synchronization of the HSA Simulator. Because of the two work-group-level barriers, increasing the number of threads yields no performance gain. In Case A and Case B, the multiple threads pay off because the work items execute in parallel. In the current case, by contrast, the threads bring little benefit, because the barriers stop the work groups from proceeding in parallel until all of them have reached the synchronization point. While the threads within a work group run concurrently before the barriers, only one work group is allowed to proceed in each instance of the work-group dispatching loop. In theory, the execution time should therefore not change with the number of threads. In practice, the operating system dispatches and synchronizes the pthreads differently on every run, so the graph shows a zigzag pattern.
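The flat scaling argued for here can be related to Amdahl's law. This is offered only as a rough model and is not part of the original evaluation; the symbols p (the fraction of the run that actually executes in parallel), n (the number of threads), and T(n) (the execution time with n threads) are introduced for illustration.

```latex
S(n) = \frac{T(1)}{T(n)} = \frac{1}{(1 - p) + \dfrac{p}{n}},
\qquad
\lim_{p \to 0} S(n) = 1 .
```

When the barriers serialize most of the work, p is small, so T(n) stays close to T(1) for any n. The remaining variation then comes from thread scheduling rather than from the thread count.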

Unlike the Reduction Sum, the Prefix Sum illustrates an extreme case of barrier use, which helps to evaluate the work-item-level barrier of the HSA Simulator. The Reduction Sum needs synchronization because of a race condition: it must prevent two or more threads from overwriting the same data. The Prefix Sum, in contrast, is a problem of dependency among the work items: each work item can finish only after the previous work item has written its data. The graph shows the result of this extreme case of synchronization. The zigzag shape of the graph has the same explanation as for the Reduction Sum.

Conclusions

This paper presents the HSA Translator, a translator from the retargetable IR HSAIL to native code, built for fast simulation in the HSA Simulator. HSA is a software development solution introduced by AMD that aims to ease GPGPU programming. With its shared memory model and vector instructions, HSA gives GPGPU programmers an easier programming path. The HSA Translator is implemented on the LLVM infrastructure and translates HSAIL at run time. Its output is a native relocatable object, because LLVM lacks linker support. This linker problem is covered by the HSA Link-loader, which is not a contribution of this paper. Because no existing compiler can generate HSAIL code, another contribution of this paper is a Brig generator, which generates Brig, the binary format of HSAIL, from the HSAIL text format. The HSAIL text programs are written by hand, and five of the six programs are contributions of this paper. The HSA Translator works as the finalizer in the HSA Simulator. The experimental results evaluate the HSA Translator on two different servers. The two servers have different numbers of cores, which shows the scalability. They also differ in memory bandwidth, and the results illustrate that this difference leads to a performance difference.

