Applications having high data parallelism enjoy greater performance and power efficiency from SIMD (Single Instruction Multiple Data) computing devices.
On the x86 architecture, for example, a single SSE (Streaming SIMD Extensions) instruction can process four operations simultaneously, and a single AVX (Advanced Vector Extensions) instruction can process eight. GPGPUs push this performance and power efficiency further, since a much greater number of cores can be used to compute in SIMD or SIMT (Single Instruction Multiple Thread) fashion.
SIMT is the strategy of running a large number of threads in parallel, where each thread executes the same instruction on the data section allocated to it. As in SIMD processing, the overhead of instruction fetch, decode, and speculation is largely amortized across the data elements, which gives SIMT a higher power efficiency. The device memory hierarchy in GPGPU also contributes to the greater power efficiency of SIMT processing. Using the CPU and the GPGPU collaboratively to achieve greater
performance and power efficiency is the current trend in computing. This is often referred to as heterogeneous computing, since the GPGPU often uses a different ISA (Instruction Set Architecture) from the CPU.
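As an illustration of the SIMT strategy described above, the following minimal sketch runs the same kernel once per thread ID, with each thread touching only its own share of the data. The names saxpy_kernel and launch are hypothetical, and the threads are executed sequentially for clarity; this is not how a real device schedules them.

```python
# Minimal SIMT-style sketch: every "thread" runs the same kernel,
# but on its own share of the data. All names here are hypothetical.

def saxpy_kernel(thread_id, num_threads, a, x, y, out):
    """Same instruction stream for every thread; only thread_id differs."""
    # Thread t handles elements t, t + num_threads, t + 2 * num_threads, ...
    for i in range(thread_id, len(x), num_threads):
        out[i] = a * x[i] + y[i]


def launch(kernel, num_threads, *args):
    """Run the kernel once per thread ID (sequentially here, for clarity)."""
    for thread_id in range(num_threads):
        kernel(thread_id, num_threads, *args)


x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [10.0] * 8
out = [0.0] * 8
launch(saxpy_kernel, 4, 2.0, x, y, out)
```

The strided assignment of elements to threads is one possible choice; a contiguous block per thread would illustrate the same idea.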
The CPU, being designed for general-purpose computing, is less power efficient for SIMD processing. General-purpose computing often involves complex control flow, which requires sophisticated branch prediction, speculation, out-of-order execution, and cache hierarchies. For processing a large and regular section of data, a SIMD or SIMT architecture can yield much greater power efficiency. The motivation for heterogeneous computing is to leave the control logic of an application to the CPU and let the GPGPU handle the regular data-parallel sections.
One challenge of heterogeneous computing is how to program for two different ISAs within the same application. CUDA from NVIDIA is one of the earliest programming models for heterogeneous computing. However, CUDA is designed for NVIDIA devices only and is not portable to other GPGPU devices. OpenCL (Open Computing Language) is a programming model that was initially developed by Apple and later promoted by the Khronos Group as the open standard for heterogeneous computing platforms. Early programming models for heterogeneous computing platforms focus heavily on the efficient use of device memory hierarchies and are not programmer friendly. For example, existing GPGPU programming models require explicit control of data transfers between the host memory and the device memory. Because of the separate memory spaces, a device memory pointer can access the device memory only. Furthermore, debugging under such models is difficult, as kernel functions running on the device do not support basic debugging functions such as single-step execution and breakpoint setting. Such difficulties call for a revised programming model for heterogeneous computing platforms, based on HSA (Heterogeneous System Architecture) from AMD.
The main idea of HSA is to make GPGPU software development easier. The memory model of HSA is called HUMA (Heterogeneous Uniform Memory Access), which offers a shared virtual memory space between the host and the device. Because the address space is shared, programmers can access both the device memory and the host memory with the same memory pointers. Thus software developers can focus more on computing algorithms rather than on
managing explicit memory copies between the separate address spaces. In addition,
HSAIL (Heterogeneous System Architecture Intermediate Language) is an intermediate representation of HSA for machine-independent code distribution.
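The difference between separate address spaces and a shared virtual memory space can be sketched as follows. This is only a conceptual illustration under simplifying assumptions: Python lists stand in for memory buffers, a list copy stands in for a host-device transfer, and all function names are hypothetical rather than part of any HSA API.

```python
# Hypothetical contrast between separate address spaces and a shared one.

def scale_kernel(buf, factor):
    """Device-side work: scale every element in place."""
    for i in range(len(buf)):
        buf[i] *= factor


def run_with_explicit_copies(host_buf, factor):
    """Separate address spaces: copy in, compute, copy back."""
    device_buf = list(host_buf)        # host -> device copy
    scale_kernel(device_buf, factor)   # kernel sees only device memory
    host_buf[:] = device_buf           # device -> host copy
    return 2                           # number of explicit copies performed


def run_with_shared_memory(host_buf, factor):
    """Shared address space: the kernel receives the host's own buffer."""
    scale_kernel(host_buf, factor)     # same reference, no copies
    return 0


a = [1, 2, 3]
b = [1, 2, 3]
copies_a = run_with_explicit_copies(a, 3)
copies_b = run_with_shared_memory(b, 3)
```

Both variants produce the same result; the shared variant simply removes the two transfers that the programmer would otherwise have to manage.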
Machine-independent code distribution is achieved through on-device code generation from HSAIL to native code. The generated native code is linked and loaded by an on-device linker and loader. HSAIL has a precise definition of its data-parallel syntax, no high-level structure representations, and a finite register set, which together give programmers a more thorough understanding of their code. Vector
instructions in HSAIL allow straightforward SIMD instruction generation with less analysis, which speeds up device code generation. Cross-work-group and cross-lane operations in HSAIL also provide finer control over data computation. The rest of this thesis uses the term agent for the host code, following the terminology of HSA.
Our work is to design a functional-level, system-mode simulator for HSA. The simulator runs two guest machines at the same time. One guest machine is an ARM processor, since current embedded systems are mostly based on ARM processors.
The other guest machine we simulate is the GPGPU. Both the ARM guest machine and the GPGPU are emulated by the x86-based host machine. The ARM guest is emulated by one x86 core, while the GPGPU is emulated by many x86 cores on the host machine. Therefore, the emulated GPGPU can make use of the multi-core power available in the host machine in order to achieve faster simulation.
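The idea of emulating the GPGPU with many host cores can be sketched by distributing work-groups over a pool of host workers. The sketch below is an assumption-laden illustration, not the simulator's implementation: the names run_work_group and emulate_dispatch are hypothetical, and Python threads only demonstrate the partitioning, since the interpreter lock prevents them from running Python code truly in parallel.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def run_work_group(group_id, group_size, kernel, data, out):
    """Execute every work-item of one work-group on a single host worker."""
    start = group_id * group_size
    for item in range(start, min(start + group_size, len(data))):
        out[item] = kernel(data[item])


def emulate_dispatch(kernel, data, group_size, workers=None):
    """Spread work-groups over host workers; each writes a disjoint slice."""
    out = [None] * len(data)
    num_groups = (len(data) + group_size - 1) // group_size  # ceiling division
    with ThreadPoolExecutor(max_workers=workers or os.cpu_count()) as pool:
        futures = [
            pool.submit(run_work_group, g, group_size, kernel, data, out)
            for g in range(num_groups)
        ]
        for f in futures:
            f.result()  # re-raise any exception raised in a worker
    return out


result = emulate_dispatch(lambda v: v * v, list(range(10)), group_size=4)
```

Because each work-group writes to its own index range of the output, the workers need no locking in this sketch.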
As one of the first few groups developing for the HSA framework, we are facing many challenges. For example, no OpenCL/HSA compilers are available at this time.
In order to test our HSA simulator, we must come up with our own HSAIL code.
We have implemented a binary generator, called the Brig generator, that translates HSAIL text into the HSAIL binary format called Brig. Following the HSA
specification, the host and the device communicate via queues that are explicitly managed through AQL (Architected Queuing Language). AQL packets specify the information required for kernel execution. In the
QEMU-based HSA simulator, the GPGPU is simulated by binary translation, in which the Brig code is translated into native binaries that are executed directly on the host machine. On a real heterogeneous machine, this step is called finalization, and it translates the Brig code into device code to be run on the GPGPU device.
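The queue-based communication between the host and the device described above can be sketched as a producer and a consumer sharing a packet queue. The packet below is a simplified, hypothetical stand-in: its fields are illustrative assumptions and do not reproduce the actual AQL packet layout defined by the HSA specification.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class DispatchPacket:
    """Simplified, hypothetical stand-in for a kernel dispatch packet."""
    kernel: object        # callable standing in for a finalized kernel
    grid_size: int        # total number of work-items
    workgroup_size: int   # work-items per work-group (carried, unused here)
    kernarg: tuple        # kernel arguments


class UserModeQueue:
    """Host enqueues packets; the emulated device consumes them in order."""

    def __init__(self):
        self._packets = deque()

    def enqueue(self, packet):
        self._packets.append(packet)

    def process_all(self):
        """Device side: run the kernel of each packet once per work-item."""
        completed = 0
        while self._packets:
            p = self._packets.popleft()
            for work_item in range(p.grid_size):
                p.kernel(work_item, *p.kernarg)
            completed += 1
        return completed


def add_one(work_item, buf):
    buf[work_item] += 1


buf = [0, 10, 20, 30]
queue = UserModeQueue()
queue.enqueue(DispatchPacket(add_one, grid_size=4, workgroup_size=2, kernarg=(buf,)))
done = queue.process_all()
```

In this sketch the host describes the work entirely through the packet, so the device side needs no other information to execute the kernel.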
Since we do not know which GPGPU device will be used, the functionality of the device is simulated by the host machine. Therefore, the current finalizer simply translates the Brig code into native binaries to be executed on the host machine. This Brig-to-host-binary translator is the main theme of this thesis. It is built as a library for QEMU and performs the device code generation. The native code is linked through the HSA link-loader implemented in QEMU. Thereby, the HUMA and HSAIL on-device code generation requirements are met in our simulator.
Our implementation of the HSA simulator follows the HSA 1.0 specification. The rest of this thesis is organized as follows. Chapter 2 introduces the background of this work.
Chapter 3 reviews related work. Chapter 4 presents the design of our project framework. Chapter 5 describes the implementation details.
Chapter 6 discusses the experimental results, and chapter 7 gives the conclusion and future work.