• 沒有找到結果。

5. IMPLEMENTATION

5.1 BRIG Generator

The BRIG generator is used to translate the HSAIL from text format to BRIG format, which is the binary format of the HSAIL. The BRIG generator consists of the HSAIL Lexical analyzer and the HSAIL rule parser. This implementation is based on the LEX/YACC structure. Both these two components conform to the HSAIL 1.0 version.

The HSAIL Lexical analyzer takes in the HSAIL text code and passes the tokens to the HSAIL rule parser for rule mapping. Tokens include the identifiers, the constants, the operation codes and the modifiers etc. Rules written in the HSAIL parser is implemented according to the EBNF (Extended Backus-Naur Form) grammar rules in the HSAIL specification document.

The BRIG format stores the components into five sections. They are the string section, the directive section, the code section, the operand section and the debug section. Each section has a corresponding section header consists of a 32-bit integer indicating the size of the section. Zeros are padded to the strings to make every string writing in the file a multiple of four. Every IR structure size is also in multiple of four. Therefore, the size of a BRIG file must be in multiple of four.

In addition, the HSAIL version 1.0 and the version before 1.0 have several differences,

1. The writing format of the BRIG – the version before 1.0 has 5 offsets

indicating the sizes of the five sections appear at the beginning of the BRIG. Instead, the sizes of the sections are located at the beginning of each section.

2. The format of the encoding of the string section – the version before 1.0 deduces each string in the string section with the null character. Such mechanisms makes the decoding to have a slower response due to the sizes of the strings are unknown until the null characters are

encountered. In the version 1.0, each string has the corresponding size before the string. In addition, the sizes of the strings are padded to the multiple of four with zeros.

3. More native library functions – such as the floor and ceiling floating point operations are unavailable in the version before 1.0.

According to the specification of the HSAIL, we need to preserve the order of the sections in a BRIG file. For example, the string section is the first section and the operand section is the fourth section. When we parse a register, the identifier of the register should be in the string section and the register IR structure should be in the operand section. In order to write the components to corresponding sections, the IR structures and the strings are stored in the corresponding section buffers within the generation instead of directly writing to the file. When the parsing of the HSAIL text code is finished, buffers are written into the corresponding sections of the binary file. The BRIG is used as the input of the HSA Translator in the GPU guest machine for the HSAIL to native code translation.

This part is included in the scope of this paper.

5.2 HSA Translator

The execution of the GPGPU program with the HSA conformed simulation is

shown in figure 5.1. The BRIG binary and the ARM agent binary are placed in the memory of the ARM guest machine. Users execute the agent code in in the ARM guest machine. When the kernel call in the agent code is being executed, the BRIG is copied to the QEMU physical memory. By reading the information in the AQL packet, the GPU simulating target finds the BRIG file and calls the HSA Translator via a library function call. The memory pointer pointing to the BRIG is the parameter of the HSA Translator.

The BRIG format has a section header before every section. The section header indicates the size of the section. The HSA Translator reads in the BRIG into five buffers. The sizes of the buffers are allocated according to the size specified by the section headers. Reading in the BRIG is necessary because we need to process between sections frequently. Using file pointer to traverse the file may cause severe overhead due to the frequent use of system call. Thus we sacrifice the space of storing the BRIG file into several memory buffers instead of such system calls.

A BRIG IR structure always starts with a 32-bit integer indicating the size of the structure. Following is a 32-bit integer indicating the kind of the structure. Making use of this information and knowing which section we are reading, we can

distinguish all the structures. Decoding process starts with the decoding of the directive section. Code section is accessed if the directive has a code body, such as the kernel and function. Likewise, operand section is accessed if the operation in the code section needs any operands.

(a) HSAIL vector addition kernel code:

version 1:0:$large;

kernel &__OpenCL_vec_add_kernel(kernarg_u32 %arg_val0,

kernarg_u32 arg_val1, kernarg_u32 %arg_val2, kernarg_u32 %arg_val3) { @__OpenCL_vec_add_kernel_entry:

ld_kernarg_u32 $s0, [%arg_val3];

workitemabsid_u32 $s1, 0;

cmp_lt_b1_u32 $c0, $s1, $s0;

ld_kernarg_u32 $s0, [%arg_val2];

ld_kernarg_u32 $s2, [%arg_val1];

ld_kernarg_u32 $s3, [%arg_val0];

cbr $c0, @BB0_2;

brn @BB0_1;

@BB0_1:

ret;

@BB0_2:

shl_u32 $s1, $s1, 2;

add_u32 $s2, $s2, $s1;

ld_global_f32 $s2, [$s2];

add_u32 $s3, $s3, $s1;

ld_global_f32 $s3, [$s3];

add_f32 $s2, $s3, $s2;

add_u32 $s0, $s0, $s1;

st_global_f32 $s2, [$s0];

brn @BB0_1;

};

(b) LLVM Instructions:

; ModuleID = 'hsail'

define void @__OpenCL_vec_add_kernel(i32 addrspace(1)* %arg_val0, i32 addrspace(1)* %arg_val1, i32 addrspace(1)* %arg_val2, i32 addrspace(1)*

%arg_val3) nounwind { entry:

%addr = ptrtoint i32 addrspace(1)* %arg_val3 to i32 %0 = alloca i32

store i32 0, i32* %0 store i32 %addr, i32* %0 %1 = alloca i32

store i32 0, i32* %1

%2 = call i32 @helper_hsa_get_global_id(i32 0) store i32 %2, i32* %1

%3 = alloca i1 store i1 false, i1* %3 %4 = load i32* %1 %5 = load i32* %0

%icmp = icmp ult i32 %4, %5 store i1 %icmp, i1* %3

%addr1 = ptrtoint i32 addrspace(1)* %arg_val2 to i32 store i32 %addr1, i32* %0

%addr2 = ptrtoint i32 addrspace(1)* %arg_val1 to i32 %6 = alloca i32

store i32 0, i32* %6

store i32 %addr2, i32* %6

%addr3 = ptrtoint i32 addrspace(1)* %arg_val0 to i32 %7 = alloca i32

store i32 0, i32* %7

store i32 %addr3, i32* %7 %8 = load i1* %3

br i1 %8, label %"@BB0_2", label %"@BB0_1"

"@BB0_1": ; preds = %"@BB0_2",

%entry ret void

"@BB0_2": ; preds = %entry %9 = load i32* %1

%shl = shl i32 %9, 2 store i32 %shl, i32* %1 %10 = load i32* %6 %11 = load i32* %1

%add4 = add i32 %10, %11 store i32 %add4, i32* %6 %12 = load i32* %6

%13 = alloca float

%callLoad32 = call i32 @_load_32(i32 %12) %Fcast = bitcast i32 %callLoad32 to float store float %Fcast, float* %13

%14 = load i32* %7 %15 = load i32* %1

%add5 = add i32 %14, %15 store i32 %add5, i32* %7 %16 = load i32* %7 %17 = alloca float

%callLoad326 = call i32 @_load_32(i32 %16) %Fcast7 = bitcast i32 %callLoad326 to float store float %Fcast7, float* %17

%18 = load float* %17 %19 = load float* %13

%fadd = fadd float %18, %19 store float %fadd, float* %13 %20 = load i32* %0

%21 = load i32* %1

%add8 = add i32 %20, %21 store i32 %add8, i32* %0 %22 = load i32* %0 %23 = load float* %13

%UIcast = bitcast float %23 to i32

call void @_store_32(i32 %22, i32 %UIcast) br label %"@BB0_1"

}

define void @Kernel_Entry(i32 addrspace(1)* %parmaList) nounwind { entry:

%0 = ptrtoint i32 addrspace(1)* %parmaList to i64 %1 = add nuw i64 %0, 0

%2 = inttoptr i64 %1 to i32 addrspace(1)*

%3 = load i32 addrspace(1)* %2

%4 = inttoptr i32 %3 to i32 addrspace(1)*

%5 = add nuw i64 %0, 4

%6 = inttoptr i64 %5 to i32 addrspace(1)*

%7 = load i32 addrspace(1)* %6

%8 = inttoptr i32 %7 to i32 addrspace(1)*

%9 = add nuw i64 %0, 8

%10 = inttoptr i64 %9 to i32 addrspace(1)*

%11 = load i32 addrspace(1)* %10

%12 = inttoptr i32 %11 to i32 addrspace(1)*

%13 = add nuw i64 %0, 12

%14 = inttoptr i64 %13 to i32 addrspace(1)*

%15 = load i32 addrspace(1)* %14

%16 = inttoptr i32 %15 to i32 addrspace(1)*

call void @__OpenCL_vec_add_kernel(i32 addrspace(1)* %4, i32 addrspace(1)*

%8, i32 addrspace(1)* %12, i32 addrspace(1)* %16) ret void

}

declare i32 @_load_32(i32) nounwind

declare i32 @helper_hsa_get_global_id(i32) nounwind declare void @_store_32(i32, i32) nounwind

Figure 5.1. The mapping of the HSAIL operations and the corresponding LLVM operations

Most of the Arithmetic operations can be mapped to the LLVM operations using one-to-one mapping. Some of them are mapped to combinations of LLVM

operations,

1. The 24-bit operations – 24-bit operations is absent in the LLVM operations. We need to perform a bit mask of 24-bit after such operations.

2. The bit string, copy and multimedia operations – Whilst the LLVM operations does not provide such operations, we can achieve such functions with a combinations of the arithmetic operations.

3. The segment checking and converting operations – These two kinds of operations need the support from the environment. The current version of HSA Simulator has no work group memory implementation.

Therefore, the segment checking and converting operations are left to the future work.

4. Mathematic operations are defined by the host library. Thus helper functions are implemented for the mathematic operations.

HSAIL operations Mathematic helper functions

Table 5.2. The mathematic operations are implemented by the helper functions. The operations with “*”

is introduced after HSAIL version 1.0.

The memory operations class consists of the load, store and the atomic

operations. The load and store operations are translated to LLVM function call with different function prototypes indicating the load/store performing to different bit-width of data. In addition, due to the lack of support in work group memory, all the memory accesses are treated as the memory accesses to the flat global memory.

In addition, each HSAIL memory operations load/store the data with the help of the memory address parameters which are the memory addresses in the QEMU

memory space. Thus the memory helper functions take the memory addresses provided in the HSAIL to compute the addresses of the data in the QEMU memory.

The load operations have the data bit-width data type and the store operations are in void type. When loading a floating point data, an LLVM bit-casting operation is needed to change the data type of the data from integer to floating point. Likewise,

the store operations need a bit-casting operation to change the data type of the data from floating-point to integer type in order to perform the storing. Whilst the

operation is called bit-casting, the bit-width of the data is still unchanged and the only difference is the operating data type in the LLVM IR stage. Furthermore, the fine-grained syntax in the HSAIL for the memory operations are not supported in the current version of the HSA Simulator. Every memory operation existing in the kernel function is treated as the memory access that should be performed by every work item. Atomic operations are not supported in the current version of the simulator.

HSAIL operations Memory helper functions

ld 8 uint8_t load_8(u32 addr)

16 uint16_t load_16(u32 addr) 32 uint32_t load_32(u32 addr) 64 uint64_t load_64(u32 addr)

st 8 void store_8(u32 addr, uint8_t val)

16 void store_16(u32 addr, uint16_t val) 32 void store_32(u32 addr, uint32_t val) 64 void store_64(u32 addr, uint64_t val)

Table 5.2. The mapping table of HSAIL operations to the QEMU memory helper functions.

Branch operations except the fine-grained modifier function are mapped to the LLVM branch operations. The HSAIL function body operations are present in basic blocks. Therefore, the branch operations in the HSAIL can be mapped to the LLVM IR as the LLVM also uses the basic block as the basic component for the container of the operations. However, the compare and branch operation in the HSAIL has to be mapped to one compare and one conditional branch in the LLVM IR as there is no

such operation in the LLVM IR.

Synchronization operations such as the barriers are implemented by the helper functions. Thus we translate the barriers to helper function call. In addition, the current version of the HSA Simulator only provides the work group level thread synchronization. The fine-grained barrier is considered as the future work.

The special functions such as the work group id queuing and the work item absolute id queuing are implemented by the helper functions. Thereby we translate such operations to helper function call.

HSAIL operations Special helper functions

workitemid helper_workItemId

workitemabsid helper_workItemAid

workgroupSize helper_workGroupSize

gridsize helper_workGridSize

gridgroups helper_wordGridGroups

laneid helper_laneId

maxwaveid helper_maxDynWaveId

maxcuid helper_maxCuId

dispatched helper_dispatchId

dim helper_workDim

workitemabsflatid helper_workItemaidFlat

workitemflatid helper_workItemidFlat

Table 5.3. Operation mapping of the HSAIL special operation to the helper functions.

According to the specification of the HSAIL, all operation destinations must be registers. In order to maintain the values in the registers, we do a register table

through a set of memory pointers. In the LLVM IR, the operand types and the

destination allocated type must be the same, such as floating-point operations must have floating-point destination registers and floating-point source registers. On the contrast, the HSAIL has no specific floating-point registers. Therefore, we do the bit casting in order to meet such restriction. The bit casting is done when we need a floating-point value from an integer value with the same bit width, such as 32-bit floating-point value to 32-bit integer value, or vice versa. Whenever a value is going to be written into a register, we check the type of the value with the type of the allocation of the register. If they are not the same, we use the LLVM allocation instruction to have a new allocation with the type of the being written value for the register. This new memory pointer replaces the memory pointer of such register in the register table. The truncations and the extensions of each type of the values are also implemented to make sure the values generated fit the bit width of the

destination registers. The native registers manipulation is left to the LLVM Infrastructure.

Vector types are supported in the HSAIL. We translate these types to the LLVM vector types. The LLVM can generate the native SIMD instructions easily by the explicit use of the LLVM vector types and vector type operation.

Figure 5.2. First, the BRIG is copied to the QEMU physical memory. Second, the X86 target obtains the QEMU physical memory address of the BRIG through the reading the AQL packet. Third, the memory pointer pointing to the BRIG is passed to the HSA Translator. After the translation, a re-locatable object code is outputted to the QEMU physical memory.

A wrapper function called kernel entry in the generated LLVM IR for the pass of the memory addresses. When the kernel code is being executed, the wrapper is called first and the actual kernel function will be called by the wrapper. The wrapper is added because,

a. Parameters passing to the kernel function are in pointer types as the operations in the kernel code perform direct reading and writing from/to certain memory addresses. The number of the parameters passing to the kernel function are unable to be determined in before runtime stating. Thus we need a wrapper for the parameter passing to the actual kernel function.

b. The kernel function prototype and identifier is runtime stated. The QEMU cannot call a function with undetermined pointer name.

Therefore, we provide a fixed wrapper function pointer for calling the

kernel function.

The LLVM infrastructure does not have the linker support. Thus the HSA Translator generates these helper function calls and leaves them unlinked. Finally a re-locatable native object code with helper functions unlinked is outputted to the QEMU physical memory.

This part is included in the scope of this paper.

Figure 5.3. The running flow of the HSA Translator starts with the decoding of the directive section. Any instructions in the directive is available code section decoding is invoked. Necessary operands will be accessed through the section pointer in the operations. Finally the re-locatable object is generated after the translation. The current use ISA is X86 because the simulator uses an X86 target for GPU simulation.

5.3 HSA Link-loader

The generated re-locatable object code is placed in the QEMU physical memory.

The design of the LLVM infrastructure is for re-targetable use, thus no linking process is provided. In order to have a linked kernel function binary, we implemented a link-loader called HSA link-loader in the QEMU.

The link-loader does the linking of the helper functions. To resolve the

addresses of the helper functions, linker scans the symbol table of the re-locatable

object code and fills the addresses of the helper functions in the object code. The addresses are found by searching the QEMU binary easily as the link-loader is run within the same QEMU runtime. Addresses remain unchanged in the same runtime.

The helper functions are not preferred to use the in-lining approach because the code sizes of the helper functions are huge. In addition, the functions have to be translated to the LLVM IR first before any linking process. Such translation causes loss information of the global variables and global structures in the QEMU. These reasons give rises to the approach of implementing a link-loader in the QEMU for the HSA simulation.

Because the addresses are not going to be changed during the same runtime, no second linking process should be done to the same object code. Leveraging such linking strategy, the linked object code is stored in the code cache for later use. The linked native kernel function binary is ready to be executed at this moment.

This part is excluded the scope of this paper. On the contrast, we explain the HSA link-loader for giving a more thorough understanding of the translation to the readers.

相關文件