
2. Background and Related Work

2.1. Graphics Processor Unit (GPU)

GPU technology has changed a great deal in the last few years, both in hardware and in software.

The hardware structure has transformed from a fixed-function rendering pipeline into a programmable pipeline consisting of multiple SIMD processors. The gap in computation power between the CPU and the GPU keeps growing, as illustrated in Fig. 2-1.

Fig. 2-1 Floating-Point Operations per Second for the CPU and GPU

On the software side, OpenGL persists, but several extensions have been added by the OpenGL ARB [4] for use of the programmable pipeline. Based on the programmable pipeline and the SIMD architecture of current GPUs, a new kind of computing method has emerged: general-purpose computation on graphics hardware, also called GPGPU [5][6].


The concept of this computing method is to first map a compute-intensive problem onto many small pieces, then solve those pieces within the pixel-rendering context (which is now programmable), and finally store the results in a frame buffer object.

This method delivers extremely good performance for compute-intensive problems such as matrix operations [7][8]. However, the implementation is difficult for programmers who are not familiar with graphics processing, because everything must be expressed in terms of texture processing. For general-purpose processing on the GPU, the two largest graphics manufacturers, NVIDIA and AMD/ATI, gave developers different answers toward GPGPU: NVIDIA proposed the Compute Unified Device Architecture (CUDA), while AMD/ATI introduced Close-To-Metal (CTM), which later became the AMD Stream technology. Here we give a brief introduction to these two technologies.

2.1.1. Compute Unified Device Architecture (CUDA)

For general-purpose computing on the GPU, NVIDIA released CUDA so that developers can speed up their programs on its G80/G92-based GPUs. It is the architecture that unifies the general computation model on graphics devices. CUDA uses the C language with extensions borrowed from C++, such as templates and C++-style variable declarations, which is a great convenience for programmers. We give more details about CUDA here because our platform is entirely implemented in CUDA.
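As a minimal sketch of the C++-style conveniences mentioned above (the kernel name and parameters are our own illustration, not part of the thesis platform), a CUDA kernel can be templated and can declare variables at the point of first use:

#include <cuda_runtime.h>

// Templated kernel: the element type T is fixed at compile time on the host side.
template <typename T>
__global__ void scale(T *data, T factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // declared where first used
    if (i < n)
        data[i] *= factor;
}

// Host-side instantiation for a concrete type, e.g.:
// scale<float><<<numBlocks, threadsPerBlock>>>(d_data, 2.0f, n);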

CUDA is very different from OpenGL and from other GPU programming interfaces.

Compared with the traditional GPGPU approach, the most important advantage of CUDA is the "scatter write" capability. Scatter write means that CUDA code can write data to an arbitrary address in GPU memory, which is not possible in traditional GPU shader programming. With this capability, many parallel algorithms, such as parallel prefix sum and bitonic sort, can be implemented on the GPU.
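The following is a minimal sketch of scatter write (kernel and variable names are illustrative only): each thread loads a destination index and writes to that arbitrary location in global memory, something a classic fragment shader cannot do.

#include <cuda_runtime.h>

// Each thread writes its input element to an arbitrary destination index.
__global__ void scatterKernel(const float *src, const int *dstIdx,
                              float *dst, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        dst[dstIdx[i]] = src[i];   // arbitrary-address ("scatter") write
}

// Host-side launch, e.g.:
// scatterKernel<<<(n + 255) / 256, 256>>>(d_src, d_idx, d_dst, n);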

In addition, for communication and synchronization there is shared memory on each multiprocessor, which is shared among its sub-processors. Shared memory has extremely high bandwidth compared with onboard GPU memory and can be used as a user-managed cache.

Furthermore, as stated in the official CUDA documentation, the GPU is capable of executing a large number of threads in parallel with very low context-switch overhead, as shown in Fig. 2-2. From top to bottom, each CUDA kernel is executed by a grid of blocks, and each block is composed of a number of threads. Each block is conceptually mapped to a single SIMD processor; however, it is still possible for different threads to take different branches and to set synchronization barriers. Threads inside the same block can communicate with one another through per-block shared memory and synchronize when needed, as sketched in the example below.

Fig. 2-2 GPU Devotes More Transistors to Data Processing
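The sketch below (our own illustration; it assumes the input length is a multiple of the block size of 256 threads) shows the grid/block/thread hierarchy and per-block synchronization: each block cooperatively sums its portion of the input through shared memory and __syncthreads() barriers.

#include <cuda_runtime.h>

#define THREADS 256   // assumed threads per block

// Each block produces one partial sum; threads cooperate via shared memory.
__global__ void blockSum(const float *in, float *out)
{
    __shared__ float tile[THREADS];                  // per-block shared memory
    int tid = threadIdx.x;
    tile[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();                                 // barrier for the whole block

    for (int stride = THREADS / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            tile[tid] += tile[tid + stride];         // tree reduction in shared memory
        __syncthreads();
    }
    if (tid == 0)
        out[blockIdx.x] = tile[0];                   // one result per block
}

// Launched as a grid of blocks: blockSum<<<numBlocks, THREADS>>>(d_in, d_out);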


Shared memory is in fact only one of six different memory spaces, each intended for a different purpose and access pattern, as listed in Table 2-1; a short declaration sketch follows the table. The GPU memory layout is shown in Fig. 2-3. Physically, there are only two kinds of memory available on the GPU: on-chip memory and off-chip memory. For different usages, however, they are divided into different memory spaces and optimized for different purposes.

Table 2-1 Memory Addressing Spaces Available in CUDA

Name             Accessibility  Scope       Speed                  Cache
Register         read/write     per-thread  zero delay (on chip)   X
Local Memory     read/write     per-thread  DRAM                   N
Shared Memory    read/write     per-block   zero delay (on chip)   N
Global Memory    read/write     per-grid    DRAM                   N
Constant Memory  read only      per-grid    DRAM                   Y
Texture Memory   read only      per-grid    DRAM                   Y
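As a rough sketch (kernel and variable names are hypothetical, not part of the thesis platform), the memory spaces of Table 2-1 appear in CUDA source as follows; texture memory is read through the texture-fetch API and is omitted here.

#include <cuda_runtime.h>

__constant__ float coeff[16];        // constant memory: read-only, cached, per-grid

// Pointers passed from the host refer to global memory (off-chip DRAM, per-grid).
__global__ void memorySpaces(const float *gIn, float *gOut)
{
    __shared__ float tile[128];      // shared memory: on-chip, per-block
    float r = gIn[threadIdx.x];      // scalar 'r' is held in a register (per-thread, on-chip)
    // Large per-thread arrays that do not fit in registers may be placed in
    // local memory, which despite its name resides in off-chip DRAM.

    tile[threadIdx.x] = r * coeff[0];
    __syncthreads();
    gOut[threadIdx.x] = tile[threadIdx.x];
}

// Example launch with one block of 128 threads:
// memorySpaces<<<1, 128>>>(d_in, d_out);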


Fig. 2-3 CUDA Programming Model

The on-chip memory is embedded in the SIMD processor. This makes access extremely fast: a read or write takes only about 2 clock cycles when no bank conflict occurs. Compared with on-chip memory, access to off-chip memory is very expensive, taking roughly 200~300 clock cycles per operation. On-chip memory, which usually refers to shared memory, is typically used as a user-managed cache for the SIMD processors to avoid duplicated accesses to off-chip memory. Shared memory can also be used for communication among threads in the same block.
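The following sketch (a 3-point 1D stencil of our own, assuming a block size of 256) illustrates the user-managed cache idea: each input element is loaded from off-chip global memory only once into shared memory, then reused by neighbouring threads from fast on-chip storage.

#include <cuda_runtime.h>

#define BLOCK 256   // assumed block size

__global__ void stencil3(const float *in, float *out, int n)
{
    __shared__ float cache[BLOCK + 2];                     // data plus two halo cells
    int g = blockIdx.x * blockDim.x + threadIdx.x;         // global index
    int l = threadIdx.x + 1;                               // local index (skip left halo)

    cache[l] = (g < n) ? in[g] : 0.0f;                     // one global read per element
    if (threadIdx.x == 0)
        cache[0] = (g > 0) ? in[g - 1] : 0.0f;             // left halo
    if (threadIdx.x == blockDim.x - 1)
        cache[l + 1] = (g + 1 < n) ? in[g + 1] : 0.0f;     // right halo
    __syncthreads();

    if (g < n)                                             // three reads now hit shared memory
        out[g] = cache[l - 1] + cache[l] + cache[l + 1];
}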

CUDA seems to be the best choice for parallel programming, but there are some limitations due to the hardware design. First of all, the precision of floating-point computation is limited: only single-precision floating point is supported right now.


Second, recursive functions are not allowed on the GPU, simply because no stack is present on the GPU. Third, branching of threads within the same block raises a performance issue, because each block runs on a single SIMD processor that can execute only one instruction at a time. If different threads take different branch paths, they are serialized by the thread scheduler on the GPU, and a penalty is introduced (see the sketch below).

Fourth, data transfer between CPU and GPU memory is relatively slow due to the limited bandwidth of the PCI Express bus. For more details, please refer to the official CUDA programming guide [9].
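The branch-divergence penalty mentioned above can be seen in a sketch like the following (an illustrative kernel of our own): even- and odd-numbered threads take different paths, so the SIMD hardware executes the two paths one after the other rather than in parallel.

__global__ void divergent(float *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i % 2 == 0)
        data[i] = data[i] * 2.0f;   // path A: executed first for even threads...
    else
        data[i] = data[i] + 1.0f;   // ...path B: then for odd threads (serialized)
}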

2.1.2. Close to Metal (CTM) (AMD Stream™)

In comparison with CUDA, AMD/ATI announced the Close To Metal (CTM) [10] technology even before CUDA. With a goal similar to CUDA's, CTM tries to open up the computation power of the GPU and apply it to general-purpose computation. Unlike CUDA, however, CTM comes with no comprehensive toolkit: there is no compiler, linker, or high-level language interface, only low-level, assembly-like raw commands executable on AMD/ATI's GPUs. These differences set a barrier for developers and make it less attractive, even though CTM covers almost everything that CUDA can do.

Recently AMD made a shift in its strategy, adding a higher-level abstraction over CTM called the Compute Abstraction Layer (CAL) and combining it with Brook+ as the AMD Stream Computing SDK. However, its development progress is still slow compared with CUDA, and Brook+ is not as flexible as CUDA. That is why we finally chose to use CUDA.
