
One natural question is why there is such a large peak-performance gap between CPUs and GPUs. The answer lies in the different fundamental design targets of the two processors and in the way their transistors are allocated [94]. Fig. 6.3 compares the basic structures of a typical CPU and a typical GPU [94]. In Fig. 6.3, the orange regions represent transistors devoted to memory, the yellow regions represent transistors devoted to control logic units, and the green regions represent transistors dedicated to performing arithmetic. Larger regions represent more transistors. The CPU is composed of arithmetic logic units (ALUs), control logic units and large cache memories, and is connected to a large amount of Dynamic Random Access Memory (DRAM), on the order of gigabytes (GB). The design of the CPU is optimized for sequential code performance, since it makes use of sophisticated control logic units to allow instructions from a single thread³ to execute.

More importantly, cache⁴ memories are adopted to reduce the instruction and data access latency⁵ of large, complex tasks. The GPU, on the other hand, is designed to perform pure arithmetic operations, so there is no need to devote a large number of transistors to control units, branch predictors, caches, etc. Most of the transistors can therefore be used for units that perform arithmetic operations. This design makes the GPU optimized for executing parallel computing tasks. Hereinafter, the term "device" refers to the GPU and the term "host" refers to the CPU.

The revolutionary era of GPU computing started with the G80-series processor by NVIDIA. Later, in 2008, the GT200 processor was introduced; it was the first to support double-precision floating-point numbers complying with the IEEE 754 standard [94]. To take advantage of the GPU architecture and achieve high performance, a large amount of data parallelism and coalesced memory access is required. In 2012, the Kepler-series GPU card with the GK110 processor was presented. In this dissertation, two Kepler-series GPU cards, the K20 and the K40, are adopted. The detailed features of the adopted Intel CPU and of the K20 and K40 GPU cards are tabulated in Table 6.1.

Take the architecture of the Kepler K20 GPU card as an example. Fig. 6.4 shows the basic structure of a CUDA-capable K20 GPU. The K20 GPU has its own DRAM of up to 5 GB, and data are exchanged between the CPU and the GPU via the PCI-Express interface. The GPU provides several memories, ranging from fast to slow, that the programmer can use and control depending on the data to be accessed. To obtain higher performance, certain memory access patterns must be followed. Note that not all algorithms can take advantage of these fast memories.

³ A thread is an element of data to be processed.

⁴ Cache is a high-speed memory used to reduce the time to access data from the main memory.

⁵ Latency can be regarded as the time between the initialization of a task and the moment it begins to execute.

The GK110 processor in the K20 consists of 13 streaming multiprocessors (SMXs). Each SMX is composed of 192 single-precision and 64 double-precision cores, 4 warp schedulers, 8 dispatch units, 32 load/store (LD/ST) units, and 32 special function units (SFUs). The SFUs evaluate special mathematical functions (e.g. sin, cos, exp, etc.), and the LD/ST units handle load and store operations. The warp schedulers and the dispatch units allow four warps to be issued concurrently. The K20 GPU is clocked at 705 MHz, which gives total single- and double-precision floating-point peak performances of 3.52 and 1.17 TFLOP/s, respectively. Each SMX has its own on-chip memory of 64 KB, which can be accessed both explicitly as shared memory and implicitly as the L1 cache. More detailed features can be found in the NVIDIA Kepler whitepaper [95].
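The quoted peak rates can be reproduced with a back-of-the-envelope calculation. The following minimal sketch assumes that each core completes one fused multiply-add, counted as two floating-point operations, per clock cycle; the function name `peak_tflops` is illustrative and is not part of any NVIDIA API.

```python
def peak_tflops(num_smx: int, cores_per_smx: int, clock_mhz: float,
                flops_per_core_per_cycle: int = 2) -> float:
    """Theoretical peak in TFLOP/s: SMXs x cores x FLOPs per cycle x clock.

    flops_per_core_per_cycle = 2 is an assumption (one fused multiply-add
    per core per cycle).
    """
    flops_per_second = (num_smx * cores_per_smx
                        * flops_per_core_per_cycle * clock_mhz * 1e6)
    return flops_per_second / 1e12


# K20 figures taken from the text: 13 SMXs, 192 single-precision and
# 64 double-precision cores per SMX, 705 MHz clock.
k20_single = peak_tflops(13, 192, 705.0)
k20_double = peak_tflops(13, 64, 705.0)

print(f"single precision: {k20_single:.2f} TFLOP/s")
print(f"double precision: {k20_double:.2f} TFLOP/s")
```

Under this assumption the two results round to the 3.52 and 1.17 TFLOP/s stated above.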

6.2.1 Memory hierarchy

The memory hierarchy of the GPU is composed of several types of memory, ranging from small memories with low latency to large memories with high latency. On-chip there are registers, shared memory, and various caches (L1, L2, constant, and texture). Off-chip there are global, local, constant, and texture memories. Local variables defined in device code are stored in registers, provided that sufficient registers are available.

If there are not enough registers, data are stored in local memory, which seriously reduces performance. Shared memory is an on-chip read/write memory that is accessible only by threads within the same block and has a latency of only 1-2 clock cycles.

Note that shared memory is beneficial only if the number of arithmetic operations is larger than the number of memory accesses. The L1 cache consists of programmable shared memory and a general-purpose cache; the latter is used to accelerate random access operations. The total size of the L1 cache is 64 KB in Kepler-series GPUs.
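One way to see when shared memory pays off is to count global-memory loads for a computation in which neighbouring threads reuse the same data. The sketch below uses a hypothetical 1D stencil, in which each thread sums 2r + 1 neighbouring elements, as the workload. The stencil, the block size and all function names are assumptions chosen for illustration and do not come from the text.

```python
def arithmetic_ops(block_size: int, radius: int) -> int:
    """Additions performed by one block: 2*radius additions per thread."""
    return block_size * 2 * radius


def global_loads_direct(block_size: int, radius: int) -> int:
    """Every thread reads its 2*radius + 1 inputs straight from global memory."""
    return block_size * (2 * radius + 1)


def global_loads_staged(block_size: int, radius: int) -> int:
    """The block copies its inputs, plus 2*radius halo cells, into shared
    memory once; all further reads are served from shared memory."""
    return block_size + 2 * radius


# Hypothetical block of 256 threads, with and without data reuse.
for radius in (0, 4):
    ops = arithmetic_ops(256, radius)
    direct = global_loads_direct(256, radius)
    staged = global_loads_staged(256, radius)
    print(f"radius={radius}: ops={ops}, direct loads={direct}, staged loads={staged}")
```

With radius 0 there is no reuse, so staging the data saves no global loads; with radius 4 the same 2048 additions need 264 global loads instead of 2304.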

Constant memory can be read and written from the host code, but is read-only for threads in the device code. It is cached on the device and is most effective when threads access the same value at the same time. Texture memory is similar to constant memory in that it is read-only for device code. It is simply a different pathway for accessing global memory, and is sometimes useful for avoiding the large global memory access latency. The last level in this hierarchy is the off-chip global memory. It can be accessed by all threads in the GPU, at the cost of a high access latency (hundreds of clock cycles). One strategy to hide the access latency is to increase the number of threads working in parallel.

Most data are usually stored in global memory, since it is larger than the other memories. This is the case in particular when dealing with huge amounts of data.
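The idea of hiding global-memory latency with more threads can be illustrated with a deliberately simplified model: while one warp waits for its data, the scheduler issues independent work from other warps. The latency of 400 cycles and the 20 cycles of independent work per warp used below are assumed values for illustration only, and `warps_to_hide_latency` is a hypothetical helper. The warp size of 32 threads is the standard CUDA value.

```python
WARP_SIZE = 32  # threads per warp on CUDA-capable GPUs


def warps_to_hide_latency(latency_cycles: int, work_cycles_per_warp: int) -> int:
    """Warps needed so the scheduler always has work while one warp waits.

    Simplified model: one warp is stalled on memory, and enough other warps
    must be resident to fill latency_cycles with independent work.
    """
    covering_warps = (latency_cycles + work_cycles_per_warp - 1) // work_cycles_per_warp
    return covering_warps + 1  # plus the warp that is waiting


# Assumed figures: 400-cycle global-memory latency, 20 cycles of work per warp.
warps = warps_to_hide_latency(400, 20)
print(f"warps needed:   {warps}")
print(f"threads needed: {warps * WARP_SIZE}")
```

The model ignores instruction-level parallelism and memory bandwidth limits, so it only indicates the order of magnitude of resident threads required.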

Table 6.2 lists further key features of the different device memories.
