OVERVIEW OF SIMT PROCESSORS - An Analytical Study of the Degree of Data Sharing in SIMT Architectures Using Memory Traces

This section presents general background on SIMT processors. We take current state-of-the-art GPU architectures as our main reference and describe their hardware specifications and programming abstractions.

2.1 Hardware of GPU Architectures

Figure 2(a) presents a diagram of the architecture of a core cluster or, as NVIDIA calls it, a Streaming Multiprocessor (SM). The diagram is based on NVIDIA's Kepler GeForce GTX 680 GPU [12]. In this figure, the elements labeled “Core” represent the CUDA cores, which are the basic processing units inside a GPU. Core clusters are groups of these small cores.

Figure 2: Diagram of a core cluster based on NVIDIA's Kepler GeForce GTX 680 GPU.

(a) The core cluster with its internal hardware modules. (b) Illustration of the thread hierarchy in the SIMT programming model.

The number of cores in each cluster varies with the GPU family, but cores are usually grouped in powers of 2. The GTX 680 has 8 core clusters arranged in 4 groups of two. Task allocation to each core cluster is handled by a thread block scheduler, which appears as the GigaThread Engine in Figure 2. This module issues a group of threads to each cluster based on a task allocation policy.
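As a rough illustration of what a task allocation policy might look like, the sketch below distributes thread blocks over the 8 core clusters in round-robin order. The round-robin policy and the names `assign_blocks` and `NUM_CLUSTERS` are assumptions made for this example only; the text does not describe the actual policy of the GigaThread Engine, which may also account for resource availability in each cluster.

```python
# Hypothetical sketch: round-robin distribution of thread blocks to core clusters.
# The real scheduling policy is not given in the text; round-robin is an
# assumption chosen only for illustration.

NUM_CLUSTERS = 8  # core clusters (SMs) in the GTX 680, as stated in the text


def assign_blocks(num_blocks, num_clusters=NUM_CLUSTERS):
    """Return a dict mapping cluster index -> list of assigned block indices."""
    assignment = {c: [] for c in range(num_clusters)}
    for block in range(num_blocks):
        assignment[block % num_clusters].append(block)
    return assignment


schedule = assign_blocks(20)
for cluster, blocks in schedule.items():
    print(f"cluster {cluster}: blocks {blocks}")
```

With 20 blocks, clusters 0 to 3 receive three blocks each and clusters 4 to 7 receive two.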

Each cluster has private caches that only the threads executing within it can access. Figure 2(a) also shows a unified L2 cache, which is shared by the threads running in all core clusters of the GPU. Four memory controllers handle access to the off-chip memory and perform memory scheduling and coalescing.
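To make the idea of coalescing concrete, the following sketch counts how many aligned memory segments the 32 accesses of one warp touch. The 128-byte segment size and the helper name `transactions` are assumptions for illustration; the text does not give the transaction granularity of the memory controllers.

```python
# Hypothetical sketch: counting memory transactions for one warp's accesses.
SEGMENT_BYTES = 128  # assumed transaction size; not stated in the text
WARP_SIZE = 32


def transactions(addresses, segment_bytes=SEGMENT_BYTES):
    """Count the distinct aligned segments touched by a list of byte addresses."""
    return len({addr // segment_bytes for addr in addresses})


# Unit stride: thread t reads the 4-byte word at byte address 4 * t.
unit_stride = [4 * t for t in range(WARP_SIZE)]
# Large stride: thread t reads the word at byte address 128 * t.
large_stride = [128 * t for t in range(WARP_SIZE)]

print(transactions(unit_stride))   # 1: all accesses fall in one segment
print(transactions(large_stride))  # 32: every access needs its own segment
```

Under these assumptions, accesses to adjacent words can be served as a single transaction, while widely strided accesses cannot be merged.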

Every GPU has a PCI Express interface, the bus that connects the GPU device to its CPU host. The CPU launches the execution of applications on the GPU and transfers the data to GPU memory. Recent generations of GPUs are also able to initiate tasks autonomously [12]. The CPU offloads work to the GPU to accelerate the highly parallel portions of applications, leveraging the GPU's processing power.

Figure 2(a) also shows an array of texture units, a texture cache, a configurable shared memory / L1 cache, a uniform cache (for constant variables) and an interconnection network. The interconnection network provides an interface for the core clusters to move data to and from the unified L2 cache and the off-chip memory. It is important to stress that no coherence or consistency models are implemented in the programming model of the GPUs [13].

2.2 Programming and Execution Abstractions of GPUs

In GPUs, threads are the smallest unit of execution. They are grouped according to a hierarchical scheme that facilitates task allocation from core clusters down to each individual core. Tasks issued to a GPU for execution are represented as a collection of threads grouped into grids consisting of thread blocks, which are further divided into smaller groups of threads called warps [13]. This is the thread hierarchy inherent to the runtime model of the GPU, illustrated in Figure 2(b).
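The hierarchy can be sketched numerically for a one-dimensional launch: given a total number of threads and a block size, the code below computes how many blocks form the grid and how many warps form each block. The function name `decompose` and the launch sizes are hypothetical; the warp size of 32 is the value used by current NVIDIA GPUs.

```python
# Hypothetical sketch: splitting a 1-D launch into blocks and warps.
WARP_SIZE = 32  # threads per warp on current NVIDIA GPUs


def decompose(total_threads, threads_per_block):
    """Return (blocks in the grid, warps per block) for a 1-D launch."""
    blocks = -(-total_threads // threads_per_block)       # ceiling division
    warps_per_block = -(-threads_per_block // WARP_SIZE)  # ceiling division
    return blocks, warps_per_block


blocks, warps = decompose(total_threads=10_000, threads_per_block=256)
print(blocks, warps)  # 40 8
```

The ceiling divisions reflect that a partially filled block or warp still occupies a whole block or warp slot.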

Each warp inside a block can have up to 32 threads in current state-of-the-art NVIDIA GPUs. The number 32 is chosen because it facilitates the management of memory accesses by the memory subsystem [13]. The 32 threads of a warp execute in lockstep, which means that they execute the same instruction over different portions of data. The instructions they execute are those that make up the kernel code. Each thread executes the kernel code, but each thread works on totally or partially mutually exclusive subsets of the data. For this reason, GPUs are said to apply an SIMT execution model.
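Lockstep execution can be mimicked on a CPU with a small simulation: each instruction of the kernel is applied to every lane of the warp before the next instruction starts. The two-instruction `KERNEL` and the function `run_warp` are hypothetical and only illustrate the execution order; they do not model real GPU hardware.

```python
# Hypothetical sketch: one warp executing a kernel in lockstep.
WARP_SIZE = 32

# The "kernel" is a short list of instructions, each acting on one register value.
KERNEL = [
    lambda r: r * 2,  # instruction 0
    lambda r: r + 1,  # instruction 1
]


def run_warp(registers):
    """Apply each instruction to every lane before moving to the next one."""
    for instruction in KERNEL:
        registers = [instruction(r) for r in registers]
    return registers


data = list(range(WARP_SIZE))  # lane i starts with its own data element i
result = run_warp(data)
print(result[:4])  # [1, 3, 5, 7]
```

Every lane runs the same instruction sequence, but each one operates on its own element, which is the essence of the SIMT model described above.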

The threads can be arranged in multidimensional arrays, and they are grouped into warps, which make up the blocks, as mentioned before. Each warp has a warp ID. Inside a warp, each thread also has a unique ID, which is used to associate the thread with the portion of data that it uses.
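The mapping from a multidimensional thread index to a warp ID and a position inside the warp can be sketched as follows. The flattening order (x varies fastest, then y, then z) follows the CUDA convention for thread indices within a block; the function names and the example block shape are hypothetical.

```python
# Hypothetical sketch: from a 3-D thread index to (warp ID, ID within the warp).
WARP_SIZE = 32


def linear_thread_id(idx, dim):
    """Flatten a thread index (x, y, z) inside a block of shape dim = (Dx, Dy, Dz)."""
    x, y, z = idx
    dx, dy, _ = dim
    return x + y * dx + z * dx * dy


def warp_and_lane(idx, dim):
    """Return the warp ID within the block and the thread's position in that warp."""
    tid = linear_thread_id(idx, dim)
    return tid // WARP_SIZE, tid % WARP_SIZE


print(warp_and_lane((5, 3, 0), (16, 16, 1)))  # (1, 21)
```

In a 16 x 16 block, thread (5, 3, 0) has linear ID 53, so it belongs to warp 1 and sits at position 21 inside that warp.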

2.3 Memory Hierarchy of GPUs

The GPU memory hierarchy is particular in that it is tailored to the needs of the programming model described in the previous section. It has 6 different memory spaces: register, local, shared, global, constant and texture.

The different spaces serve different purposes. The constant memory space is read-only memory used to store constants, parameters and data declared as unmodifiable in the CUDA programming model. The texture cache is used to store texture and surface [13] data in a non-inclusive way: texture data is stored exclusively in the texture cache. Registers are assigned to each thread so that it can store operands and perform calculations. The local memory space is a portion of memory assigned to each individual thread, which the thread can read or write as the computation progresses. A thread can also use this space to spill registers when it exceeds its register quota. The lifetime of this memory space lasts as long as the thread is active. The shared memory space can be accessed by all threads within a block and is managed explicitly by the programmer. This space is released as soon as the block finishes execution in the SM. The global, constant and texture memory spaces remain in place even after the kernel has finished execution or other kernels are launched on the GPU.
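The scope and lifetime of the six memory spaces can be summarized in a small lookup table. The sketch below encodes the description above; the names `MemorySpace`, `MEMORY_SPACES` and `survives_kernel` are hypothetical, and the thread lifetime given for registers is an assumption, since the text does not state it explicitly.

```python
# Hypothetical sketch: scope and lifetime of the six GPU memory spaces.
from collections import namedtuple

MemorySpace = namedtuple("MemorySpace", ["scope", "lifetime"])

MEMORY_SPACES = {
    "register": MemorySpace("thread", "thread"),  # lifetime assumed, see lead-in
    "local":    MemorySpace("thread", "thread"),
    "shared":   MemorySpace("block", "block"),
    "global":   MemorySpace("all threads", "application"),
    "constant": MemorySpace("all threads", "application"),
    "texture":  MemorySpace("all threads", "application"),
}


def survives_kernel(space):
    """True if the space remains in place after the kernel finishes."""
    return MEMORY_SPACES[space].lifetime == "application"


persistent = sorted(name for name in MEMORY_SPACES if survives_kernel(name))
print(persistent)  # ['constant', 'global', 'texture']
```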

Understanding the details of the memory hierarchy of these processors is fundamental to understanding the complexity of their locality characteristics. However, as will be explained, the locality behavior of applications depends on multiple factors, starting with resource and parallelism availability. This is the central point of the analytical models proposed in this work.

III. LOCALITY ANALYSES IN CMP AND UNIPROCESSOR SYSTEMS
