Background - 通用式圖形處理器上考量快取記憶體行為之多執行緒決定機制

2.1 Baseline GPGPU architecture

The baseline GPGPU architecture used in this thesis is shown in Figure 3. It is modeled by a cycle accurate GPGPU simulator, GPGPU-Sim [13]. The configurations of the architecture adopt NVIDIA’s Fermi GTX480 [24]. More detail information of this simulator can refer to its manual [14]. The GPGPU model consists of 15 shader cores, a centralized CTA scheduler, a unified L2 cache shared by all shader cores, and 6 DRAM partitions. Each shader core has a 32-way pipeline, a private read only L1 cache, a CTA buffer, and a warp scheduler.

The default size of the CTA buffer can accommodate 8 CTAs.

The private L1 data cache of each shader core is read-only. All the write requests would be view as write-misses and the target cache line would be evicted from the cache to the lower level memory hierarchy. The unified L2 data cache is writable, and it is divided into 6 memory partitions, which is connected to one controller and one DRAM partition. The cache lines of both L1 and L2 cache can be accessed in a single transaction. This feature is referred as memory coalescing while all the requests to the same cache line can be serviced at once.

The cache line size is set to 128 bytes, designed for a memory requests from a warp with 32 threads. If some threads in a warp do not access data within the continuous addresses 128 bytes, there would be additional memory accesses until all the requests are fulfilled. These un-coalesced accesses would increase the memory access latency and waste cache space since some data in the cache line may be useless.

To comprehensively understand the performance effects of different multithreading degrees, this work first evaluates the system behavior with no L2 cache, and only enables the private L1 cache. This is mainly because the private L1 cache is easier to control and observe the effects of different multithreading degrees. The unified L2 cache needs to support all the

- 6 -

shader cores, therefore the multithreading from different shader cores together would have more complex and correlated effects. For instance, the inter-block data sharing issue is mainly determined by hardware scheduler at runtime and relatively difficult to predict. If the workloads have no inter-block data locality, then the behavior of L2 cache would be very similar to L1 cache, except for supporting memory requests from all the 15 shader cores by larger cache capacity. Therefore, to simplify the experiments, L2 cache is disabled in this work. A detailed baseline platform configuration is described in Table 8 at section 6. Except for L2 cache, the experimental environment is the same with the default configuration of GTX480 in GPGPU-Sim v.3.2.1.

Figure 3: Baseline GPGPU architecture

- 7 -

2.2 CTA Scheduler and Warp Scheduler

The centralized CTA scheduler dispatches the CTA to shader cores. It is simulated as the NVIDIA GigaThread Scheduler in the Fermi architecture. Before CTA dispatching, the scheduler checks the maximum number of CTAs that can be supported by each shader, including the number of registers and the capacity of shared memory. For example, if a CTA needs 20 registers, and there are total 60 registers in one shader core, the maximum number of CTAs is 3.

The warp scheduler of each shader core controls the instruction issuing. It decides the order of warp selection in the CTA buffer, and it checks the status of instructions belonging to the selected warp whether some of them are ready to be issued. One common warp selection order is the round-robin scheme, and other scheduling algorithms are proposed for different purposes. For example, Jog et al. [10] proposed a two-level round-robin scheduling policy to improve the efficiency of hiding instruction latency. Rogers et al. [8] proposed a scheduling scheme to capture the intra-CTA data locality. This work only modifies the CTA scheduler to achieve the goal of controlling multithreading degree of each shader core. The warp scheduling policy is greedy-then-oldest (GTO) scheme, which is implemented by the GPGPU-Sim. The GTO is a widely used scheduling scheme [9] and performs well compared to the simple round-robin scheduling scheme.

- 8 -

2.3 Workloads and Metrics

In this work, some synthetic benchmarks are used to demonstrate the benefits of cache hierarchy design and the multithreading technique. The proposed decision scheme of multithreading degree is evaluated by some GPGPU applications that are implemented by CUDA programming interface, including in-lab benchmarks [19] and Rodinia [15, 18].

The in-lab benchmarks target on massive memory access pattern, and the working set sizes of some benchmarks are bigger than the L1 data cache, inflicting high pressure. The details of this bench suite are shown in Table 1. Another bench suite, Rodinia benchmarks, comes from many fields, including image processing, data mining, scientific simulation, and linear algebra. These benchmarks target on computation intensive workloads with relatively smaller pressure on the memory system. Table 2 shows some details of Rodinia benchmarks.

Parboil [16] suite is another famous GPGPU bench suite, but it is not suitable for the performance evaluation in this work because its kernel implementation can only support up to 80 CTAs. The kernel functions of Parboil only execute single iteration of CTAs. This feature does not fit our experiment setup since the proposed approach in this work would evaluate the multithreading degrees in the first iteration, and then decide the best multithreading degree for the following iterations of CTA computation.

This work focuses on the overall performance improvement that is measured by instructions per cycle (IPC). The multithreading degree is defined as the total number of active warps as shown in equation (1)

multithreading_degree = (effective CTA buffer size) × (#warps per CTA) (1)

- 9 -

Table 1. In-Lab Massive Irregular Memory Parallel Applications [9]

Applications Domain Descriptions Source Data Set

Sizes

sta Static timing analysis 3.0 MB

gsim gate-level logic simulation 3.5 MB

nbf Molecular

Dynamics

kernel abstracted from the GROMOS code Cosmic [20]

6.3 MB

moldyn force calculation in the CHARMM program 10.2 MB

irreg Computational

euler finite-difference approximations on mesh Chaos [21]

8.5 MB

unstructured fluid dynamics with unstructured mesh 10.2 MB

Table 2. Rodinia Benchmarks [15, 18]

Applications Domain Descriptions Data Set Sizes

BFS Graph

Needleman-Wunsch Bioinfomatics Dynamic Programming 4 MB

Pathfinder Grid traversal Dynamic Programming 9.5 MB

Srad_v2 Image

Processing

Structured Grid 4 MB

- 10 -

在文檔中通用式圖形處理器上考量快取記憶體行為之多執行緒決定機制 (頁 15-20)