Related Work - 通用式圖形處理器上考量快取記憶體行為之多執行緒決定機制

In this section, some previous researches that focused on the CTA/warp scheduling scheme will be discussed. They targeted on different issues, including branch divergence, hiding memory latency, and cache locality optimization. Only few of them address the trade-off between multithreading and cache contention.

Cheng et al. [2] transformed the CMP applications into the streaming programming model that decouples computation and memory accesses. The approach in [2] extracts tasks from a loop that can be entirely unrolled. The execution flow of a for-loop is decoupled to a memory task and a computation task. This decouple scheme is inspired from the CUDA programming interface, which allows higher controllability of both computation and memory tasks. The memory tasks and the computation tasks are all equally-sized. The computation task and memory task decoupled from the same loop is bound as a dispatching unit. Then, the analytical decision scheme monitors memory bandwidth demand of memory tasks, and it estimates the latency of memory accesses considering different level of memory parallelism.

The analytical model adjusts the parallelism degree gradually based on the execution time of memory tasks. In contrast to the work in [2], the work in this thesis focuses on the correlation between the multithreading degree, system performance, and cache behavior. This work uses the correlation to predict the performance at different multithreading degrees. The approach in this work does not need a progressive adjustment scheme, and therefore it takes much lower overhead for the adjustment progress.

Hong and Kim [3] proposed a performance analytical model for GPGPUs. They discussed the efficiency of memory latency hiding by the multithreading technique. The key component of this approach is to estimate the number of parallel memory requests by considering the number of running threads and memory bandwidth. Based on the degree of

- 11 -

memory parallelism, their model evaluates the system runtime considering the multithreading effect, coalescing of memory accesses, and synchronization overhead. However, their work did not consider the possible side-effect of multithreading, such like memory bandwidth limit that could lengthen the latencies of memory accesses. In addition, their work did not consider the cache hierarchy, and there is no discussion about the overheads of cache conflicts. This thesis gives a comprehensive discussion on various design concerns and trade-offs between cache performance and multithreading degree. The key issues discussed in this thesis includes the cache conflicts, the increment of memory requests, the latencies of memory requests and the effect of pipelining. These aspects are not addressed in the work of [3].

Fung et al. [4,5], Narasiman et al. [6], and Meng et al. [7] proposed warp scheduling schemes to solve branch divergence issues on SIMD GPGPU computation. Their approaches dynamically reformulate warps from all ready threads with the same PC (Program Counter), which expects warps to fully utilize all pipeline lanes. The work [6] also targeted on the improvement of hiding memory accesses through multithreading. They modified the warp hierarchy as a two-level design, and each warp has different priority to issue instruction. It prevents all warps from reaching the long latency instruction concurrently. In this thesis, the branch divergence is not under consideration, and the built-in warp scheduling policy and branch control on GPGPU-Sim is adapted directly.

Rogers et al. [8] proposed a warp scheduling scheme considering the data reusing on data caches, named as Cache Conscious Wavefront Scheduling (CCWS). It uses additional on-chip storage to record the reusing status of data on each cache line, and scores all warps in the CTA buffer for prioritization. They concerned the cache contention issue since the data locality can be destroyed by multithreading when cache capacity is not enough. They implemented an effective approach to control the multithreading scheme. Even though they demonstrated the effect of multithreading on their system performance, there is no further

- 12 -

discussion about the trade-off between multithreading and cache contention. This thesis does not modify the warp scheduler and the data locality improving scheme, but controls the multithreading degrees through CTA scheduler. Besides, this thesis demonstrates various trade-offs between mulithtreading degrees and the contention in memory system.

Kuo et al. [9] proposed a cache contention aware locality optimization scheme on GPGPU by software. They formulate the cache conflict issue of the unified last level cache, and proposed methods to pack CTAs, whose total working set size is smaller than a bound decision model that considers the relation between system performance and multithreading degrees.

Jog et al. [10] proposed a work that discusses several performance issues on warp scheduling scheme of GPGPUs, including CTA-aware two-level warp scheduling, locality aware warp scheduling, DRAM bank level scheduling, and the prefetching scheme on memory side. The proposed two-level warp scheduling policy bound a number of CTAs as a group. The key idea of their idea is to set different priorities for warps in different groups. A higher issue priority is set on warps in the same group. Unless all warps in the current group are idled, the scheduler only issue instructions from warps in the current active group.

However, this approach is not effective since the total number of candidate warps is far from enough to hide the memory access latencies, which are usually around hundreds of cycles. In our work, a direct limit of multithreading degree is set in CTA scheduling policy based on a

- 13 -

prior estimation, and it demonstrates significant improvement in system performance of memory intensive benchmarks.

- 14 -

在文檔中通用式圖形處理器上考量快取記憶體行為之多執行緒決定機制 (頁 20-24)