The last decade has seen an increase in processing demand across the different computing markets [1]. This has made it necessary to introduce novel computer architectures to satisfy the exponentially increasing processing needs of end users. As a consequence, heterogeneous computing systems [2] have emerged as commercially available solutions. These systems rely on one or more processing accelerators that are able to perform certain tasks within the users' applications faster and more efficiently. SIMT (Single Instruction, Multiple Threads) architectures are among the most common many-core, multi-threaded processing accelerators. These processors are able to handle a relatively large number of execution contexts simultaneously. Within this scope, GPUs are the most popular and widely used.
The current trend is to utilize these heterogeneous computing systems for a wider range of scientific computing applications and other general purpose tasks. To do this, it is necessary to understand the particularities of the processing accelerator. Thus, programmers are required to consider key architecture details at the software design stage. In addition, a thorough understanding of the application's characteristics and its interaction with the architecture is necessary to fully exploit the processing power of the accelerators. This is particularly delicate for SIMT processors.
The performance of an application executing on SIMT architectures, such as GPUs, depends significantly on its locality characteristic, resource utilization, and control flow behavior, among other factors [3]. The locality characteristic depends on the memory access patterns of the application. Considering these patterns and the details of the underlying memory subsystem is critical to boost performance, because the memory subsystem is the principal performance bottleneck [3]. Applications for SIMT architectures are extremely sensitive to how memory resources are utilized.
Many efforts have already characterized the applications running on SIMT architectures [4, 5, 6]. Most of these works define a set of metrics (percentage of branch divergence, branch predictability, dynamic instructions, memory intensity, etc.) and observe the values that each workload produces for these metrics over a series of runs on real GPUs or simulators [7]. There have also been efforts to characterize the locality of applications [8]. These works carefully explore the relationship between the execution model of the architecture and the data sharing of the application [9]. Such works are able to leverage the data sharing among threads at different levels of the thread hierarchy in the SIMT architecture, and provide guidelines based on this information to improve performance. In this work, we use the terms data sharing between the threads and data reuse between the threads interchangeably.
The data reuse behavior of applications deserves particular attention. As Figure 1(a) shows, one of the benefits of taking advantage of the data reuse is the increase in memory coalescing.
When threads request data from the off-chip memory, their accesses are said to be coalesced when many memory requests can be served in one single off-chip memory transaction. This happens when the accesses are to contiguous or identical addresses. Memory coalescing is not possible when the memory accesses are too scattered. In that case, either additional off-chip memory transactions are necessary or, if caching is present, the latency of the transactions increases. Thus, performance is reduced.
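The coalescing condition described above can be illustrated with a minimal C++ sketch. It assumes that each off-chip transaction serves one aligned segment of fixed size; the 128-byte segment size, the 32-thread group, and the helper names `count_transactions` and `strided_addresses` are illustrative assumptions, not details taken from this work or from any specific GPU.

```cpp
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

// Hypothetical helper: counts the off-chip transactions needed to serve one
// group of thread requests, assuming each transaction fetches exactly one
// aligned segment of `segment_bytes` bytes (128 is an assumed value).
std::size_t count_transactions(const std::vector<std::uint64_t>& addresses,
                               std::uint64_t segment_bytes = 128) {
    std::set<std::uint64_t> segments;
    for (std::uint64_t addr : addresses) {
        segments.insert(addr / segment_bytes);  // index of the aligned segment
    }
    return segments.size();  // one transaction per distinct segment
}

// Hypothetical helper: builds the byte addresses requested by `threads`
// threads, where consecutive threads are `stride_bytes` apart.
std::vector<std::uint64_t> strided_addresses(std::size_t threads,
                                             std::uint64_t base,
                                             std::uint64_t stride_bytes) {
    std::vector<std::uint64_t> out;
    out.reserve(threads);
    for (std::size_t t = 0; t < threads; ++t) {
        out.push_back(base + static_cast<std::uint64_t>(t) * stride_bytes);
    }
    return out;
}
```

Under these assumptions, 32 threads reading contiguous 4-byte words (stride 4) or the same address (stride 0) fall into a single segment, whereas a stride of 128 bytes scatters them over 32 segments. The model ignores accesses that straddle a segment boundary.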
Another benefit is the avoidance of contention, illustrated in Figure 1(b). Contention occurs when data is evicted between two subsequent requests to the same data. Figure 1(b) presents an example for a CMP. First, processor P0 requests a data item from memory and uses it. Then, processor P1 requests data of its own, which causes the eviction of the data previously requested by P0. If P0 requests that data again, P0 will be stalled while the same data is fetched from the off-chip memory a second time. If this series of events repeats frequently during the application's execution, then contention is said to be present. Contention harms performance significantly, since the latency of fetching data from off-chip memory is an order of magnitude higher than that of fetching data from on-chip caches.
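The contention scenario above can be reproduced with a small C++ sketch. It models the shared on-chip storage as a fully associative cache with LRU replacement, which is an assumption made only for illustration; the function name `count_offchip_fetches` and the capacity values are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <list>
#include <vector>

// Hypothetical model: a fully associative on-chip cache with LRU replacement
// that holds `capacity` data items. Returns how many requests in the stream
// miss in the cache and therefore have to be fetched from off-chip memory.
std::size_t count_offchip_fetches(const std::vector<std::uint64_t>& requests,
                                  std::size_t capacity) {
    if (capacity == 0) {
        return requests.size();  // nothing can be kept on chip
    }
    std::list<std::uint64_t> lru;  // front = most recently used item
    std::size_t fetches = 0;
    for (std::uint64_t item : requests) {
        auto it = std::find(lru.begin(), lru.end(), item);
        if (it != lru.end()) {
            lru.erase(it);  // hit: the item is re-inserted at the front below
        } else {
            ++fetches;  // miss: the item comes from off-chip memory
            if (lru.size() == capacity) {
                lru.pop_back();  // evict the least recently used item
            }
        }
        lru.push_front(item);
    }
    return fetches;
}
```

With the request stream {A, B, A} (P0 requests A, P1 requests B, P0 requests A again), a capacity of one item yields three off-chip fetches, because B evicts A before it is reused. A capacity of two items yields two fetches, since the second request to A hits on chip.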
Figure 1. (a) DRAM memory transactions with and without coalescing. The first two cases from the top illustrate coalesced accesses. The last case shows a situation in which coalescing is not possible. (b) Illustration of the contention effect in a CMP.
The impact of data reuse on performance is significant [8, 9, 10]. For SIMT processors, there is a need for architecture-agnostic analyses that assess, qualitatively and quantitatively, the locality characteristics of applications, in particular their data reuse behavior.
The main motivation behind such analyses is to model the large amount of parallelism inherent in SIMT applications and its impact on their data reuse behavior. Existing locality analysis methodologies for applications running on CMP systems, such as reuse distance analysis, are not appropriate for SIMT applications. The main reason for this limitation is the difference in the execution model.
Reuse distance analyses on CMP systems consider implementation details of the architecture in order to maintain accuracy [11]. In these analyses, locality is measured from the perspective of the memory subsystem, keeping track of the addresses accessed. These analyses model the effects of thread interference and of the number of processor cores, which determines the total number of threads running simultaneously. However, the locality measurements obtained with this methodology are heavily dependent on the configuration of the on-chip memory subsystem, and are affected by factors such as the type of task scheduling and allocation. Architectural agnosticism is sacrificed, but these analyses are still very valuable for memory subsystem design, to predict cache miss rates and estimate performance.
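For reference, the conventional notion of reuse distance over an address trace can be sketched in C++ as follows. This uses the common textbook definition (the number of distinct other addresses touched between two consecutive accesses to the same address), not the redefined reuse distance that this work introduces later. The function name `reuse_distances` and the sentinel `kFirstAccess` are hypothetical, and the quadratic-time loop is chosen for clarity rather than efficiency.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Sentinel (assumed for this sketch) marking an address seen for the first time.
const long kFirstAccess = -1;

// For every access in `trace`, returns the number of distinct addresses
// touched since the previous access to the same address, or kFirstAccess
// if the address has not been accessed before.
std::vector<long> reuse_distances(const std::vector<std::uint64_t>& trace) {
    std::vector<long> result;
    result.reserve(trace.size());
    // Maps each address to the trace index of its most recent access.
    std::unordered_map<std::uint64_t, std::size_t> last_seen;
    for (std::size_t i = 0; i < trace.size(); ++i) {
        auto it = last_seen.find(trace[i]);
        if (it == last_seen.end()) {
            result.push_back(kFirstAccess);
        } else {
            // Collect the distinct addresses strictly between the two accesses.
            auto first = trace.begin() + static_cast<std::ptrdiff_t>(it->second + 1);
            auto last = trace.begin() + static_cast<std::ptrdiff_t>(i);
            std::unordered_set<std::uint64_t> distinct(first, last);
            result.push_back(static_cast<long>(distinct.size()));
        }
        last_seen[trace[i]] = i;
    }
    return result;
}
```

For the trace {a, b, c, a, b, b}, the first three accesses are first-time accesses, the second access to a has distance 2 (b and c), the second access to b has distance 2 (c and a), and the final access to b has distance 0. In a fully associative LRU cache, an access hits only if its reuse distance is smaller than the cache capacity, which is why such measurements are tied to the memory subsystem configuration.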
When applying the previously described methodologies, the locality measurements are not solely those of the application, but those of the application interacting with a memory subsystem that has specific characteristics. This methodology becomes inappropriate for SIMT processors, since it considers neither their particular execution model nor their inherently large parallelism. Also, the memory subsystem in SIMT processors has different characteristics than its CMP counterpart, which imposes the need to develop better suited analysis methodologies.
In order to quantify the locality characteristic of SIMT applications in an integral way, it is necessary to abstract the analytical model from the implementation details and practical limitations of SIMT processors, and to perform the analysis as close to the application itself as possible. Analyses performed under such conditions would show the locality characteristic particular to an application in a self-contained, abstract and truly architecture-agnostic way.
This would allow us to measure, as isolated as possible from implementation details, the changes in the locality characteristic under different runtime scenarios and optimizations.
Once this has been quantified, the locality can then be measured in relation to other factors of the SIMT execution model (scheduling, allocation, pipeline length, etc.) and the limitations of commercial architectures.
In this work, we develop a methodology to analyze, quantify, and graphically represent the data reuse behavior of SIMT applications under different execution conditions. To characterize the data reuse, we define a new metric, the data reuse degree, and we redefine the reuse distance concept in order to employ it in our analyses.
We measure the reuse degree in the reuse distance domain of an application's kernel, assessing how significant the data reuse is at different segments of the application. We also obtain the data reuse characteristic for different kernels when modeling different abstractions of parallelism, which gives a clear idea of the manageable locality as processing resources are constrained.
The contributions of this work are as follows: 1) we provide a new analytical model, solely application-dependent and architecture-agnostic, to analyze, quantify, and graphically represent the data reuse behavior of SIMT applications; 2) we provide a methodology that captures the data reuse behavior of SIMT applications under different types of parallelism constraints, from an ideal case where parallelism capabilities are infinite down to more realistic scenarios; 3) we provide a new way to identify an application's access patterns, embodied in its data reuse characteristic; 4) we show how the data reuse characteristic changes when coding optimizations are performed; and 5) we develop a flexible framework that makes it possible to analyze the effects that certain implementation details of SIMT architectures (scheduling, allocation, number of core clusters) have on the reuse characteristic.
This thesis is organized as follows. Chapter 2 gives an overview of SIMT processors. It explains the abstractions of the programming and execution models, and very briefly describes the architecture of a commercial SIMT processor. Chapter 3 explains current state-of-the-art locality analyses and the limitations that arise when they are applied, as they are, to applications for SIMT processors. Chapter 4 develops our new model for characterizing the data reuse, and formally defines the data reuse degree and the reuse distance. Chapter 5 explains the methodology used to perform the analyses. Chapter 6 details the different conditions under which the data reuse characteristic is obtained. We vary the amount of available parallelism, and a different reuse characteristic is obtained for each case. Chapter 7 explains in detail the framework developed to perform the analysis. The framework is mostly programmed in C++; we present its algorithms and the elements that were modeled.
Chapter 8 describes the coding optimization techniques applied to the benchmarks we use in our experiments. These optimization techniques are taken from [9], and are used in our experiments to observe the change in the reuse characteristic after applying them. Chapter 9 shows our experimental results. Chapter 10 presents the related work. Chapter 11 concludes this work.