The importance of data reuse in CMPs (multi-/many-core) and SIMT processors as a means of improving performance across benchmarks and application domains was demonstrated in [8, 10, 23, 24, 25, 26]. As noted in [27], understanding data reuse becomes important in many-core systems that are limited by memory bandwidth, as is the case for SIMT processors [28]. Exploiting data reuse saves memory bandwidth because fewer accesses are required [12, 13, 29]. The capacity limitation is significant in SIMT processors because their caches are small relative to the number of threads [12, 29]. In [9], Kuo et al. explain that this capacity constraint can cause contention and destructive sharing in certain cases.

It is necessary to understand whether these phenomena stem from application behavior or from limitations of the memory subsystem. Understanding data reuse is also key to improving the management of the memory resources available to SIMT applications, which are relatively scarce but critical in boosting performance [3].

In [8], Jia et al. propose a taxonomy of data reuse behavior based on the abstractions of the execution model, along with compiler-based techniques to analyze that behavior.

To do so, they use the intrinsic relationship between the thread identification mechanism and the portion of the total data set that each thread accesses. This approach is limited in that it cannot analyze memory accesses whose addresses are unknown until runtime. They also assume that applications running on a GPU present negligible cross-block data reuse. This assumption is valid for a specific set of applications. However, as shown in the results section of our work, there are applications that do present considerable cross-block data reuse.
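Whether a given kernel exhibits cross-block reuse can be checked directly on a runtime trace. The following is a minimal sketch, assuming a hypothetical trace format of `(block_id, address)` pairs; the function name `cross_block_reuse_fraction` and the sample trace are illustrative only and are not taken from [8].

```python
from collections import defaultdict


def cross_block_reuse_fraction(trace):
    """Return the fraction of distinct addresses touched by more than one block.

    `trace` is an iterable of (block_id, address) pairs (assumed format).
    """
    blocks_per_address = defaultdict(set)
    for block_id, address in trace:
        blocks_per_address[address].add(block_id)
    if not blocks_per_address:
        return 0.0
    # An address is shared across blocks if at least two distinct blocks touch it.
    shared = sum(1 for blocks in blocks_per_address.values() if len(blocks) > 1)
    return shared / len(blocks_per_address)


# Hypothetical trace: three thread blocks, address 0x104 is touched by all of them.
sample_trace = [
    (0, 0x100), (0, 0x104),
    (1, 0x104), (1, 0x108),
    (2, 0x104), (2, 0x10C),
]
print(cross_block_reuse_fraction(sample_trace))  # 0.25
```

A value close to zero would be consistent with the negligible-reuse assumption for that kernel, while larger values indicate that the assumption does not hold.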

In [9], Kuo et al. developed a standalone library that builds a hypergraph representing the data sharing between different runtime abstractions of GPUs. Based on this analysis, code optimizations that mimic scheduling mechanisms are applied to CUDA kernels. The gain in performance is significant.

Most of the efforts in data reuse characterization and analysis, such as [8] and [9], are based on static analyses. These works do not provide a way to quantify data reuse behavior or to characterize any of the locality dimensions explained in [10]: whole-program, in program code, in program data, over time (program phases), and interaction between programs. Arguably, some of these dimensions might not be fully applicable to the SIMT environment. For example, analyzing the different phases of computation might not be efficient or relevant for certain kernels in SIMT processors, since these kernels are relatively short-lived (when compared to threads in a CMP environment). However, concepts analogous to whole-program locality bear significant relevance, in our view, because of the need to analyze memory access patterns and to exploit data reuse in GPU kernels. Our model attempts to create a locality signature of the program that is isolated from particular architectural limitations. In this way, we quantify and visualize the specific reuse behavior of the application. This is useful for assessing how code optimizations affect both data reuse behavior (temporal/spatial locality) and performance since, as shown in [30], the two do not always relate to each other in a straightforward way.

In contrast to [8] and [9], we preferred to use memory traces rather than static analyses. The main reason is that static methods cannot account for the nondeterminism of certain portions of the kernel. Memory traces can offer deep insight into applications, allowing more refined analyses that capture runtime variations, model them, and increase predictability [10]. However, the methodology we propose is relatively straightforward and has not been extended to provide prediction of any kind. In this work, we attempt to provide the locality signature of SIMT applications using a new metric, the data reuse degree, and a variation of the reuse distance concept.

Locality characterization of applications using reuse distance profiles, a concept introduced in [15] as LRU stack distance, has been widely used to predict cache performance and to measure different dimensions of locality. It has been applied to systems with serialized memory behavior (corresponding to uniprocessor systems), as in [30, 31, 32], and to systems that allow concurrent memory accesses, such as CMPs [11, 33, 34]. The resulting reuse distance profiles can be used to predict cache miss rates under assumptions about other cache parameters (LRU policy, constant associativity, etc.) and to analyze different dimensions of locality [10, 14]. Additional complications arise when analyzing applications running on CMP systems, as explained in [11], but prediction is still feasible.
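The link between reuse distance and miss rate can be illustrated with a minimal sketch of the classic serialized (uniprocessor) case. It assumes a fully associative LRU cache and one cache line per address; the function names and the sample trace are hypothetical, and the list-based stack is chosen for clarity rather than efficiency.

```python
from collections import Counter


def reuse_distances(addresses):
    """LRU stack distance of each access; None marks a first (cold) access.

    The distance is the number of distinct addresses referenced since the
    previous access to the same address.
    """
    stack = []  # most recently used address sits at the end
    distances = []
    for addr in addresses:
        if addr in stack:
            pos = stack.index(addr)
            distances.append(len(stack) - 1 - pos)
            stack.pop(pos)
        else:
            distances.append(None)
        stack.append(addr)
    return distances


def predicted_miss_rate(distances, capacity):
    """Miss rate of a fully associative LRU cache that holds `capacity` lines.

    An access misses if it is cold or if its distance is >= capacity.
    """
    if not distances:
        return 0.0
    misses = sum(1 for d in distances if d is None or d >= capacity)
    return misses / len(distances)


sample_accesses = ["a", "b", "c", "a", "b", "b", "d", "a"]
dists = reuse_distances(sample_accesses)
histogram = Counter(d for d in dists if d is not None)

print(dists)                          # [None, None, None, 2, 2, 0, None, 2]
print(predicted_miss_rate(dists, 2))  # 0.875
print(predicted_miss_rate(dists, 3))  # 0.5
```

Because the histogram is computed once and then queried for any capacity, a single profile can stand in for many cache simulations, provided the LRU and full-associativity assumptions are acceptable.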

The methodology detailed in [11] for capturing program locality from data reuse behavior in CMP systems is not entirely adequate for modeling locality in SIMT processors.

The reason is that private caches in CMP systems are in fact exclusive to each core, whereas the caches in SIMT processors are not private in the same sense. In an SIMT processor, all threads within one core cluster use the same caches, with a particular portion assigned to them when necessary.

Even attempting to perform the analysis on the portion assigned to each thread is impractical. This is because threads in SIMT processors are smaller in stream size and relatively short-lived when compared to their counterparts in most applications running on CMPs. Our work proposes a new methodology, entirely different from the one used in CMP systems, for analyzing locality in SIMT processors.

In [35], Tang et al. offer an analytical model based on stack distance to predict cache miss rates on GPUs. They acknowledge the impact that the programming model of SIMT processors has on locality analysis. They also consider very specific constraints and characteristics of the memory subsystem (Effect Point) coupled with program behavior (Access Point) to perform the stack distance profiling. Their work focuses on predicting miss rates and the occurrence of contention. The histograms obtained by this methodology are highly dependent on the cache parameters assumed (associativity, replacement policy).

Therefore, the resulting histograms are not inherent to the kernel itself. The authors also assume that cross-block data reuse is negligible. As mentioned before, the validity of this assumption depends on the particular kernel.

We attempt to improve on the approach proposed in [35] by quantifying data reuse on a per-memory-instruction basis and by efficiently building a histogram that captures the data reuse between two memory instructions at varying distances. In contrast to [35], we do not build stacks and traverse them on every access to build the histogram, since doing so makes the creation of the histograms problematic. In Section 2, we explained that this approach can alter the actual reuse distance depending on the order in which the addresses traverse the stack (or any other data structure, for that matter). This problem is not addressed at all in [35].

Another improvement over [35] is that our analysis is independent of the particulars of the cache parameters and considers the programming model and the amount of parallelism available given the code structure of the kernel.
