This section briefly explains the coding optimizations applied to the applications in our experiments. By analyzing how the data reuse characteristics change after optimization, while holding constant the resources available to exploit parallelism, it is possible to compare the effectiveness of the optimization procedures. These coding optimizations follow the thread mapping methodology for multi-level shared caches proposed in [9]. This methodology applies three optimization procedures to the application: thread clustering, warp clustering, and block scheduling.
8.1 Thread Mapping Methodology
The coding optimizations are implemented as a stand-alone library, and the procedure is performed by the host CPU. The core of the procedure consists of manipulating the threads within one kernel to modify the baseline order of execution. To do this, it is first necessary to analyze the data accesses of the application. For this purpose, [9] uses the compressed sparse row (CSR) format, since it facilitates the manipulation of the data accesses.
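As a minimal sketch of how per-thread data accesses might be held in CSR form, the following Python packs a list of access lists into two arrays. The input `accesses` and the names `build_csr`, `row_ptr`, `col_idx`, and `accesses_of` are illustrative assumptions and are not taken from [9] or from the library itself.

```python
def build_csr(accesses):
    """Pack per-thread access lists into CSR-style arrays.

    accesses[t] is the list of data-element indices read by thread t
    (a hypothetical input; in practice it comes from analyzing the kernel).
    """
    row_ptr = [0]          # row_ptr[t] .. row_ptr[t + 1] delimits thread t
    col_idx = []           # concatenated data-element indices
    for elems in accesses:
        col_idx.extend(elems)
        row_ptr.append(len(col_idx))
    return row_ptr, col_idx


def accesses_of(row_ptr, col_idx, t):
    """Return the data elements accessed by thread t."""
    return col_idx[row_ptr[t]:row_ptr[t + 1]]


# Four threads; thread 2 accesses nothing.
accesses = [[0, 4], [4, 5, 6], [], [1]]
row_ptr, col_idx = build_csr(accesses)
```

Keeping one row per thread means that permuting threads only requires permuting rows, which is one reason a CSR-like layout is convenient for this kind of reordering.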
The thread mapping on the GPU side is accomplished by applying techniques similar to those of data and computation reordering [9]. Since GPUs do not allow programmers to explicitly schedule threads or thread blocks for execution, this behavior must be mimicked indirectly. Thus, the methodology combines appropriate data layout techniques with the re-mapping of thread indexes [9]. In essence, the data contained in the data structures and associated arrays (array of structures) is rearranged so that threads access it in a different sequence than in the baseline [9]. A new thread mapping array is constructed that maps the baseline thread index of each thread to a different value, equivalent to a function that can be expressed as:
$$t_{new} = f(t_{base})$$
where $t_{new}$ represents the new thread index, $t_{base}$ represents the baseline thread index assigned by the GPU, and $f()$ is the function that performs the mapping.
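The index re-mapping can be emulated on the host to see its effect. The sketch below is a hypothetical Python stand-in for a kernel launch, not the code of [9]: the loop plays the role of the hardware-assigned thread indexes, and `thread_map` is the mapping array. The names `run_kernel` and `thread_map` and the doubling operation are assumptions chosen for illustration.

```python
def run_kernel(data, thread_map):
    """Emulate a kernel in which each thread works on data[f(baseline index)]."""
    out = [0] * len(data)
    visit_order = []
    for tid in range(len(data)):       # baseline index, fixed by the hardware
        new_tid = thread_map[tid]      # new index = f(baseline index)
        visit_order.append(new_tid)
        out[new_tid] = 2 * data[new_tid]
    return out, visit_order


data = [10, 20, 30, 40]
identity = [0, 1, 2, 3]                # baseline: f is the identity
remapped = [2, 0, 3, 1]                # hypothetical mapping array

out_base, order_base = run_kernel(data, identity)
out_new, order_new = run_kernel(data, remapped)
```

Because the mapping array is a permutation of the baseline indexes, both runs produce the same output; only the sequence in which the data is touched changes.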
By making use of these two ordering techniques, it is possible to manipulate the formation of warps and blocks, and mimic alterations in the order in which they are issued for execution.
The result is better-coordinated memory accesses that exploit the benefits of data reuse [9].
The methodology proposed in [9] consists of the following steps:
1. Generate information about the data utilization and architectural parameters of the SIMT processor: to obtain information about the data utilization, a data sharing and volume graph is generated. The task of gathering architectural parameters of the SIMT processor is trivial for our purposes.
2. Thread and Warp Clustering: threads are grouped into warps such that the resulting warps issue the minimum number of memory transactions.
3. Thread Block Scheduling: thread blocks that exhibit significant data sharing are scheduled in adjacent issue slots.
4. Resource Utilization Throttling: the depth of multithreading is throttled to make use of the shared cache while avoiding contention.
The data sharing and data volume among threads are modeled as a hypergraph called the Data Sharing and Volume Graph (DSVG), defined as [9]:
$$DSVG = (V, E, w_v, w_e)$$
in which $V$ is the set of vertices, $E$ is the set of hyperedges, $w_v$ is the vertex weight, and $w_e$ is the weight of the hyperedge. In this model, threads are represented as vertices within the DSVG, and the associated weight represents the amount of private data of the thread. A hyperedge represents threads that share data, and the weight associated with the hyperedge indicates the amount of shared data. The data sharing within a set of vertices, i.e., a set of threads, is represented by the group of hyperedges incident to a vertex. Likewise, data sharing that involves other sets corresponds to hyperedges that are external relative to the given set.
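A small Python sketch can make the DSVG concrete. It assumes, for illustration only, that data volume is counted in elements, that an element accessed by a single thread is private to it, and that elements shared by the same set of threads are merged into one hyperedge. The function `build_dsvg` and its input are hypothetical and not taken from [9].

```python
from collections import defaultdict


def build_dsvg(accesses):
    """Build vertex and hyperedge weights from per-thread access lists."""
    users = defaultdict(set)                 # data element -> threads using it
    for t, elems in enumerate(accesses):
        for e in elems:
            users[e].add(t)

    vertex_weight = [0] * len(accesses)      # private data per thread
    edge_weight = defaultdict(int)           # shared data per set of threads
    for threads in users.values():
        if len(threads) == 1:
            (t,) = threads
            vertex_weight[t] += 1
        else:
            edge_weight[frozenset(threads)] += 1
    return vertex_weight, dict(edge_weight)


accesses = [[0, 4], [4, 5, 6], [5, 6, 7], [1]]
vertex_weight, edge_weight = build_dsvg(accesses)
```

In this example, threads 1 and 2 share two elements, so the hyperedge joining them carries weight 2, while thread 1 has no private data and therefore a vertex weight of 0.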
Thread clustering groups threads into warps so as to minimize memory transactions. Warp clustering then groups these newly formed warps into thread blocks with an increased amount of data sharing. Both the Thread Clustering and Warp Clustering techniques can be reduced to the hypergraph partitioning problem, a well-known NP-hard problem [9].
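To illustrate the objective of thread clustering, the sketch below counts memory transactions per warp and compares a baseline grouping with a reordered one. It is not the hypergraph partitioner of [9]: it swaps in a simple greedy sort, and it assumes that one transaction corresponds to one distinct aligned segment of `segment` elements and that every thread accesses at least one element. All names are hypothetical.

```python
def warp_transactions(warp, accesses, segment=4):
    """Count the distinct memory segments touched by the threads of one warp."""
    return len({e // segment for t in warp for e in accesses[t]})


def total_transactions(warps, accesses, segment=4):
    return sum(warp_transactions(w, accesses, segment) for w in warps)


def chunk(order, warp_size):
    """Split an ordered list of threads into consecutive warps."""
    return [order[i:i + warp_size] for i in range(0, len(order), warp_size)]


def cluster_by_segment(accesses, warp_size, segment=4):
    """Greedy stand-in: sort threads by the lowest segment they touch."""
    order = sorted(range(len(accesses)),
                   key=lambda t: min(accesses[t]) // segment)
    return chunk(order, warp_size)


# Eight threads with interleaved accesses, warps of four threads.
accesses = [[0], [8], [1], [9], [2], [10], [3], [11]]
baseline = chunk(list(range(8)), 4)
clustered = cluster_by_segment(accesses, 4)
```

With these inputs each baseline warp touches two segments, whereas each clustered warp touches one, so the total number of transactions drops from 4 to 2.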
Thread Block Scheduling arranges the issuing order of the blocks with the objective of reusing data through the L2 cache. Projecting the result of Thread and Warp Clustering yields a new DSVG that enables a formulation of the Thread Block Scheduling problem. In this case, each vertex in the new DSVG represents a thread block, and a new function is defined that maps each vertex to an integer in a one-to-one relationship. The integers represent the scheduling sequence.
As previously explained, the reuse distance is the number of distinct memory accesses between references to the same shared data. Thread Block Scheduling uses this definition to derive a new metric: the total reuse distance. This metric is the sum of all the reuse distances that result from applying a specific scheduling function to the vertices of the hypergraph. Therefore, the mapping function must be selected so as to minimize that sum. The Thread Block Scheduling problem is in fact a generalization of the vertex ordering problem [9].
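The total reuse distance of a schedule can be sketched as follows. As a simplifying assumption, the reuse distance between two blocks that share data is approximated here by the number of other blocks issued between them, which is a block-granularity proxy and not the exact metric of [9]. The exhaustive search in `best_schedule` is only viable for tiny inputs and stands in for whatever heuristic is actually used; all names are hypothetical.

```python
from itertools import permutations


def total_reuse_distance(order, hyperedges):
    """Sum, over all hyperedges, the gaps between consecutive sharing blocks."""
    pos = {block: slot for slot, block in enumerate(order)}
    total = 0
    for edge in hyperedges:
        slots = sorted(pos[b] for b in edge)
        total += sum(b - a - 1 for a, b in zip(slots, slots[1:]))
    return total


def best_schedule(blocks, hyperedges):
    """Exhaustively pick the issue order with the smallest total distance."""
    return min(permutations(blocks),
               key=lambda order: total_reuse_distance(order, hyperedges))


# Hypothetical sharing pattern among four thread blocks.
hyperedges = [{0, 3}, {1, 2}, {2, 3}]
baseline = (0, 1, 2, 3)
best = best_schedule(range(4), hyperedges)
```

For this sharing pattern the baseline order leaves two blocks between blocks 0 and 3, while an order that places every sharing pair in adjacent slots brings the total to zero.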
The fourth stage of the methodology, Resource Throttling, finds the best way to utilize the last-level shared cache [9], currently the L2 cache in SIMT processors. This last step is not included in our experiments because we model the memory subsystem as ideal. As previously explained, this work models only those aspects of the core cluster that affect the ordering of the MIs, under an ideal memory subsystem.