definition of reuse distance for SIMT processors given in Section 4. This combination would allow us to obtain a data reuse characteristic for SIMT applications as each thread executes its corresponding instructions. However, to build the complete data reuse characteristic, it is necessary to develop a step-by-step description of the methodology and to establish the physical meaning behind it.
In more conventional processors, the data utilization behavior is in principle uni-dimensional (one address per memory instruction). As explained in the previous section, a data structure per memory subsystem (either private or shared) holds an entry for each address accessed. Every time a new memory access is issued, the data structure is traversed to determine whether the same address has been accessed before. This methodology of analysis is inappropriate for SIMT processors for the following reasons.
First, since many addresses are requested simultaneously, it is unclear which address should occupy which position within the stack. Figure 9(b) illustrates this issue for the execution of MIs 'i' and 'i+1', shown in Figure 9(a). Option 1 in Figure 9(b) shows one possible ordering of the addresses in the stack when MI 'i' executes. With this ordering, the RD values for addresses 'A' and 'C' when MI 'i+1' executes are both equal to 1. If, on the other hand, the addresses accessed by MI 'i' are pushed onto the stack in a different order, as shown in Option 2 of Figure 9(b), the RD values for addresses 'A' and 'C' change from 1 to 4. The issue arises because the parallelism exploited by SIMT architectures makes the locality behavior more complex than this methodology can model appropriately. Notice also the stack growth from one MI to the next as compared to the cases in Figures 5, 6 and 7.
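The ordering problem can be reproduced with a minimal Python sketch of a conventional reuse stack. The five addresses and the two push orders below are hypothetical; they are not taken from Figure 9 and were chosen only so that the resulting distances match the values quoted above (1 and 4). The function name `stack_distances` is likewise an assumption made for illustration.

```python
def stack_distances(push_order, accesses):
    """Return the stack depth at which each accessed address is found.

    push_order: addresses of MI 'i', in the order they are pushed onto the stack.
    accesses:   addresses of MI 'i+1', processed one at a time.
    Depth 0 is the top of the stack, so the depth equals the number of
    distinct addresses sitting above the one being accessed.
    """
    stack = []  # index 0 is the top of the stack
    for addr in push_order:
        stack.insert(0, addr)

    distances = {}
    for addr in accesses:
        depth = stack.index(addr)   # traverse the stack to find the address
        distances[addr] = depth
        stack.pop(depth)            # move the reused address back to the top
        stack.insert(0, addr)
    return distances


# Same set of addresses for MI 'i', pushed in two different (hypothetical) orders.
option_1 = stack_distances(["B", "D", "E", "A", "C"], ["A", "C"])
option_2 = stack_distances(["A", "C", "B", "D", "E"], ["A", "C"])
print(option_1, option_2)
```

Both calls describe the same pair of memory instructions, yet the distances reported for 'A' and 'C' differ, which is the ambiguity that a per-address stack cannot resolve for simultaneous accesses.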
The second reason is that reuse stacks are expected to grow much faster in SIMT machines, owing to the large amount of parallelism present in the architecture. Threads may have largely independent data sets, in which case traversing a potentially large stack for the thousands of different addresses referenced by a single MI is too computationally intensive. The third reason is that the locality characteristic captured by the traditional methodology assumes a sequencing of the accessed memory addresses that is imposed by the execution and programming model of conventional processors. This assumption, which is inherent to the methodology used to analyze applications on those architectures, is only valid when the data utilization behavior is essentially uni-dimensional.
Figure 9: Limitations when performing the baseline methodology for reuse distance analysis on SIMT processors. (a) Subset of the reference stream as it appears in Figure 7. (b) Possible ways to arrange the accessed addresses on the stacks.
Keeping track of each individual address in SIMT applications is a challenging task, to say the least. In addition, the optimization procedure for the kernels these architectures execute is not based on individual addresses, but on the simultaneous multithreading capabilities of the architecture [8, 9]. Code tuning optimizations are therefore implemented from a broader scope, by carefully analyzing the memory access patterns of the applications.
Consequently, we provide a better suited methodology to obtain the data reuse characteristic of SIMT applications based on the data reuse degree and the previously defined reuse distance.
The detailed methodology is shown in the flow chart in Figure 10 and summarized in Algorithm 1. We describe it as follows:
1. Select a memory instruction 'i' for analysis.
2. Scan through the addresses accessed by MI 'i' and build the address array 'Xi'. Store this information in a data structure. In this data structure, each entry is keyed with the address, and the value 'v' of the entry is M(Xi), i.e. the multiplicity 'M' of that address in the address array 'Xi' of MI 'i'.
shows all the steps when performing the analysis over a reference stream.
3. Select the subsequent MIs 'j', where j > i. We can define j = i+k, where k ≥ 1, imposing the condition j ≤ S-1 (with MIs indexed from 0, as in Algorithm 1), where 'S' is the number of MIs in the reference stream of the application.
4. Create a different data structure for MI 'j' similar to the one created for MI 'i', and create address array 'Xj'. As with step 2, the keys are the addresses in the address array of 'j' and the value of each entry is the corresponding multiplicity of the addresses in the array.
5. Search for the common addresses in both data structures, and determine the total data reuse degree (DS) between MI 'i' and 'j'.
6. Once the DS has been calculated, the result is accumulated in the reuse distance histogram in the entry corresponding to RD = j - i, which is positive because 'j' appears later than 'i' in the reference stream.
7. Steps 2~6 are repeated for all subsequent MIs that appear later than 'j' in the reference stream. The data reuse degrees for all subsequent cases are accumulated in the corresponding entries of the reuse distance histogram.
8. Steps 1~7 are repeated for all subsequent MIs that appear later than 'i' in the reference stream. The data reuse degrees are stored in the corresponding reuse distance histogram entries. Thus, the data reuse characteristic is generated.
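Steps 2, 4 and 5 can be illustrated for a single pair of MIs with a short Python sketch. It assumes that the data reuse degree between 'i' and 'j' is the sum of the multiplicities, in MI 'j', of the addresses that also appear in MI 'i'; the definition given in Section 4 may differ. The function names and the sample addresses are hypothetical.

```python
from collections import Counter


def address_struct(addresses):
    """Steps 2 and 4: key each address and store its multiplicity as the value."""
    return Counter(addresses)


def reuse_degree(m_i, m_j):
    """Step 5: sum the multiplicities in MI 'j' of addresses also present in MI 'i'.

    This follows one reading of the methodology; it is an assumption,
    not the definitive definition of the data reuse degree.
    """
    return sum(v for addr, v in m_j.items() if addr in m_i)


# Hypothetical address arrays Xi and Xj (one entry per simultaneous access).
x_i = [0x100, 0x100, 0x104, 0x108]
x_j = [0x100, 0x108, 0x108, 0x10C]

m_i = address_struct(x_i)
m_j = address_struct(x_j)
pair_degree = reuse_degree(m_i, m_j)
print(dict(m_i), dict(m_j), pair_degree)
```

Because each structure is keyed by address, the lookup in step 5 is a hash probe per distinct address of 'j' rather than a traversal of a stack.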
Algorithm 1. DR_Analysis( RF )              // RF is the reference stream
    S = RF.size;
    for i = 0 to S-1                        // Step 1
        addressStruct(mI, RF(i));           // Step 2
        for j = i+1 to S-1                  // Step 3
            addressStruct(mJ, RF(j));       // Step 4
            DR = 0;                         // reset the reuse degree for the pair (i, j)
            for a = 0 to mJ.size-1          // Step 5
                if ( mI.find(mJ(a)) )
                    mIJ.put(mJ(a));         // keep the common address
                    DR = DR + mJ(a).v;      // add its multiplicity
            RD = j - i;                     // Step 6a
            hist(RD) = hist(RD) + DR;       // Step 6b

Function addressStruct( m, MI )             // scans addresses; m is assumed empty on entry
    for k = 0 to MI.simAccesses-1
        is = m.find( MI(k).address );
        if ( is )
            m[MI(k).address]++;             // calculates multiplicity
        else
            m.insert(MI(k).address);
            m[MI(k).address] = 1;
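A runnable Python sketch of Algorithm 1 follows. It is one possible reading of the pseudocode, not a definitive implementation: it assumes the reference stream is a list of MIs, each given as the list of addresses it accesses, and that the reuse distance of a pair is j - i. The name `dr_analysis` and the sample stream are hypothetical. Unlike the pseudocode, the sketch builds each address structure once up front instead of rebuilding it for every pair.

```python
from collections import Counter, defaultdict


def dr_analysis(reference_stream):
    """Accumulate data reuse degrees into a reuse distance histogram.

    reference_stream: list of MIs, each a list of accessed addresses.
    Returns a dict mapping reuse distance (j - i) to the accumulated degree.
    """
    # Steps 2 and 4: one address structure (address -> multiplicity) per MI.
    structs = [Counter(mi) for mi in reference_stream]
    hist = defaultdict(int)

    for i in range(len(structs)):                # Step 1
        for j in range(i + 1, len(structs)):     # Step 3
            # Step 5: reuse degree of the pair, summed over common addresses.
            dr = sum(v for addr, v in structs[j].items() if addr in structs[i])
            hist[j - i] += dr                    # Step 6
    return dict(hist)


# Hypothetical reference stream with three MIs.
stream = [["A", "B", "C"], ["A", "C"], ["B", "B", "D"]]
histogram = dr_analysis(stream)
print(histogram)
```

The double loop visits every ordered pair of MIs, so the cost grows quadratically with the length of the reference stream, while the work per pair depends only on the number of distinct addresses in the later MI.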
This model uses reduction operations that concentrate all the reuse degree values from the reference stream in the respective histogram entries. This gives insight into how frequently the application reuses data, at which frequency it reuses data the most, and what the total degree of reuse is at each of these frequencies.
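As a small illustration of how such a histogram might be read, the Python sketch below reports the total reuse, the reuse distance with the largest accumulated degree, and the share of reuse at each distance. The histogram values and the function name `summarize` are hypothetical and do not come from any measured application.

```python
def summarize(hist):
    """Summarize a reuse distance histogram (reuse distance -> accumulated degree)."""
    total = sum(hist.values())
    # Reuse distance at which the application reuses data the most.
    peak = max(hist, key=hist.get)
    # Fraction of the total reuse observed at each distance.
    share = {rd: dr / total for rd, dr in sorted(hist.items())}
    return total, peak, share


# Hypothetical histogram, for illustration only.
example_hist = {1: 6, 2: 3, 3: 1}
total, peak, share = summarize(example_hist)
print(total, peak, share)
```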
One particular property of our methodology is that, unlike the case for CMP systems, it does not depend on the details of the SIMT architecture's memory subsystem, or on any other implementation details. The data structures are built based on the MIs of the reference stream, and the histogram is cumulative. The MIs are analyzed in the order in which they are expected to be issued from the application's perspective. In a real architecture, the ordering of the MIs and the addresses they access depend heavily on the practical limitations of the architecture itself: the number of simultaneous transactions carried out by the memory subsystem, whether there are bank conflicts, the bandwidth utilization, etc. Thus, there is no explicit trade-off between accuracy and generality embedded in our methodology, since it is mostly determined by the code structure of the kernel.
As mentioned, the methodology allows the reference stream to be analyzed independently of any particular architecture. This makes it possible to model locality under different conditions, yielding a data reuse degree characteristic abstracted from the limitations of current architectures.
For example, given a proper instrumentation tool, we would be able to obtain the reuse distance histogram assuming no limitations in the architecture's resources: infinite SMs, load/store units, thread capacity allocation, registers, issuing capabilities, bandwidth, etc. In this way, we model the application's locality under a very controlled environment, solely dependent on the application's structure and the SIMT programming model, a feature particularly useful given the fast pace at which SIMT architectures are currently evolving.