IX. EXPERIMENTAL RESULTS
9.4 Data Reuse Characteristic when applying code optimizations
In this section, the data reuse characteristics of selected applications are obtained after applying the coding optimizations explained in Section 8. Initially, we focus on how the data reuse characteristic changes under these optimizations for Scenario 2 (per thread block) of the data reuse characterization. Figures 71~77 show the charts for sta, gsim, bfs, nbf, moldyn, irreg and euler, respectively. Only the changes for the first block of each kernel are presented. Each of these kernels has over 100 blocks; analyzing and explaining all of them, with all of their variations, would require an extensive analysis that is beyond the current scope.
Figure 71: Data reuse characteristic for block 0 of sta after coding optimizations. (a) After applying thread clustering. (b) After applying thread and warp clustering. (c) After applying thread clustering, warp clustering and block scheduling. (d) Comparison prior to optimizations and after thread clustering. Difference in the reuse degree is 17. (e) Comparison prior to optimizations and after thread and warp clustering. Difference in the reuse degree is 802. (f) Comparison prior to optimizations and after thread clustering, warp clustering and block scheduling. Difference in the reuse degree is 802.
Figure 72: Data reuse characteristic for block 0 of gsim after coding optimizations. (a) After applying thread clustering. (b) After applying thread and warp clustering. (c) After applying thread clustering, warp clustering and block scheduling. (d) Comparison prior to optimizations and after thread clustering. Difference in the reuse degree is 400. (e) Comparison prior to optimizations and after thread and warp clustering. Difference in the reuse degree is 795. (f) Comparison prior to optimizations and after thread clustering, warp clustering and block scheduling. Difference in the reuse degree is 794.
Figure 73: Data reuse characteristic for block 0 of bfs after coding optimizations. (a) After applying thread clustering. (b) After applying thread and warp clustering. (c) After applying thread clustering, warp clustering and block scheduling. (d) Comparison prior to optimizations and after thread clustering. Difference in the reuse degree is 17. (e) Comparison prior to optimizations and after thread and warp clustering. Difference in the reuse degree is 802. (f) Comparison prior to optimizations and after thread clustering, warp clustering and block scheduling. Difference in the reuse degree is 802.
Figure 74: Data reuse characteristic for block 0 of nbf after coding optimizations. (a) After applying thread clustering. (b) After applying thread and warp clustering. (c) After applying thread clustering, warp clustering and block scheduling. (d) Comparison prior to optimizations and after thread clustering. Difference in the reuse degree is 44458. (e) Comparison prior to optimizations and after thread and warp clustering. Difference in the reuse degree is 60346. (f) Comparison prior to optimizations and after thread clustering, warp clustering and block scheduling. Difference in the reuse degree is 60576.
Figure 75: Data reuse characteristic for block 0 of moldyn after coding optimizations. (a) After applying thread clustering. (b) After applying thread and warp clustering. (c) After applying thread clustering, warp clustering and block scheduling. (d) Comparison prior to optimizations and after thread clustering. Difference in the reuse degree is 352556. (e) Comparison prior to optimizations and after thread and warp clustering. Difference in the reuse degree is 446146. (f) Comparison prior to optimizations and after thread clustering, warp clustering and block scheduling. Difference in the reuse degree is 447102.
Figure 76: Data reuse characteristic for block 0 of irreg after coding optimizations. (a) After applying thread clustering. (b) After applying thread and warp clustering. (c) After applying thread clustering, warp clustering and block scheduling. (d) Comparison prior to optimizations and after thread clustering. Difference in the reuse degree is 47309. (e) Comparison prior to optimizations and after thread and warp clustering. Difference in the reuse degree is 53448. (f) Comparison prior to optimizations and after thread clustering, warp clustering and block scheduling. Difference in the reuse degree is 43212.
Figure 77: Data reuse characteristic for block 0 of euler after coding optimizations. (a) After applying thread clustering. (b) After applying thread and warp clustering. (c) After applying thread clustering, warp clustering and block scheduling. (d) Comparison prior to optimizations and after thread clustering. Difference in the reuse degree is 59694. (e) Comparison prior to optimizations and after thread and warp clustering. Difference in the reuse degree is 81755. (f) Comparison prior to optimizations and after thread clustering, warp clustering and block scheduling. Difference in the reuse degree is 82204.
Let us first analyze sta. Figure 71(a) shows the resulting data reuse characteristic when the thread clustering technique is applied. The contour is largely the same as that of the data reuse characteristic prior to the optimization, presented in Section 9.1. However, closer observation reveals an interesting variation in the magnitude of the reuse degree at specific distances. To facilitate the comparison, Figure 71(d) shows the difference between the data reuse characteristic in Figure 71(a) and the corresponding characteristic in Figure 29(a). This chart is obtained by subtracting, at each reuse distance, the reuse degree in Figure 29(a) from the reuse degree in Figure 71(a). The sum of all the values in Figure 71(d) is shown at the bottom of the figure. The total reuse degree increases by 17 units when compared to the case with no optimization.
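The construction of the comparison chart can be sketched as follows. This is a minimal illustration, assuming each reuse characteristic is stored as a mapping from reuse distance to reuse degree; the function name and the sample values are hypothetical and are not taken from the measured data.

```python
def reuse_difference(before, after):
    """Per-distance difference (after - before) of two reuse characteristics.

    Both arguments map reuse distance -> reuse degree. A distance missing
    from one characteristic is treated as having reuse degree 0.
    """
    distances = sorted(set(before) | set(after))
    return {rd: after.get(rd, 0) - before.get(rd, 0) for rd in distances}


# Hypothetical values, for illustration only (not measured data).
before = {1: 100, 3: 40, 5: 20}
after = {1: 130, 3: 35, 5: 15}

diff = reuse_difference(before, after)
total_change = sum(diff.values())  # the single number reported under each chart
print(diff, total_change)
```

A positive entry in `diff` means that more reuse occurs at that distance after the optimization; the sum collapses the whole chart into one signed number.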
This sum, although not a rigorous way to compare the reuse characteristics of the two cases, still provides an intuitive measure of the effect that the optimizations have on the data reuse patterns. However, as we shall see shortly, there are cases in which this magnitude changes negatively, requiring a different interpretation.
Notice also that in Figure 71(d) the reuse degree at RD=3 and RD=5 has decreased, while the reuse degree at RD=1 has increased by a much larger amount. The two quantities do not seem to be directly related. As this observation shows, the reuse degree has increased significantly at the short distances, while it has decreased at the longer distances, though less substantially.
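One way to make the short-versus-long observation explicit is to sum the per-distance changes on each side of a chosen distance threshold. The sketch below assumes a difference chart stored as a mapping from reuse distance to change in reuse degree; the threshold, the function name and the values are hypothetical.

```python
def split_change(diff, threshold):
    """Sum the change in reuse degree at distances <= threshold and > threshold."""
    short = sum(v for rd, v in diff.items() if rd <= threshold)
    long_ = sum(v for rd, v in diff.items() if rd > threshold)
    return short, long_


# Hypothetical difference chart: a gain at RD=1, losses at RD=3 and RD=5.
diff = {1: 30, 3: -5, 5: -5}
short_change, long_change = split_change(diff, threshold=1)
print(short_change, long_change)
```

With these illustrative numbers the gain at the short distance outweighs the combined loss at the longer ones, which is the pattern described for Figure 71(d).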
A particular behavior is observed when the thread clustering optimization is coupled with warp clustering, as described in Section 8. The resulting data reuse characteristic is presented in Figure 71(b). The contour is still the same, but the reuse distance domain has been reduced by 23.5% (from 17 down to 13). Figure 71(e) presents the comparison chart between Figure 71(b) and Figure 29(a). The total reuse degree increases further, by 802, even though the reuse distance domain is also reduced, signaling a significant improvement. A smaller reuse domain means that data is less likely to be evicted before being requested by a different MI at this specific distance. In [18], Kuo et al. report a running time improvement of around 60% for sta when applying thread clustering and warp clustering. The performance gain can be explained in part by the changes seen in the data reuse characteristic.
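The 23.5% figure quoted for the reuse distance domain is the relative reduction from 17 to 13. A small sketch of the arithmetic, with a hypothetical helper name:

```python
def domain_reduction(before, after):
    """Relative reduction of the reuse distance domain, as a fraction of `before`."""
    return (before - after) / before


reduction = domain_reduction(17, 13)  # values quoted for sta, block 0
print(f"{reduction:.1%}")
```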
The data reuse characteristic for the case where all three optimization techniques (thread clustering, warp clustering and block scheduling) are applied is presented in Figure 71(c). For thread block 0 of sta, there is little improvement in the reuse characteristic compared to the case where only thread clustering and warp clustering are applied. However, a performance improvement of more than 80% is reported in [18]. This is because the charts in Figures 71~77 only capture the reuse characteristic within the block itself; any coding optimization that mimics a variation in the way blocks are scheduled will therefore not be entirely visible within the block. For nbf, moldyn, irreg and euler, however, the block scheduling optimization does have an effect on the code. Recall that these optimizations are coding optimizations mimicking a scheduling approach. An analysis of the effects of this coding optimization technique on the groups of applications mentioned is left for further research.
Figures 78~84 present the data reuse characteristic for the case where all blocks within the application run in parallel when the coding optimizations are applied.
Figure 78: Data reuse characteristic for all blocks running in parallel of sta after coding optimizations. (a) After applying thread clustering. (b) After applying thread and warp clustering. (c) After applying thread clustering, warp clustering and block scheduling. (d) Comparison prior to optimizations and after thread clustering. Difference in the reuse degree is -10216. (e) Comparison prior to optimizations and after thread and warp clustering. Difference in the reuse degree is 14229. (f) Comparison prior to optimizations and after thread clustering, warp clustering and block scheduling. Difference in the reuse degree is 11940.
Figure 79: Data reuse characteristic for all blocks running in parallel of gsim after coding optimizations. (a) After applying thread clustering. (b) After applying thread and warp clustering. (c) After applying thread clustering, warp clustering and block scheduling. (d) Comparison prior to optimizations and after thread clustering. Difference in the reuse degree is 8236. (e) Comparison prior to optimizations and after thread and warp clustering. Difference in the reuse degree is 21363. (f) Comparison prior to optimizations and after thread clustering, warp clustering and block scheduling. Difference in the reuse degree is 17129.
Figure 80: Data reuse characteristic for all blocks running in parallel of bfs after coding optimizations. (a) After applying thread clustering. (b) After applying thread and warp clustering. (c) After applying thread clustering, warp clustering and block scheduling. (d) Comparison prior to optimizations and after thread clustering. Difference in the reuse degree is -10812. (e) Comparison prior to optimizations and after thread and warp clustering. Difference in the reuse degree is 13361. (f) Comparison prior to optimizations and after thread clustering, warp clustering and block scheduling. Difference in the reuse degree is 11324.
Figure 81: Data reuse characteristic for all blocks running in parallel of nbf after coding optimizations. (a) After applying thread clustering. (b) After applying thread and warp clustering. (c) After applying thread clustering, warp clustering and block scheduling. (d) Comparison prior to optimizations and after thread clustering. Difference in the reuse degree is 6053. (e) Comparison prior to optimizations and after thread and warp clustering. Difference in the reuse degree is 7400. (f) Comparison prior to optimizations and after thread clustering, warp clustering and block scheduling. Difference in the reuse degree is 9138.
Figure 82: Data reuse characteristic for all blocks running in parallel of moldyn after coding optimizations. (a) After applying thread clustering. (b) After applying thread and warp clustering. (c) After applying thread clustering, warp clustering and block scheduling. (d) Comparison prior to optimizations and after thread clustering. Difference in the reuse degree is -3373236. (e) Comparison prior to optimizations and after thread and warp clustering. Difference in the reuse degree is -5170268. (f) Comparison prior to optimizations and after thread clustering, warp clustering and block scheduling. Difference in the reuse degree is -5791542.
Figure 83: Data reuse characteristic for all blocks running in parallel of irreg after coding optimizations. (a) After applying thread clustering. (b) After applying thread and warp clustering. (c) After applying thread clustering, warp clustering and block scheduling. (d) Comparison prior to optimizations and after thread clustering. Difference in the reuse degree is -4445. (e) Comparison prior to optimizations and after thread and warp clustering. Difference in the reuse degree is -4402. (f) Comparison prior to optimizations and after thread clustering, warp clustering and block scheduling. Difference in the reuse degree is -4647.
Figure 84: Data reuse characteristic for all blocks running in parallel of euler after coding optimizations. (a) After applying thread clustering. (b) After applying thread and warp clustering. (c) After applying thread clustering, warp clustering and block scheduling. (d) Comparison prior to optimizations and after thread clustering. Difference in the reuse degree is 8904. (e) Comparison prior to optimizations and after thread and warp clustering. Difference in the reuse degree is 7834. (f) Comparison prior to optimizations and after thread clustering, warp clustering and block scheduling. Difference in the reuse degree is 9172.
Notice once again the case of sta in Figure 78. The contour is the same, but the reuse degree magnitudes vary when compared to Figure 71. Figure 78(e) presents the comparison chart for the case where the three coding optimizations are applied. In this case, the total reuse degree over the distance domain is drastically reduced after applying the optimizations. The explanation is that, when all blocks execute in parallel and are optimized, the position of the MIs within the reference stream is reduced and/or the addresses accessed by a given MI are now accessed by a different MI earlier in the reference stream. As a result, one MI simultaneously accesses data on behalf of an MI that, in the original reference stream, would have accessed those addresses at a later position. When this occurs, the reuse degree for that distance is reduced, signaling some improvement under these circumstances, even though the reuse characteristic does not show an increase in the overall reuse degree magnitudes.
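The effect described above can be illustrated with a deliberately simplified model. The sketch assumes that the reference stream is a list of MI positions, each holding the set of addresses accessed at that position, and that the reuse distance is the number of positions between two consecutive accesses to the same address. These assumptions, the function name and the addresses are illustrative and may differ from the definitions used in the methodology.

```python
from collections import Counter


def reuse_histogram(stream):
    """Count reuses per distance in a reference stream.

    `stream` is a list with one set of addresses per MI position. The
    distance of a reuse is the number of positions since the previous
    access to the same address (an assumed, simplified definition).
    """
    last_seen = {}
    hist = Counter()
    for pos, addresses in enumerate(stream):
        for addr in addresses:
            if addr in last_seen:
                hist[pos - last_seen[addr]] += 1
            last_seen[addr] = pos
    return hist


# Hypothetical streams. In `original`, a later MI re-accesses two addresses.
# In `merged`, those accesses are served by the first MI, so the later MI
# no longer appears in the stream.
original = [{0xA0, 0xA4}, {0xB0}, {0xA0, 0xA4}]
merged = [{0xA0, 0xA4}, {0xB0}]

hist_original = reuse_histogram(original)
hist_merged = reuse_histogram(merged)
print(sum(hist_original.values()), sum(hist_merged.values()))
```

In this toy case the total reuse degree drops when the accesses are merged, although fewer memory references are issued overall, which is consistent with a negative difference not necessarily indicating worse behavior.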
Thoroughly analyzing why the reuse characteristic changes in this way when all blocks run in parallel and optimizations are applied, and contrasting this with the behavior observed for block 0 in Figures 29~36, would require analyzing each of the blocks of the kernels. However, as we have explained, the large amount of parallelism available in our idealized architecture model might be the cause.
A reasonable explanation for the performance improvement after optimizing is the increase in memory coalescing caused by rescheduling the blocks in a more efficient way. For now, neither the current methodology nor the analytical model provided here can quantify memory coalescing within a given MI. Therefore, the performance improvement cannot be explained by considering only the data reuse characteristic as analyzed so far. The scope of this paper is not to obtain a relationship between the reuse characteristic and the performance improvements, but the behavior just observed makes such an analysis necessary in future work.
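As a rough illustration of what quantifying coalescing within an MI could look like, the sketch below counts how many aligned memory segments the addresses of one warp touch. It is not part of the methodology presented here; the 128-byte segment size, the 32-thread warp, the 4-byte element size and the function name are all assumptions made for the example.

```python
def segments_touched(addresses, segment_bytes=128):
    """Number of distinct aligned segments covered by one warp's byte addresses.

    Fewer segments for the same number of threads indicates better
    coalescing (assumed model: one transaction per touched segment).
    """
    return len({addr // segment_bytes for addr in addresses})


# Hypothetical access patterns for a 32-thread warp reading 4-byte elements.
contiguous = [4 * t for t in range(32)]   # neighbouring threads, neighbouring words
strided = [128 * t for t in range(32)]    # each thread lands in its own segment

print(segments_touched(contiguous), segments_touched(strided))
```

Under these assumptions the contiguous pattern is served by a single segment while the strided pattern needs one segment per thread, a difference that the reuse characteristic alone does not capture.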
Similar behavior is observed for all the applications shown. When the optimizations are applied, the total data reuse degree increases and, in some cases, the reuse distance domain is reduced. Notice how all the applications maintain the contour of their reuse characteristic without significant changes, even when the reuse domain shrinks.