
So far, we have detailed the methodology of the analysis in Section 5 using the metrics defined in Sections 4 and 5. The objective is to generate a data reuse characteristic of the application. In this section, we describe the different scenarios in which the data reuse characteristic is obtained, and certain properties of SIMT processors that are considered in the analysis. In this work, we focus specifically on GPU architectures, since they are the most widely used, although our methodology can be applied to other types of SIMT processors.

In order to perform the analysis, we developed an experimental framework, programmed in C++, which we call the Locality Analyzer for SIMT applications. The details of our framework are explained in Section VII. The framework allows the user to modify certain parameters and to choose the desired scenario. The six scenarios of the analysis implemented so far are:

Scenario 1. Analysis with infinite resources, thread blocks serialized: All threads are modeled as executing in parallel, while blocks are modeled as executing one at a time.

Scenario 2. Analysis with infinite parallel resources, within block analysis: All threads in the block are modeled as executing in parallel. The execution model of thread blocks is irrelevant for this scenario.

Scenario 3. Analysis with infinite resources, all blocks are modeled executing in parallel: all threads are modeled as executing in parallel, no resource limitations of any kind.

Scenario 4. Analysis with infinite resources, 'K' blocks are modeled as executing in parallel: all threads in a block are modeled as executing in parallel, and the number of blocks that can execute concurrently is limited. The 'K' concurrent blocks are chosen according to a scheduling policy.

Scenario 5. Analysis with limited resources, 'K' blocks in parallel: core clusters are modeled, and blocks are assigned to them according to a specific scheduling policy. The analysis focuses on the local reference streams of each core cluster.

Scenario 6. Analysis with limited resources, 'K' blocks in parallel: core clusters are modeled, and blocks are assigned to them using a scheduling policy. The analysis focuses on the resulting reference stream of the core cluster collective.


6.1 Infinite resources, thread blocks are modeled as executing sequentially

In Scenario 1, thread blocks are modeled as executing sequentially. This results in a reference stream in which instructions are repeated with nearly uniform frequency. However, branch divergence and other runtime dynamics will modify the reference streams of different blocks. This type of analysis does not match the execution behavior of current SIMT processors: in a real architecture, many blocks can execute in parallel on different core clusters. Nevertheless, by performing the analysis assuming block serialization, we can observe important characteristics of the cross-block data reuse behavior, if any. Figure 11(a) illustrates this case.

6.2 Infinite resources, analysis within each thread block

Scenario 2 executes under the same conditions as Scenario 1, except that the analysis is performed within the thread block. That is, the data reuse characterization is performed only within the reference stream of the block occupying the ideal core cluster. A core cluster with such characteristics allows all threads to execute simultaneously. In terms of the reference stream, this means that all memory requests by the threads are served concurrently, with no latency or scheduling that can affect the ordering of the memory instructions. The analyses in Scenarios 1 and 2 are performed assuming this ideal core cluster. Figure 10(b) illustrates this case.
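The difference between Scenarios 1 and 2 can be illustrated with a minimal C++ sketch. It is not taken from the Locality Analyzer: the names `count_reuses`, `reuses_serialized` and `reuses_within_blocks` are hypothetical, and the reuse count used here (references to an address already seen in the stream) is only a stand-in for the metrics defined in Sections 4 and 5.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_set>
#include <vector>

using Address = std::uint64_t;
using ReferenceStream = std::vector<Address>;

// Stand-in metric (assumption): number of references to an address
// that already appeared earlier in the same stream.
std::size_t count_reuses(const ReferenceStream& stream) {
    std::unordered_set<Address> seen;
    std::size_t reuses = 0;
    for (Address a : stream) {
        if (!seen.insert(a).second) ++reuses;  // insert fails if already seen
    }
    return reuses;
}

// Scenario 1 view: blocks execute one at a time, so their streams are
// analyzed back to back as a single stream (cross-block reuse is visible).
std::size_t reuses_serialized(const std::vector<ReferenceStream>& blocks) {
    ReferenceStream all;
    for (const auto& b : blocks) all.insert(all.end(), b.begin(), b.end());
    return count_reuses(all);
}

// Scenario 2 view: each block is analyzed in isolation, so only
// reuse inside a block is counted.
std::size_t reuses_within_blocks(const std::vector<ReferenceStream>& blocks) {
    std::size_t total = 0;
    for (const auto& b : blocks) total += count_reuses(b);
    return total;
}
```

Under these assumptions, the gap between the two counts gives a rough indication of how much reuse crosses block boundaries.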

6.3 Infinite resources, all threads are modeled as executing in parallel

As in the first two scenarios, Scenario 3 of the data reuse characterization is also performed assuming an ideal core cluster. In this case, however, the highest theoretical amount of parallelism becomes available by allowing all blocks to execute simultaneously. In order to model this behavior properly, the reference streams of all blocks are merged into one single reference stream. The resulting reference stream is called the “Aggregate Block”, a concept which we will define formally shortly. In this case, the maximum number of threads that are able to issue memory requests is equal to the number of MIs executing


Figure 12: Scenario 3 of data reuse characterization assuming ideal core clusters. (a) Illustrates the way blocks are intended to be executed. (b) Illustrates the Aggregate Block that results from merging the streams of all CTAs.

The analysis performed in Scenario 3 offers significant insight into the application's data reuse behavior. The model illustrated in Figure 12 maintains the constraints of the execution model inherent to SIMT processors, without the practical parallelism and memory subsystem limitations of a specific architecture. Performing the analysis under such conditions provides a data reuse characteristic that depends only on the application's code structure and input. Scenario 3 thus yields a reference data reuse characteristic that can be used to compare how different coding optimization techniques impact the data reuse characteristic, and how close they get to reproducing the ideal characterization. For example, a code can have a relatively small number of blocks that can be allocated to one core cluster. If there are enough core clusters, and each cluster has enough resources, the ideal data reuse characteristic will be reproduced. Having a reference data reuse characteristic makes it possible to quantify how optimization procedures alter the reuse behavior.

6.4 Infinite resources, a number ‘K’ of thread blocks modeled as executing in parallel

Scenario 4 is the first one that models limitations on the number of blocks that can execute concurrently. Its distinguishing feature is that it allows different block scheduling policies to be included.

Figure 13: Scenario 4 of the reuse degree characterization. Only 'K' blocks are modeled as executing concurrently. This is equivalent to having 'K' ideal core clusters, each one executing one block at a time.

Figure 13 illustrates this situation. There are 'K' ideal core clusters available, each one running a single block. The scheduling policy in use selects the next block to execute in each core cluster, resulting in varying reference streams per core cluster.


Figure 14: (a) An ideal core cluster executing the reference streams of 'K' parallel blocks. (b) The streams of parallel blocks merged into a series of aggregate blocks, executed in sequence.

Since the core clusters considered in Scenario 4 are also ideal, the effect of having blocks executing in parallel can be modeled as in Figure 14. These ideal core clusters represent an abstraction of parallel resources, which we call an array of concurrent slots and define as follows.

Definition 3. An array of concurrent slots is an abstraction of a collection of parallel resources capable of executing the instruction/reference streams of a given number of blocks simultaneously.

In the Locality Analyzer, concurrent slots appear only in arrays of more than one element.

The blocks within each array of concurrent slots execute in parallel, but arrays are serialized with respect to each other. The blocks running in parallel in one ideal core cluster, i.e., one array of concurrent slots, as in Figure 14(a), can be merged together to create a series of ⌈N/K⌉ aggregate blocks, where N is the total number of blocks. We define an aggregate block as follows:

Definition 4. An aggregate block is the reference stream that results from merging the reference streams of the blocks in the corresponding array of concurrent slots.


These aggregate blocks are then serialized as shown in Figure 14(b). The analysis is therefore performed over serialized aggregate reference streams on ideal core clusters. Note that the blocks in concurrent slots are not always merged into an aggregate block: whether merging takes place depends on the analysis performed and on the parallel resources that the analysis models.
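The construction of aggregate blocks can be sketched in C++ as follows. This is an illustration under stated assumptions, not the Locality Analyzer's implementation: the text does not specify the order in which references are merged or which scheduling policy selects the 'K' blocks, so the sketch assumes in-order block selection and round-robin interleaving (one reference per block per turn). The names `merge_round_robin` and `build_aggregate_blocks` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Address = std::uint64_t;
using ReferenceStream = std::vector<Address>;

// Merge the streams of blocks [first, last), i.e. the blocks occupying one
// array of concurrent slots, into one aggregate block.
// Assumption: round-robin interleaving, one reference per block per turn.
ReferenceStream merge_round_robin(const std::vector<ReferenceStream>& blocks,
                                  std::size_t first, std::size_t last) {
    std::size_t longest = 0;
    for (std::size_t b = first; b < last; ++b)
        longest = std::max(longest, blocks[b].size());

    ReferenceStream merged;
    for (std::size_t i = 0; i < longest; ++i)
        for (std::size_t b = first; b < last; ++b)
            if (i < blocks[b].size()) merged.push_back(blocks[b][i]);
    return merged;
}

// Group N blocks into ceil(N/K) arrays of at most K concurrent slots and
// merge each array. The returned aggregate blocks are meant to be analyzed
// in sequence. Assumption: blocks are scheduled in index order.
std::vector<ReferenceStream> build_aggregate_blocks(
        const std::vector<ReferenceStream>& blocks, std::size_t k) {
    assert(k > 0);
    std::vector<ReferenceStream> aggregates;
    for (std::size_t first = 0; first < blocks.size(); first += k) {
        std::size_t last = std::min(first + k, blocks.size());
        aggregates.push_back(merge_round_robin(blocks, first, last));
    }
    return aggregates;
}
```

With K equal to the total number of blocks, the function returns a single aggregate block, which corresponds to the Scenario 3 case; with K = 1, each aggregate block is just one original block.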

6.5 Limited resources, ‘K’ blocks modeled as executing in parallel, within core cluster analysis

Scenarios 5 and 6 of the analysis obtain the data reuse characteristic under conditions in which architectural limitations of the core clusters are modeled. On NVIDIA GPUs, threads are scheduled on a per-warp basis, and a warp contains 32 threads. The warp size harmonizes with other design characteristics of NVIDIA's GPUs: memory bus sizes, cores and functional units. The latter play a major role in the number of cycles needed for a warp to fully execute one instruction.

For memory instructions, the number of cycles per instruction per warp, assuming an ideal memory subsystem, depends on the number of load/store units available to each warp in a given cycle. Consequently, only a specific number of threads can issue memory accesses in each cycle. In certain commercial GPUs, the number of load/store units available to a warp in one cycle is usually 16, the size of a half-warp. As a consequence, the number of cycles necessary to complete a memory instruction increases.
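As a worked example of this arithmetic, the following C++ sketch computes the number of issue cycles for one warp-level memory instruction as the ceiling of the warp size divided by the number of load/store units available per cycle. The function name `issue_cycles` is hypothetical, and the formula assumes an ideal memory subsystem in which the load/store units are the only limiting resource.

```cpp
#include <cassert>

// Cycles needed for one warp to issue one memory instruction, assuming
// each load/store unit serves one thread per cycle and the memory
// subsystem is ideal: ceil(warp_size / lsu_per_cycle).
unsigned issue_cycles(unsigned warp_size, unsigned lsu_per_cycle) {
    assert(lsu_per_cycle > 0);
    return (warp_size + lsu_per_cycle - 1) / lsu_per_cycle;
}
```

For a 32-thread warp and 16 load/store units, `issue_cycles(32, 16)` evaluates to 2, i.e. one cycle per half-warp.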

Scenarios 5 and 6 analyze the effect of these conditions on the data reuse characteristic of an application.


Figure 15: Scenario 5 of the data reuse characterization analysis. Each core cluster now has a finite number of load/store units. The analysis is performed within each core cluster.

Figure 15 illustrates the case for Scenario 5. In this case, there is a finite number of load/store units per core cluster. The warp scheduler inside the core cluster can issue only as many memory instructions as the load/store units can service. In our framework, the number of load/store units can be set at runtime by the user. This scenario models an additional resource constraint that reduces the total amount of parallelism that the SIMT processor can exploit. In Figure 15, the details of the memory subsystem are modeled ideally, so as not to make the analysis depend on the architecture.

The data reuse characterization of Scenario 5 is done on a per core cluster basis. Each core cluster is assigned a series of thread blocks, as shown in Figure 15. As mentioned before, each core cluster is a more physical representation of the array of concurrent slots. In this case, aggregate blocks are not used, even though blocks are assigned to each array of concurrent slots: the merging process does not take place, although parallelism can still be exploited. Scenario 6, in contrast, does perform the block merging, as we shall see, and characterizes the data reuse behavior from a different perspective.

Scenario 5 analyzes the reference stream resulting from the serialized blocks assigned to each core cluster. This analysis captures the data reuse behavior that could be taken advantage of by an ideal shared memory subsystem within a specific cluster. The scheduling policy and the number of core clusters in the architecture will definitely have an impact on the reuse characteristic under such circumstances.
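The per-cluster view of Scenario 5 can be sketched in C++ as follows. The sketch is illustrative only: the text leaves the scheduling policy open, so a simple round-robin assignment (block b goes to cluster b mod C) is assumed here, and the name `per_cluster_streams` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Address = std::uint64_t;
using ReferenceStream = std::vector<Address>;

// Build one local reference stream per core cluster by serializing the
// blocks assigned to it. No merging into aggregate blocks takes place.
// Assumption: round-robin scheduling, block b -> cluster b % num_clusters.
std::vector<ReferenceStream> per_cluster_streams(
        const std::vector<ReferenceStream>& blocks, std::size_t num_clusters) {
    assert(num_clusters > 0);
    std::vector<ReferenceStream> clusters(num_clusters);
    for (std::size_t b = 0; b < blocks.size(); ++b) {
        ReferenceStream& local = clusters[b % num_clusters];
        local.insert(local.end(), blocks[b].begin(), blocks[b].end());
    }
    return clusters;
}
```

Each returned stream would then be characterized independently. Swapping the modulo rule for another policy changes which blocks share a cluster, and therefore which reuse is visible within it.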


Since the number of load/store units is limited, not all threads can issue memory requests simultaneously. Therefore, more memory requests will be issued, which increases the size of the reference stream of each block and, in turn, of the overall stream of the blocks assigned to a core cluster. This will have a significant impact on the data reuse characteristic and on the length of the histogram itself. In general terms, it will modify the way the application reuses data.
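One way to picture how a limited number of load/store units lengthens the reference stream is the C++ sketch below. It assumes, hypothetically, that the addresses requested by one warp-level memory instruction are split into consecutive groups of at most `lsu_count` threads and that each group becomes a separate entry in the stream. The function name `split_warp_access` and this grouping rule are illustrative assumptions, not the framework's documented behavior.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Address = std::uint64_t;

// Split the per-thread addresses of one warp-level memory instruction into
// issue groups of at most lsu_count threads (assumption: threads are served
// in thread-index order, one group per cycle).
std::vector<std::vector<Address>> split_warp_access(
        const std::vector<Address>& thread_addresses, std::size_t lsu_count) {
    assert(lsu_count > 0);
    std::vector<std::vector<Address>> groups;
    for (std::size_t first = 0; first < thread_addresses.size();
         first += lsu_count) {
        std::size_t last = std::min(first + lsu_count, thread_addresses.size());
        groups.emplace_back(
            thread_addresses.begin() + static_cast<std::ptrdiff_t>(first),
            thread_addresses.begin() + static_cast<std::ptrdiff_t>(last));
    }
    return groups;
}
```

With 32 threads and 16 load/store units, one memory instruction yields two groups instead of one, so under this assumption the block's reference stream roughly doubles in length.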

6.6 Limited resources, ‘K’ blocks modeled as executing in parallel, inter-core cluster

The sixth and final scenario of the data reuse characteristic analysis is identical to Scenario 5 except for one fundamental difference: the blocks executing in parallel are merged into aggregate blocks. Figure 16 illustrates this case. The execution is modeled as if the series of resulting aggregate blocks were executing in a core cluster whose total number of load/store units is the sum of the load/store units present in all clusters in the system. This analysis captures the data reuse behavior that could be taken advantage of by an ideal memory shared among all the threads of all clusters.

[Figure 16 diagram labels: Block Scheduler; AGGREGATE BLOCK 0 through AGGREGATE BLOCK N/K-1; 2nd Level Scheduler; NoC; On-Chip Mem. Sys.; Mem. Controller]

Figure 16: Scenario 6 of the data reuse characterization. The analysis is performed over the aggregate of the reference streams in every core cluster. The number of load/store units is the total sum across the core clusters.

The purpose of all these analyses is to obtain a quantified representation of the reuse characteristic under different parallelism constraints. The amount of parallelism that SIMT processors can exploit is what fundamentally differentiates them from more conventional processors. This is coupled with a specific programming model. When the amount of parallelism that the processor can handle changes, the processor interacts differently with the application: fewer or more threads can execute concurrently, occupancy varies, coalescing varies, and locality characteristics change, as do the actual resource limitations. All this causes changes in overall running time and memory performance. The resulting performance depends on many variables, and it is difficult to build a model of an application's performance just by observing the way it varies.

By providing a way to characterize the data reuse behavior of an SIMT application, it is possible to gain further insight into the resulting performance and into ways to predict it. Scenario 3 provides a particular data reuse characteristic given the kernel code. This same characteristic will be reproduced if the SIMT processor has enough resources to exploit all the parallelism needed by the kernel. However, as kernels use bigger data sets and have larger reference streams, reaching this condition might not be a practical goal. Nevertheless, the ideal data reuse characteristic is adequate to assess the positive or negative impact that different parallelism constraints have on the reuse characteristic. This assessment is isolated from other factors that affect performance, such as the capabilities for coalescing or the memory subsystem.

When code tuning is performed, or architectural enhancements are added to the processor, developers proceed in view of the architecture and the programming model. A code tuning technique to improve data coalescing, for example, can also have an impact on bank-conflict avoidance, contention avoidance, and the way the schedulers issue instructions, which in turn will have other effects in different parts of the architecture. Detailing these cascading effects is particularly difficult given the cross-relations between them.

Now that the analyses have been detailed, the next section explains the implementation of the experimental framework.
