Our design flow adopts a strict multistage pipeline model, as shown in Figure 3-1.
The stream starts and ends in shared memory, and only the first and last processors can access shared memory. Each SPE can access only the local store (LS) of its predecessor or successor. If a task on an SPE is to be migrated for load balancing, only the previous or the next SPE can be chosen as the destination. If the performance gain after load balancing is smaller than estimated, we add another available SPE at the start or the end of the pipeline.
Figure 3-1 Strict Multistage Pipeline Model
The strict multistage pipeline model greatly simplifies the design space of task migration.
Choosing the destination processor for a migrating task is complicated: the more cores are used to parallelize the application, the more migration choices there are. The strict multistage pipeline model limits the choice to the predecessor and the successor, so there are never more than two candidates to consider, regardless of the number of cores. The data flow between the SPEs and shared memory is also much simpler in the strict multistage pipeline model. Without this restriction, the data flow between shared memory and the local stores (LS) would become very complicated after several iterations of task migration, as shown in Figure 3-2. In the strict multistage pipeline model, the data flow does not grow more complex as migrations accumulate.
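The reduction in migration choices can be illustrated with a minimal sketch. The function names `migration_targets` and `unrestricted_targets` are hypothetical and not part of our implementation; stages are numbered 0 to `num_stages - 1` in pipeline order, and the sketch assumes that a stage at either end of the pipeline has only one existing neighbor.

```python
def migration_targets(stage: int, num_stages: int) -> list[int]:
    """Stages that a task on `stage` may migrate to in the strict model."""
    if not 0 <= stage < num_stages:
        raise ValueError("stage out of range")
    # Only the immediate predecessor and successor are candidates.
    return [s for s in (stage - 1, stage + 1) if 0 <= s < num_stages]


def unrestricted_targets(stage: int, num_stages: int) -> list[int]:
    """For comparison: without the restriction, every other stage is a candidate."""
    return [s for s in range(num_stages) if s != stage]


if __name__ == "__main__":
    print(migration_targets(3, 8))          # neighbors of stage 3
    print(len(unrestricted_targets(3, 8)))  # grows with the number of stages
```

The number of candidates in the restricted case stays bounded by two as `num_stages` grows, while the unrestricted case grows linearly.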
Figure 3-2 Data Flow would be Complicated after Several Migrations
Figure 3-3 Design Flow
Figure 3-3 briefly describes our design flow, which is based on the strict multistage pipeline model. The input of the design flow is the application source code. First, we optimize the computation of the application kernels with conventional techniques, including algebraic simplification, SIMD vectorization, loop unrolling, and software pipelining. Thorough local optimization remains important in multicore programming. The optimization effort at this stage influences the result of every later step in the design flow.
Offloading kernels to the SPEs incurs communication overhead. In memory-intensive kernels, such as motion compensation in H.264 decoding, communication overhead is comparable to computation time. Hiding DMA latency is therefore an important issue in SPE programming, and to hide it well we must estimate DMA overhead as precisely as possible. In addition, the computation time of a kernel may differ between the PPE and the SPE because their architectures are different in nature: the SPE is designed for high-speed computation but handles branches poorly, whereas the PPE is the opposite. We therefore profile on the SPE to analyze each kernel's workload, including both computation and communication time.
In the workload analysis on the SPE, we divide communication overhead into two kinds.
The first is DMA issue time: the SPU spends additional cycles issuing DMA commands to the MFC. The second is DMA wait time, the interval between the MFC receiving a DMA command and completing it. DMA wait time depends on the input/output data size and on whether the data addresses in main memory are contiguous. Because the analysis is performed after computation optimization, it yields a much more precise computation/communication ratio for each kernel that will be offloaded to the SPEs.
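The quantities collected during workload analysis can be recorded as in the following sketch. The class `KernelProfile`, its field names, and the cycle counts in the usage example are hypothetical and chosen only for illustration; they are not measured values.

```python
from dataclasses import dataclass


@dataclass
class KernelProfile:
    name: str
    compute_cycles: int    # SPU computation time
    dma_issue_cycles: int  # cycles the SPU spends issuing DMA commands to the MFC
    dma_wait_cycles: int   # cycles from the MFC receiving a command to completing it

    @property
    def communication_cycles(self) -> int:
        # Communication overhead is the sum of the two kinds of DMA overhead.
        return self.dma_issue_cycles + self.dma_wait_cycles

    @property
    def comp_comm_ratio(self) -> float:
        # A kernel with no DMA traffic is treated as purely compute bound.
        if self.communication_cycles == 0:
            return float("inf")
        return self.compute_cycles / self.communication_cycles


if __name__ == "__main__":
    # Hypothetical numbers for a memory-intensive kernel.
    mc = KernelProfile("motion_compensation", 1000, 100, 900)
    print(mc.communication_cycles, mc.comp_comm_ratio)
```

A ratio near or below 1 marks a kernel whose communication time is comparable to its computation time.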
After workload analysis, we allocate kernels to SPEs according to its results.
We first roughly estimate the number of SPEs needed to meet the performance constraint and the communication time that can be hidden, and then start the allocation. We provide task allocation guidelines for solving this NP-complete problem efficiently. The steps of task allocation are described in chapter 3.2.
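A rough version of this estimate and of an initial allocation can be sketched as follows. This is a simple greedy heuristic given only for illustration, not the allocation guidelines of chapter 3.2. The function names and the kernel times are hypothetical, and the sketch assumes that the kernels form a chain that is cut into contiguous groups, one group per pipeline stage.

```python
import math


def estimate_num_spes(kernel_times, target_stage_time):
    """Lower bound on the number of stages if no stage may exceed the target time."""
    return math.ceil(sum(kernel_times) / target_stage_time)


def allocate_contiguous(kernel_times, num_stages):
    """Greedily cut the kernel chain into at most `num_stages` contiguous groups.

    Returns a list of stages, each a list of kernel indices in pipeline order.
    """
    target = sum(kernel_times) / num_stages  # ideal load per stage
    stages = [[]]
    load = 0.0
    for i, t in enumerate(kernel_times):
        # Open a new stage when this kernel would overshoot the ideal load,
        # as long as unused stages remain.
        if stages[-1] and load + t > target and len(stages) < num_stages:
            stages.append([])
            load = 0.0
        stages[-1].append(i)
        load += t
    return stages


if __name__ == "__main__":
    times = [4, 3, 2, 5, 1, 3]  # hypothetical per-kernel times
    n = estimate_num_spes(times, target_stage_time=6)
    print(n, allocate_contiguous(times, n))
```

The greedy cut can leave the last stage overloaded, which is one reason a later balancing step is still needed.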
After the kernels are allocated on the strict multistage pipeline model, we apply MFC-aware scheduling on each SPE. MFC-aware scheduling runs the MFC and the SPU in parallel to hide DMA latency. We also provide guidelines for this NP-complete problem. MFC-aware scheduling is described in detail in chapter 3.3.
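The benefit of running the MFC and the SPU in parallel can be estimated with an idealized model. The sketch below is not the scheduling algorithm of chapter 3.3; it assumes simple double buffering in which each data block needs one input DMA transfer of fixed length followed by a fixed amount of computation, and the function names and numbers are hypothetical.

```python
def serial_time(n_blocks: int, compute: int, dma: int) -> int:
    """Total time when the SPU waits for every DMA transfer before computing."""
    return n_blocks * (compute + dma)


def overlapped_time(n_blocks: int, compute: int, dma: int) -> int:
    """Total time when the transfer of block i+1 overlaps the computation of block i."""
    if n_blocks == 0:
        return 0
    # The first transfer and the last computation cannot be overlapped;
    # in between, each step costs the longer of the two activities.
    return dma + (n_blocks - 1) * max(compute, dma) + compute


if __name__ == "__main__":
    for compute, dma in [(10, 6), (6, 10)]:
        s = serial_time(4, compute, dma)
        o = overlapped_time(4, compute, dma)
        print(compute, dma, s, o, s - o)  # last column: latency hidden
```

In this model the latency hidden per block is the smaller of the computation and DMA times, so the gain depends on the computation/communication ratio of the kernels on each SPE.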
After MFC-aware scheduling, the amount of latency hidden differs from SPE to SPE because the computation/communication ratio of each SPE is different. As a result, MFC-aware scheduling may unbalance the workload among SPEs, so we perform task migration afterwards to rebalance the workload among all processors. We adopt an iterative task migration, which is described in detail in chapter 3.4.
After task migration, the task allocation is determined. If the performance is still far from what was expected, or the load remains unbalanced with no prospect of improvement, the tasks must be repartitioned into finer granularity and the design flow rerun from workload analysis.
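One possible form of an iterative migration under the strict pipeline restriction is sketched below. It is a greedy illustration and not the procedure of chapter 3.4. The function name and the kernel times are hypothetical, and the sketch assumes that only the kernel at the boundary of the most loaded stage may move to the adjacent stage, so that the kernel order along the pipeline is preserved.

```python
def migrate_until_balanced(stages, max_iters=100):
    """stages: list of lists of kernel times, in pipeline order.

    Repeatedly moves one boundary kernel from the most loaded stage to its
    predecessor or successor while that lowers the load of the pair.
    """
    stages = [list(s) for s in stages]  # work on a copy
    for _ in range(max_iters):
        loads = [sum(s) for s in stages]
        src = max(range(len(stages)), key=loads.__getitem__)
        best = None  # (resulting load of the pair, destination stage)
        for dst in (src - 1, src + 1):
            if not 0 <= dst < len(stages) or len(stages[src]) < 2:
                continue
            # The predecessor receives the first kernel, the successor the last.
            t = stages[src][0] if dst < src else stages[src][-1]
            new_max = max(loads[src] - t, loads[dst] + t)
            if new_max < loads[src] and (best is None or new_max < best[0]):
                best = (new_max, dst)
        if best is None:
            break  # no neighboring move improves the balance
        dst = best[1]
        if dst < src:
            stages[dst].append(stages[src].pop(0))
        else:
            stages[dst].insert(0, stages[src].pop())
    return stages


if __name__ == "__main__":
    print(migrate_until_balanced([[4, 3], [2], [5, 1, 3]]))
```

When the loop stops without reaching an acceptable balance, the kernels are too coarse for further migration to help.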
After task migration, the strict multistage pipeline model is generalized as shown in Figure 3-4. The original data flow is restricted by the strict multistage pipeline model, but in fact not all parameters need to pass through the preceding SPEs; some can be accessed directly from shared memory. This stage also reduces the local store (LS) usage of each SPE, and the freed local store can be used for buffering. The buffering strategy is described in chapter 3.5.
Figure 3-4 Generalized Strict Multistage Pipeline Model