Simulation Results - Compiler Optimization for Reducing Leakage Power in Multithread BSP Programs

To verify our proposed MTPGA algorithm and PPG mechanism, we focused on investigating component utilization in the supersteps. We report two sets of simulation results: one for random TFGs and the other for BSP programs converted from OpenCL kernels. Each set of results compares three types of experiments: (1) no power-gating mechanism (baseline), (2) CADFA with a conventional power-gating mechanism from a previous work [You et al. 2002, 2006], and (3) MTPG with the PPG mechanism.

We first generated random TFGs and applied small programs as thread fragments.

Random TFGs were generated using GGen, a random graph generator for scheduling simulations [Cordeiro et al. 2010]. The generation method was a slightly modified version of the fan-in/fan-out method. We added a parameter for the layer size to control the shape of the generated graphs, where a layer is a set of nodes without edges among them. We generated random edges between adjacent layers only, which forced the generated graphs to fit the D-BSP communication rule. In addition, a label-swapping phase was added immediately before generating the graph to increase the randomness of thread fragments. Each node in the generated TFGs was mapped to a floating-point DSPstone [Zivojnovic et al. 1994] program. The random TFGs were all DAGs and were generated with the following parameters:

—number of nodes: the number of thread fragments in the graph;

—out-degree: the out-degree of each node controls the number of successors of a thread fragment;

—in-degree: the in-degree of each node controls the number of predecessors of a thread fragment;

—number of layers: the number of layers in the graph;

—size of layer: the number of nodes in a layer controls the size of hardware threads.
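The parameters above can be illustrated with a minimal sketch of a layered random DAG generator. This is not GGen and not the authors' modified fan-in/fan-out method; the function name `layered_tfg`, its arguments, and the simplified label shuffle at the end are assumptions made only for illustration. The sketch keeps the two properties stated in the text: nodes inside a layer share no edges, and edges connect adjacent layers only.

```python
import random


def layered_tfg(num_layers, layer_size, max_out_degree, max_in_degree, seed=0):
    """Build a layered DAG whose edges only connect adjacent layers.

    Hypothetical stand-in for the generator described in the text.
    """
    rng = random.Random(seed)
    # Node ids are assigned layer by layer; a layer has no internal edges.
    layers = [
        [layer * layer_size + i for i in range(layer_size)]
        for layer in range(num_layers)
    ]
    in_degree = {node: 0 for layer in layers for node in layer}
    edges = set()
    for upper, lower in zip(layers, layers[1:]):
        for src in upper:
            # Only next-layer nodes with spare in-degree can be successors.
            candidates = [dst for dst in lower if in_degree[dst] < max_in_degree]
            if not candidates:
                continue
            rng.shuffle(candidates)
            fanout = rng.randint(1, max_out_degree)
            for dst in candidates[:fanout]:
                edges.add((src, dst))
                in_degree[dst] += 1
    # Simplified label swapping: permute which program label each node carries.
    labels = list(in_degree)
    rng.shuffle(labels)
    label_of = dict(zip(in_degree, labels))
    return layers, sorted(edges), label_of
```

For example, `layered_tfg(4, 2, 2, 2)` yields a graph with four layers of two nodes each, which corresponds to a machine with two hardware threads under the layer-size interpretation given above.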

Table VI. Normalized Total Energy Consumptions of Randomly Generated TFGs for Setting A with Leakage Contribution Set to 10% and 30% (see Table V), Categorized by the Number of MHP Regions for Cases with Two Hardware Threads

Leakage contribution set to 10%

number of MHP regions  method    dynamic  leakage^a  leakage^b  overhead  total
1                      baseline  52.55%   12.35%     35.10%     0.00%     100.00%
1                      CADFA     52.66%   2.74%      36.91%     9.14%     101.46%
1                      MTPG      52.57%   9.96%      35.26%     0.45%     98.24%
2                      baseline  50.60%   12.86%     36.54%     0.00%     100.00%
2                      CADFA     50.70%   2.73%      38.31%     7.63%     99.36%
2                      MTPG      50.63%   7.72%      36.90%     0.89%     96.14%
3                      baseline  49.80%   13.07%     37.13%     0.00%     100.00%
3                      CADFA     49.89%   2.74%      38.90%     7.20%     98.72%
3                      MTPG      49.83%   6.03%      37.76%     1.33%     94.96%
4                      baseline  48.70%   13.35%     37.94%     0.00%     100.00%
4                      CADFA     48.80%   2.73%      39.85%     6.40%     97.77%
4                      MTPG      48.75%   4.82%      38.81%     1.71%     94.09%
5                      baseline  47.64%   13.63%     38.73%     0.00%     100.00%
5                      CADFA     47.73%   2.70%      40.71%     5.68%     96.83%
5                      MTPG      47.70%   4.08%      40.01%     2.11%     93.90%

Leakage contribution set to 30%

number of MHP regions  method    dynamic  leakage^a  leakage^b  overhead  total
1                      baseline  22.52%   20.15%     57.33%     0.00%     100.00%
1                      CADFA     22.56%   4.50%      60.29%     15.28%    102.63%
1                      MTPG      22.53%   16.26%     57.60%     0.73%     97.12%
2                      baseline  21.11%   20.52%     58.37%     0.00%     100.00%
2                      CADFA     21.15%   4.36%      61.18%     12.39%    99.08%
2                      MTPG      21.12%   12.35%     58.95%     1.43%     93.84%
3                      baseline  20.56%   20.66%     58.78%     0.00%     100.00%
3                      CADFA     20.60%   4.33%      61.56%     11.55%    98.05%
3                      MTPG      20.58%   9.55%      59.77%     2.09%     91.99%
4                      baseline  19.85%   20.84%     59.30%     0.00%     100.00%
4                      CADFA     19.89%   4.26%      62.26%     10.13%    96.55%
4                      MTPG      19.87%   7.56%      60.64%     2.67%     90.74%
5                      baseline  19.22%   21.01%     59.77%     0.00%     100.00%
5                      CADFA     19.26%   4.18%      62.78%     8.89%     95.11%
5                      MTPG      19.25%   6.34%      61.68%     3.24%     90.52%

^a Leakage energy consumed by power-gateable units.
^b Leakage energy consumed by other units.

Fig. 12. Selected best cases of randomly generated TFGs.

Table VII. Mapping from Node Labels to Benchmark Programs (columns: label, benchmarks)

The energy consumption results for the parameter settings in Table V are listed in Tables VI, VIII, IX, and X. We used 10,756 graph instances to evaluate all settings. All results are normalized to the situation without a power-gating mechanism. The total energy consumption is divided into four categories: (1) the dynamic energy dissipated by the processor, (2) the leakage energy dissipated by power-gateable units, (3) the leakage energy dissipated by the entire processor except for power-gateable units, and (4) the overhead due to extra power-gating instructions. The overhead includes the energy consumed by power-gating instructions, the energy consumed due to the latency caused by powering on components that have been incorrectly powered off, and the energy consumed by the predicated power-gating controller. Settings A and B are for machines equipped with two hardware threads, while Setting C is for those equipped with four hardware threads. With the leakage contribution set to 10%, MTPG reduced the total energy consumption for Settings A, B, and C to 93.90%, 93.32%, and 95.12%, respectively, relative to the baseline (i.e., no power-gating mechanism). With the leakage contribution set to 30%, MTPG reduced the total energy consumption for the three settings to 90.52%, 89.51%, and 91.94%, respectively, relative to the baseline.
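As a quick consistency check of the four-category breakdown, the sketch below sums the components of one group of rows from Table VI (Setting A, five MHP regions, leakage contribution set to 10%). The `EnergyBreakdown` class and its field names are illustrative assumptions and are not part of the simulator; small differences from the printed totals are expected because the table entries are rounded.

```python
from dataclasses import dataclass


@dataclass
class EnergyBreakdown:
    """Normalized energy components, in percent of the baseline total."""
    dynamic: float
    leakage_gateable: float  # leakage of power-gateable units
    leakage_other: float     # leakage of all other units
    overhead: float          # cost of extra power-gating instructions

    def total(self):
        return (self.dynamic + self.leakage_gateable
                + self.leakage_other + self.overhead)


# Values copied from Table VI (Setting A, five MHP regions, 10% leakage).
baseline = EnergyBreakdown(47.64, 13.63, 38.73, 0.00)
cadfa = EnergyBreakdown(47.73, 2.70, 40.71, 5.68)
mtpg = EnergyBreakdown(47.70, 4.08, 40.01, 2.11)

for name, row in [("baseline", baseline), ("CADFA", cadfa), ("MTPG", mtpg)]:
    print(f"{name}: {row.total():.2f}%")
```

The sums agree with the totals in the table to within rounding, which confirms that the total column is simply the sum of the four categories.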

Figure 12 demonstrates six TFGs representing the best cases for Setting A. As mentioned previously, the graph nodes in the figure are thread fragments and graph edges are dependencies between thread fragments. The color of a node represents the MHP region the node belongs to, which is computed via Algorithm 4; nodes with the same color belong to the same MHP region. There are three MHP regions in Figures 12(a), 12(b), 12(c), and 12(e), four MHP regions in Figure 12(d), and two MHP regions in Figure 12(f). The line styles of the node borders represent their types in an MHP region: dashed lines indicate entry thread fragments, dotted lines indicate exit thread fragments, and solid lines indicate both entry and exit thread fragments.

The type of nodes is used to evaluate the insertion of power-gating instructions using Algorithm 5. Each node is labeled with a unique number that maps to a program.

Table VII lists the mapping from the node labels to DSPstone programs. Note that each DSPstone program is mapped to two labels for samples with two hardware threads in order to generate random graphs covering both heterogeneous and homogeneous cases; for samples with four hardware threads, each DSPstone program is mapped to four labels for the same reason. With the leakage contribution set to 30%, MTPG reduced the total energy consumption by an average of about 11% in the cases shown in Figure 12, namely to 87.79%, 88.15%, 88.27%, 88.39%, 88.59%, and 88.78% relative to the baseline for the cases shown in Figures 12(a) to 12(f), respectively.

Table VIII. Normalized Total Energy Consumptions of Randomly Generated TFGs for Setting B, Leakage Contribution Set to 10%

number of MHP regions  method    dynamic  leakage^a  leakage^b  overhead  total
1                      baseline  54.79%   11.77%     33.44%     0.00%     100.00%
1                      CADFA     54.89%   2.70%      35.13%     10.30%    103.02%
1                      MTPG      54.80%   10.54%     33.53%     0.26%     99.13%
2                      baseline  51.23%   12.69%     36.07%     0.00%     100.00%
2                      CADFA     51.33%   2.68%      37.71%     7.62%     99.34%
2                      MTPG      51.25%   9.67%      36.25%     0.56%     97.73%
3                      baseline  51.26%   12.69%     36.05%     0.00%     100.00%
3                      CADFA     51.36%   2.68%      37.67%     7.74%     99.45%
3                      MTPG      51.29%   8.42%      36.32%     0.76%     96.79%
4                      baseline  51.20%   12.70%     36.10%     0.00%     100.00%
4                      CADFA     51.29%   2.69%      37.65%     7.53%     99.16%
4                      MTPG      51.23%   6.86%      36.45%     1.01%     95.55%
5                      baseline  50.64%   12.85%     36.51%     0.00%     100.00%
5                      CADFA     50.73%   2.69%      38.20%     7.29%     98.91%
5                      MTPG      50.67%   6.06%      37.04%     1.26%     95.04%
6                      baseline  50.45%   12.90%     36.65%     0.00%     100.00%
6                      CADFA     50.54%   2.64%      38.24%     6.99%     98.41%
6                      MTPG      50.49%   5.46%      37.26%     1.47%     94.69%
7                      baseline  48.05%   13.52%     38.43%     0.00%     100.00%
7                      CADFA     48.15%   2.67%      40.13%     5.56%     96.50%
7                      MTPG      48.10%   4.77%      39.13%     1.65%     93.65%
8                      baseline  48.75%   13.34%     37.91%     0.00%     100.00%
8                      CADFA     48.83%   2.72%      39.37%     5.29%     96.21%
8                      MTPG      48.79%   4.50%      38.60%     1.42%     93.32%

^a Leakage energy consumed by power-gateable units.
^b Leakage energy consumed by other units.

Table VI lists the energy consumption results for Setting A with different numbers of MHP regions. With the leakage contribution set to 10%, the total energy consumption with MTPG was 98.24%, 96.14%, 94.96%, 94.09%, and 93.90% for one to five MHP regions, respectively; with the leakage contribution set to 30%, it was 97.12%, 93.84%, 91.99%, 90.74%, and 90.52% for one to five MHP regions, respectively. The results indicate that the energy consumption of the random samples decreased as the number of MHP regions increased, with the trend stabilizing for more than four MHP regions. As indicated in Table VI, while CADFA results in less leakage energy in power-gateable units (about 30% energy consumption relative to MTPG), it suffers the overhead of traditional power-gating instructions (about 11× the energy consumption relative to MTPG). This overhead is mostly due to the additional cycles required to internally turn on components that are incorrectly turned off. The leakage energy consumed by units other than the power-gateable ones increases in both CADFA and MTPG because the extra power-gating instructions affect instruction fetching, which results in more execution cycles than when power-gating instructions are not present and thus increases the leakage energy.
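The stabilizing trend reported for Setting A can be made explicit by computing the additional saving obtained from each extra MHP region. The sketch below uses only the MTPG totals quoted in the text for Table VI; the helper name `marginal_gains` is an assumption introduced for illustration.

```python
# MTPG totals (percent of baseline) for one to five MHP regions, Setting A.
totals_10 = [98.24, 96.14, 94.96, 94.09, 93.90]  # leakage contribution 10%
totals_30 = [97.12, 93.84, 91.99, 90.74, 90.52]  # leakage contribution 30%


def marginal_gains(totals):
    """Extra saving (percentage points) gained by each additional MHP region."""
    return [round(prev - cur, 2) for prev, cur in zip(totals, totals[1:])]


print(marginal_gains(totals_10))
print(marginal_gains(totals_30))
```

In both configurations the step from four to five MHP regions contributes far less than the earlier steps (about 0.2 percentage points), which is consistent with the observation that the trend stabilizes beyond four MHP regions.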

Table IX (Setting B, leakage contribution set to 30%; rows for three to eight MHP regions)

number of MHP regions  method    dynamic  leakage^a  leakage^b  overhead  total
3                      baseline  21.52%   20.41%     58.07%     0.00%     100.00%
3                      CADFA     21.56%   4.32%      60.67%     12.60%    99.14%
3                      MTPG      21.53%   13.55%     58.50%     1.23%     94.81%
4                      baseline  21.45%   20.43%     58.12%     0.00%     100.00%
4                      CADFA     21.49%   4.32%      60.62%     12.21%    98.65%
4                      MTPG      21.46%   11.04%     58.69%     1.62%     92.81%
5                      baseline  21.09%   20.52%     58.39%     0.00%     100.00%
5                      CADFA     21.13%   4.30%      61.07%     11.76%    98.26%
5                      MTPG      21.10%   9.69%      59.23%     2.02%     92.05%
6                      baseline  20.93%   20.56%     58.51%     0.00%     100.00%
6                      CADFA     20.97%   4.21%      61.05%     11.22%    97.45%
6                      MTPG      20.95%   8.71%      59.49%     2.35%     91.49%
7                      baseline  19.41%   20.96%     59.63%     0.00%     100.00%
7                      CADFA     19.45%   4.14%      62.23%     8.64%     94.46%
7                      MTPG      19.43%   7.40%      60.71%     2.56%     90.09%
8                      baseline  19.79%   20.86%     59.35%     0.00%     100.00%
8                      CADFA     19.82%   4.25%      61.64%     8.28%     93.99%
8                      MTPG      19.80%   7.04%      60.45%     2.22%     89.51%

^a Leakage energy consumed by power-gateable units.
^b Leakage energy consumed by other units.

Tables VIII and IX list the energy consumption results for Setting B categorized by the number of MHP regions. Compared to Setting A, Setting B generates TFGs with more thread fragments and more layers. The best energy-saving result for Setting B with MTPG was 89.51% energy consumption relative to no power-gating mechanism, obtained with eight MHP regions and the leakage contribution set to 30%. Similar to the experimental results for Setting A, the trend stabilizes when there are more than five MHP regions. Table X lists the energy consumption results for Setting C categorized by the number of MHP regions. Setting C generates TFGs for hardware equipped with four hardware threads. The best energy-saving result for Setting C with MTPG was 91.94% energy consumption relative to no power-gating mechanism, obtained with five MHP regions and the leakage contribution set to 30%. The CADFA results indicate how much overhead energy can be consumed by incorrectly inserted power-gating instructions. With the traditional CADFA method, the samples with one MHP region consumed 117.43% energy relative to no power-gating mechanism with the leakage contribution set to 30%. These results reveal that our method is practical for both hardware configurations.

We next examined our method using BSP benchmarks. We used four BSP programs from BSPedupack, a library of numerical algorithms written in C according to the BSP model [Bisseling 2004]: fft, inprod, lu, and matvec.

Figure 13 shows the energy consumption normalized to the baseline case with no power-gating mechanism. With the leakage contribution set to 10%, the average reduction in total energy consumption was 4.32%; it was largest for matvec (mv, 7.52%) and smallest for lu (2.27%). With the leakage contribution set to 30%, the average reduction in total energy consumption was 8.32%; it was largest for matvec (mv, 13.23%) and smallest for lu (5.30%).

We then evaluated our method using BSP programs from OpenCL-based kernels.

Table X. Normalized Total Energy Consumptions of Randomly Generated TFGs with Setting C

Leakage contribution set to 10%

number of MHP regions  method    dynamic  leakage^a  leakage^b  overhead  total
1                      baseline  62.23%   9.83%      27.94%     0.00%     100.00%
1                      CADFA     62.35%   2.55%      29.14%     15.16%    109.19%
1                      MTPG      62.25%   8.33%      28.04%     0.34%     98.95%
2                      baseline  58.90%   10.70%     30.40%     0.00%     100.00%
2                      CADFA     59.01%   2.57%      31.59%     12.74%    105.91%
2                      MTPG      58.93%   7.25%      30.60%     0.72%     97.50%
3                      baseline  55.87%   11.49%     32.64%     0.00%     100.00%
3                      CADFA     55.97%   2.60%      33.89%     10.96%    103.42%
3                      MTPG      55.91%   6.11%      33.04%     1.19%     96.25%
4                      baseline  55.85%   11.49%     32.65%     0.00%     100.00%
4                      CADFA     55.96%   2.67%      34.04%     10.05%    102.73%
4                      MTPG      55.89%   5.27%      33.19%     1.10%     95.45%
5                      baseline  52.79%   12.29%     34.92%     0.00%     100.00%
5                      CADFA     52.90%   2.67%      36.23%     10.73%    102.53%
5                      MTPG      52.84%   4.00%      35.94%     2.34%     95.12%

Leakage contribution set to 30%

number of MHP regions  method    dynamic  leakage^a  leakage^b  overhead  total
1                      baseline  30.21%   18.15%     51.64%     0.00%     100.00%
1                      CADFA     30.26%   4.73%      53.87%     28.56%    117.43%
1                      MTPG      30.22%   15.40%     51.83%     0.62%     98.06%
2                      baseline  27.28%   18.91%     53.80%     0.00%     100.00%
2                      CADFA     27.34%   4.57%      55.90%     22.91%    110.72%
2                      MTPG      27.30%   12.85%     54.15%     1.28%     95.57%
3                      baseline  24.82%   19.55%     55.63%     0.00%     100.00%
3                      CADFA     24.86%   4.44%      57.75%     18.87%    105.93%
3                      MTPG      24.83%   10.40%     56.31%     2.03%     93.58%
4                      baseline  24.86%   19.54%     55.60%     0.00%     100.00%
4                      CADFA     24.91%   4.55%      57.96%     17.20%    104.61%
4                      MTPG      24.88%   9.00%      56.49%     1.87%     92.24%
5                      baseline  22.48%   20.16%     57.36%     0.00%     100.00%
5                      CADFA     22.53%   4.38%      59.51%     17.62%    104.04%
5                      MTPG      22.50%   6.57%      59.03%     3.84%     91.94%

^a Leakage energy consumed by power-gateable units.
^b Leakage energy consumed by other units.

OpenCL is an industry attempt to provide standards for GPGPU and heterogeneous multicore programming. An OpenCL program can be roughly divided into host code and kernel code, where the host code is executed on an MPU and the kernel code on OpenCL devices. OpenCL kernel code comprises concurrent threads with global barriers, making it easy to transform into BSP programs. We incorporated kernel serialization to avoid the threading overhead in parallel kernel execution and to handle synchronization for barriers in kernel functions. We adopted a work-item coalescing scheme [Lee et al. 2010] for kernel serialization, which serializes kernel execution by enclosing kernel functions within triply nested loops that iterate these kernel functions over a given index range. Each thread computed the workload of a work group. The sources of the OpenCL kernels were as follows: kernels DCT, DwtHaar1D, FastWalshTransform, Histogram, MatrixTranspose, Permute, PrefixSum, RadixSort, and SimpleConvolution are from the AMD OpenCL SDK, while kernel BP msg is an OpenCL implementation of the BP application.
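The idea of serializing a kernel with triply nested loops can be sketched as follows. This is a simplified illustration and not the scheme of Lee et al. [2010] in detail: it assumes the kernel body has already been split into phases at each barrier, and the names `run_work_group`, `load`, and `store` are hypothetical. Running every work-item through one phase before any work-item starts the next phase preserves the barrier semantics within a single thread.

```python
def run_work_group(phases, local_size, shared):
    """Run each barrier-delimited phase for all work-items, one after another."""
    size_x, size_y, size_z = local_size
    for phase in phases:
        # The triply nested loop replaces the concurrent work-items.
        for z in range(size_z):
            for y in range(size_y):
                for x in range(size_x):
                    phase((x, y, z), shared)
        # Leaving the loop nest plays the role of the barrier.


def reverse_in_group(data):
    """Toy kernel: reverse an array through a scratch buffer and a barrier."""
    n = len(data)
    shared = {"data": list(data), "tmp": [None] * n}

    def load(item, mem):   # code before the barrier
        x, _, _ = item
        mem["tmp"][x] = mem["data"][x]

    def store(item, mem):  # code after the barrier
        x, _, _ = item
        mem["data"][x] = mem["tmp"][n - 1 - x]

    run_work_group([load, store], (n, 1, 1), shared)
    return shared["data"]


print(reverse_in_group([1, 2, 3, 4]))
```

If `load` and `store` were fused into one loop body without the phase split, later work-items would read scratch entries that had not yet been written, which is why the barrier has to become a loop boundary.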

Fig. 13. Normalized total energy consumptions of BSP programs from BSPedupack.

Fig. 14. Normalized total energy consumptions of BSP programs from OpenCL kernels on four-way SMT system with leakage contribution set to 10%.

Figures 14 through 19 show our experimental results for BSP programs from OpenCL-based kernels. Figures 14 through 17 show the energy consumption normalized to the baseline case with no power-gating mechanism under different experimental parameters, including the leakage contribution and the number of SMT threads. For a four-way SMT architecture, Figures 14 and 15 show the energy consumption with the leakage contribution set to 10% and 30%, respectively. With the leakage contribution set to 30%, the average reduction in total energy consumption was 10.09%; it was largest for DCT (10.84%) and smallest for Permute (9.20%). With the leakage contribution set to 10%, the average reduction in total energy consumption was 4.27%; it was largest for DCT (4.74%) and smallest for Permute (3.86%). The energy breakdown of the BSP programs from OpenCL kernels differs slightly from that for randomly generated D-BSP programs. With the leakage contribution set to 30%, the leakage energies dissipated by power-gateable units were 3.16% and 3.19% in CADFA and MTPG, respectively.

CADFA consumed nearly the same amount of leakage energy in power-gateable units as MTPG (about 99% energy consumption relative to MTPG), which explains why MTPG saves more energy in this setting than it does in randomly generated D-BSP programs. Figures 16 and 17 show the energy consumption in an experimental environment with eight-way SMT and the leakage contribution set to 10% and 30%, respectively. The experimental results show that our method can be applied to eight-way SMT architectures. As shown in Figure 17, while the energy consumption of the system with CADFA grew, the system with MTPG contained this growth and reduced the total energy by about 10% on average.

Fig. 15. Normalized total energy consumptions of BSP programs from OpenCL kernels on four-way SMT system with leakage contribution set to 30%.

Fig. 16. Normalized total energy consumptions of BSP programs from OpenCL kernels on eight-way SMT system with leakage contribution set to 10%.

The code sizes of OpenCL-based BSP programs relative to the baseline are shown in Figure 18. The comparison is based on the text section of a user program, excluding libraries and C runtime code that could not be analyzed in our experimental environment. The average increases in code size due to the insertion of power-gating instructions were about 13% (ranging from 10.17% to 15.25%) and 3% (ranging from 1.45% to 5.26%) with CADFA and MTPG, respectively. The number of power-gating instructions of CADFA was reduced to about 89% using MTPGA, which reveals that MTPG efficiently inserts power-gating instructions for multithread programs.

Fig. 17. Normalized total energy consumptions of BSP programs from OpenCL kernels on eight-way SMT system with leakage contribution set to 30%.

Fig. 18. Increases in code size.

Figure 19 shows the experimental results for OpenCL-based BSP programs with different configurations of the leakage contribution on a four-way SMT machine. MTPG reduced the total energy consumption by 4.28% to 18.54% as the leakage contribution increased from 10% to 90%; in contrast, CADFA consumed more energy (1.13× to 1.56×) than the baseline case of no power-gating mechanism. PPG and MTPG reduce leakage energy consumption by carefully managing the component status using predicated bits and appropriately inserting power-gating instructions. On the other hand, the large number of incorrect power-off instructions inserted by CADFA introduces many extra cycles spent waiting for components to be powered on internally, and this deteriorates further as the leakage contribution increases. These observations indicate that our technique is more effective than existing technologies at improving leakage control for BSP multithread programs.

7. DISCUSSION

Fig. 19. Normalized energy consumptions for different leakage contributions.

In this section, we discuss the impact of latency and the applicability of MTPGA to real hardware. Latency in processors affects execution time, which directly affects the result of power-gating optimization. Latencies in processors include pipelining latency and memory access latency. Pipelining latency is caused by pipeline hazards, where instructions are stalled because of structural hazards or unresolved data dependencies. Memory access latency is caused by the memory hierarchy, for example by cache misses. Pipelining latency and memory access latency are both discussed in traditional power-gating analyses for single-thread environments such as CADFA [You et al. 2006] and sink-n-hoist [You et al. 2005, 2007]. These methods analyze component usage with regard to the shortest latency, which guarantees that leakage energy would be reduced in any case. CADFA is a conservative method because it estimates the saved leakage energy with the worst case of power gating [You et al. 2006]. The proposed MTPGA, based on CADFA, considers both pipelining latency and memory access latency, as does CADFA. The latency of an instruction is considered with its minimal delay in our estimation; thus a multiply operation is considered with its shortest operation time and caches are considered perfect, which means that cache misses never occur. Nevertheless, MTPGA also conservatively estimates the inactive period in an MHP region with the worst case, namely the minimal thread execution time among threads. With this conservative estimation, the experimental results reveal that our method could save about 10% of the energy consumption of BSP programs (with the leakage contribution set to 30%); the energy reduction could be further improved by using more precise analyses if the memory access time could be modeled at compile time. When the instruction fetching policy changes in SMT, our method still applies because it estimates energy consumption with the worst case of concurrent threads, which guarantees that leakage energy would be reduced in any case. With a more precise performance analysis model for SMT, it is possible to further reduce the leakage energy. To apply our method to real systems or different processor architectures, one should update the estimation model with the designated latencies. Furthermore, one might be interested in incorporating varying latency analysis with the ILP estimation model [Li and Xue 2004] into a power model to improve the leakage energy savings.
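The conservative estimate described above can be sketched in a few lines. This is only an illustration under stated assumptions: the latency table `MIN_LATENCY`, the function names, and the `break_even_cycles` threshold (the idle length beyond which gating a component is assumed to pay off) are hypothetical and are not taken from the MTPGA cost model.

```python
# Hypothetical minimal latencies in cycles; real values depend on the target.
MIN_LATENCY = {"add": 1, "mul": 3, "load": 1}  # load assumes a perfect cache


def min_thread_cycles(ops):
    """Shortest possible execution time of one thread fragment."""
    return sum(MIN_LATENCY[op] for op in ops)


def conservative_idle_cycles(threads):
    """Worst-case idle period of an unused component in an MHP region.

    The estimate is the minimal execution time among the concurrent threads,
    so the real idle period is never shorter than the value returned here.
    """
    return min(min_thread_cycles(ops) for ops in threads)


def should_power_gate(threads, break_even_cycles):
    """Gate only if even the worst-case idle period exceeds the break-even."""
    return conservative_idle_cycles(threads) > break_even_cycles


# Two concurrent threads that never use the multiplier.
region = [["add"] * 10, ["load", "add"] * 3]
print(conservative_idle_cycles(region))
print(should_power_gate(region, break_even_cycles=5))
```

Because every latency is taken at its minimum and the shortest thread bounds the region, a decision to gate made this way stays safe when real executions take longer, at the cost of missing some gating opportunities.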

Our method is capable of dealing with out-of-order execution given certain hardware support [You et al. 2006]. With dynamic scheduling techniques, superscalar processors fetch a group of instructions and issue them concurrently with regard to the data dependences among them. This may break the arrangement of power-gating operations inserted in a thread by a sequential compiler if the dependence between power-gating instructions and normal instructions is not properly considered. To ensure that the inserted power-gating instructions are issued correctly, a power management controller could be implemented on chip to issue power-gating instructions at the correct time. The power management controller consists of a power direction buffer and component usage monitors. Once a power-gating instruction is decoded, the instruction dispatcher dispatches the instruction to the power direction buffer of the power management controller. The power management controller knows the component usage by monitoring instructions at reservation stations. When the power management controller detects that all instructions using the component are completed, it issues the power-gating instructions in the power direction buffer and turns off the component according to the power directive. Finally, the power management controller removes the power-gating instructions from the power direction buffer. In this way, the situation where an instruction finds its function unit turned off can be avoided, meaning that our approach can be applied to out-of-order machines.
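The behavior of the power management controller can be modeled with a small software sketch. It is a toy model of the description above, not a hardware design: the class and method names are assumptions, only power-off directives are handled, and the usage monitor is reduced to a counter of in-flight instructions per component.

```python
from collections import deque


class PowerManagementController:
    """Toy model: defer power-off until a component has no in-flight users."""

    def __init__(self, components):
        self.direction_buffer = deque()              # pending power-off directives
        self.in_flight = {c: 0 for c in components}  # component usage monitors
        self.powered_on = {c: True for c in components}

    def instruction_issued(self, component):
        """An instruction using the component enters a reservation station."""
        self.in_flight[component] += 1

    def instruction_completed(self, component):
        self.in_flight[component] -= 1
        self._drain()

    def dispatch_power_off(self, component):
        """A decoded power-gating instruction is placed in the buffer."""
        self.direction_buffer.append(component)
        self._drain()

    def _drain(self):
        # Apply every buffered directive whose component is no longer in use,
        # and keep the others in the buffer.
        remaining = deque()
        for component in self.direction_buffer:
            if self.in_flight[component] == 0:
                self.powered_on[component] = False
            else:
                remaining.append(component)
        self.direction_buffer = remaining


pmc = PowerManagementController(["fmul"])
pmc.instruction_issued("fmul")       # a multiply is still executing
pmc.dispatch_power_off("fmul")       # directive arrives early
print(pmc.powered_on["fmul"])        # still on: the directive is buffered
pmc.instruction_completed("fmul")
print(pmc.powered_on["fmul"])        # now off, buffer is empty
```

The directive waits in the buffer while the multiply is in flight and takes effect only after it completes, so no instruction finds its function unit turned off.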

8. CONCLUSION

This article has presented a foundational framework for compilation optimization that reduces the power consumption on SMT architectures. It has also presented PPG operations for improving the energy management of multithread programs in hierarchical BSP models. Based on a multithread component analysis with dataflow equations, our MTPGA framework estimates the energy usage of multithread programs and inserts PPG operations as power controls for energy management. Our preliminary experimental results on a system with the leakage contribution set to 30% show that using a system with PPG support and the MTPGA method reduced the total energy consumption by an average of 10.09% for BSP programs and by up to 10.49% for D-BSP programs relative to the system without a power-gating mechanism. On a system with the leakage contribution set to 10%, it reduced the total energy consumption by an average of 4.27% for BSP programs and by up to 6.68% for D-BSP programs. These results demonstrate that our mechanisms are effective in reducing the leakage power in hierarchical BSP multithread environments.

REFERENCES

R. Barik. 2005. Efficient computation of may-happen-in-parallel information for concurrent Java programs. In Proceedings of the 18th International Conference on Languages and Compilers for Parallel Computing (LCPC’05). Lecture Notes in Computer Science, vol. 4339, Springer, 152–169.

N. Bellas, I. N. Hajj, and C. D. Polychronopoulos. 2000. Architectural and compiler techniques for energy reduction in high-performance microprocessors. IEEE Trans. VLSI 8, 3, 317–326.

R. H. Bisseling. 2004. Parallel Scientific Computation: A Structured Approach using BSP and MPI. Oxford University Press.

J. A. Butts and G. S. Sohi. 2000. A static power model for architects. In Proceedings of the 33rd Annual
