Chapter 4 Experimental Results
4.3 Power and Performance Trade-Off
In the proposed techniques, the read-write aware throttling mechanism is the main contributor to the DRAM power reduction. For throttling based power reduction mechanisms, the throttle delay is critical to both the DRAM power consumption and the system performance.
A long throttle delay yields a larger power reduction because the DRAM ranks stay in the low power mode for longer. However, it also degrades the system performance more, because all the commands must wait longer before they are processed by the DRAM. A short throttle delay, on the other hand, allows the DRAM to process memory commands more frequently, but it limits the achievable power reduction.
In order to see how different throttle delays affect the performance of our work, an evaluation on different throttle delays is carried out. The evaluation uses all benchmark combinations fp1, fp2, fp3 and fp4. Since every benchmark combination has a different memory command pattern, we averaged the evaluation results of all four benchmark combinations. The average simulation results are shown in Table VIII. All the throttle delays are in CPU cycles.
The improvements are the differences between our work and the previous work [21].
Table VIII
Effect of different throttle delays

Throttle  Power reduction percentage                 System performance overhead
delay     Previous work [21]  Our work  Improvement  Previous work [21]  Our work  Improvement
100       64.67%              75.46%    10.79%       1.03%               0.61%     0.42%
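As a quick check of how the Improvement columns are derived, the following minimal Python sketch recomputes them from the 100-cycle row of Table VIII as plain differences between the two approaches. The helper name `improvement` and the dictionary layout are illustrative assumptions and are not part of the evaluation framework.

```python
# Values copied from the 100-cycle row of Table VIII (in percent).
previous = {"power_reduction": 64.67, "perf_overhead": 1.03}
ours = {"power_reduction": 75.46, "perf_overhead": 0.61}


def improvement(prev: dict, new: dict) -> dict:
    """Return the improvement in percentage points (hypothetical helper).

    A higher power reduction is better, so the gain is new - prev.
    A lower performance overhead is better, so the gain is prev - new.
    """
    return {
        "power_reduction": round(new["power_reduction"] - prev["power_reduction"], 2),
        "perf_overhead": round(prev["perf_overhead"] - new["perf_overhead"], 2),
    }


result = improvement(previous, ours)
print(result)
```

Both values agree with the Improvement entries of that row, which supports reading the columns as simple differences rather than relative changes.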
The evaluation results show that our work is stable across different throttle delays. It steadily reduces the DRAM power consumption by around 75%, which is an upper bound of the power reduction, and is around 10% better than the previous work. Moreover, the system performance degradation of our work is slightly smaller than that of the previous work. This shows that the read-write aware reordering mechanism effectively relieves the impact on the system performance. The evaluation results also show that, as the throttle delay increases, the difference in power reduction between our work and the previous work gets smaller. This is because, when the throttle delay is too large, all the DRAM ranks are in the low power mode most of the time. Therefore, the power consumption is low and the performance degradation is dramatic.
With the results in Table VIII, we can further illustrate the trade-off between power and performance. By varying the throttle delay, the evaluation shows how our work behaves at different levels of system performance degradation. The average results of all benchmark combinations are shown in Fig. 15, where both the power and the performance are normalized to the DRAM with no power management policy. The results show that our work has a better power and performance trade-off characteristic. Under the same system performance degradation, our work reduces the DRAM power by around 10% more than the previous work.
Fig. 15 Average power and performance trade-off characteristics on SPEC CPU2006 [31].
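To make the normalization concrete, the sketch below converts a power reduction percentage and a performance overhead percentage into the normalized coordinates of one trade-off point, where 1.0 corresponds to the DRAM with no power management policy. It assumes that the overhead is the fractional loss in performance relative to that baseline. The helper name `tradeoff_point` is hypothetical, and the inputs are the 100-cycle averages from Table VIII.

```python
def tradeoff_point(power_reduction_pct: float, perf_overhead_pct: float):
    """Map percentages to (normalized performance, normalized power).

    Assumption: both percentages are relative to the unmanaged DRAM,
    so the baseline itself sits at (1.0, 1.0).
    """
    norm_power = 1.0 - power_reduction_pct / 100.0
    norm_perf = 1.0 - perf_overhead_pct / 100.0
    return round(norm_perf, 4), round(norm_power, 4)


# 100-cycle averages from Table VIII.
previous_point = tradeoff_point(64.67, 1.03)
our_point = tradeoff_point(75.46, 0.61)
print(previous_point, our_point)
```

Under this reading, a better policy moves a point toward higher normalized performance and lower normalized power, which is the direction in which the two curves are compared.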
It is noticeable that our work, unlike the previous work [21], is sensitive to the memory command patterns of the benchmarks. To show the difference, the evaluation results of benchmark combinations fp1 and fp3 are shown in Fig. 16 and Fig. 17, respectively. Read requests far outnumber write requests in fp3, while write requests dominate read requests in fp1.
Fig. 16 Power and performance trade-off characteristics for fp1.
Fig. 17 Power and performance trade-off characteristics for fp3.
The trade-off curves of the previous work [21] in Fig. 16 and Fig. 17 have similar slopes.
On the other hand, the slopes of the trade-off curves of our work differ between fp1 and fp3. The results show that when the application is more read intensive, our work reduces more power at the cost of slightly more system performance degradation. The reason is that the read-write aware throttling accumulates write accesses in the RQ until a read access appears. For read intensive applications, only a few write accesses are kept in the RQ by the read-write aware throttling. Therefore, a slight increase in system performance degradation indicates that the throttle delay is greatly lengthened, which also leads to a much better power reduction.
To evaluate how our work performs when the context dynamically forks out, an evaluation on the SPLASH-2 benchmarks [32] is carried out, and the resulting trade-off curves are shown in Fig. 18. Both the power reduction and the system performance degradation are normalized to the native DRAM. The system performance here is measured by the number of cycles used to complete a benchmark.
Fig. 18 Average power and performance trade-off characteristics on SPLASH-2 [32].
The trade-off curves in Fig. 18 show that the difference between our work and the previous work [21] is smaller than in the results obtained from running the SPEC CPU2006 benchmarks. The reason is that the SPLASH-2 benchmarks are not memory intensive compared to the SPEC CPU2006 benchmarks. As a result, the DRAM is turned off most of the time during the evaluation for both the previous work [21] and our work. However, our work still achieves a higher power reduction under the same system performance. The main memory requests per million cycles of the different benchmark combinations and benchmarks are listed in Table IX to show their memory intensity.
Table IX
Main memory requests per million cycles of different benchmarks
Benchmark combinations/
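The metric reported in Table IX can be computed as in the short sketch below. The function name and the request and cycle counts are made-up placeholders chosen only to show the arithmetic; they are not measurements from the evaluation.

```python
def requests_per_million_cycles(num_requests: int, num_cycles: int) -> float:
    """Main memory requests normalized to one million CPU cycles."""
    if num_cycles <= 0:
        raise ValueError("num_cycles must be positive")
    return num_requests * 1_000_000 / num_cycles


# Hypothetical counts, not taken from the thesis.
rate = requests_per_million_cycles(num_requests=50_000, num_cycles=200_000_000)
print(rate)
```

Normalizing by cycles rather than reporting raw request counts lets benchmarks with very different run lengths be compared on the same scale.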
The detailed evaluation results of each benchmark combination are shown in Table X. The Improvement columns in Table X are normalized to the previous work. The detailed results show that, for a variety of applications, our work provides a superior power reduction at the cost of a minor system performance degradation.
Table X
Detailed evaluation results on different throttle delays for SPEC CPU2006 [31]
Throttle   Previous work [21]         Our work                   Improvement
delay      Power (Watts)  MIPS        Power (Watts)  MIPS        Power   MIPS
Detailed evaluation results on different throttle delays for SPLASH-2 [32]
Throttle   Previous work [21]         Our work                   Improvement
delay      Power (Watts)  Cycles      Power (Watts)  Cycles      Power   Cycles
100        0.86591        386216401   0.80595        383973995   6.92%   0.58%
Throttle   Previous work [21]         Our work                   Improvement
delay      Power (Watts)  Cycles      Power (Watts)  Cycles      Power   Cycles
100        1.18733        1626732828  1.06719        1599358623  10.12%  1.68%
200        1.18243        1646152322  1.06244        1613998522  10.15%  1.95%
400        1.17409        1680569237  1.06071        1630723965  9.66%   2.97%
800        1.15385        1728206075  1.05251        1677411431  8.78%   2.94%
1600       1.11851        1783437872  1.03031        1740400454  7.89%   2.41%
3200       1.03481        1785862617  0.97675        1769778925  5.61%   0.90%
6400       0.97555        1793418974  0.94211        1781048450  3.43%   0.69%
(b) fft
Throttle   Previous work [21]         Our work                   Improvement
delay      Power (Watts)  Cycles      Power (Watts)  Cycles      Power   Cycles
100        0.70985        1140576902  0.69463        1138116645  2.14%   0.22%
200        0.70761        1144003008  0.69385        1139193870  1.94%   0.42%
400        0.70981        1149568000  0.69341        1141876406  2.31%   0.67%
800        0.70855        1151543397  0.69405        1150034077  2.05%   0.13%
1600       0.70548        1159231177  0.69312        1154615545  1.75%   0.40%
3200       0.70202        1158699823  0.68899        1160365744  1.86%   -0.14%
6400       0.6924         1161981653  0.68676        1159341940  0.81%   0.23%
(c) fmm
Throttle   Previous work [21]         Our work                   Improvement
delay      Power (Watts)  Cycles      Power (Watts)  Cycles      Power   Cycles
100        0.83416        567929757   0.79279        561936504   4.96%   1.06%
Throttle   Previous work [21]         Our work                   Improvement
delay      Power (Watts)  Cycles      Power (Watts)  Cycles      Power   Cycles
100        0.65358        1146352684  0.65019        1146958803  0.52%   -0.05%
200        0.65372        1146377562  0.65036        1145273539  0.51%   0.10%
400        0.65382        1146687597  0.6501         1143846545  0.57%   0.25%
800        0.65412        1146328710  0.64973        1145234276  0.67%   0.10%
1600       0.65343        1145896274  0.64998        1147168394  0.53%   -0.11%
3200       0.65286        1146058867  0.64953        1148778795  0.51%   -0.24%
6400       0.65252        1147747030  0.64993        1146213644  0.40%   0.13%
(e) barnes
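The Improvement columns of the SPLASH-2 results appear consistent with a reduction normalized to the previous work, that is, (previous - ours) / previous. The sketch below checks this reading against the 100-cycle row of the fft results. The helper name `relative_improvement` is an assumption made for illustration.

```python
def relative_improvement(previous: float, ours: float) -> float:
    """Reduction relative to the previous work, in percent (2 decimals)."""
    return round((previous - ours) / previous * 100.0, 2)


# 100-cycle row of the fft results: power in Watts, then cycle counts.
power_gain = relative_improvement(1.18733, 1.06719)
cycle_gain = relative_improvement(1626732828, 1599358623)
print(power_gain, cycle_gain)
```

A negative value, as in a few of the Cycles entries, means that our work needed more cycles than the previous work for that throttle delay.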
Chapter 5
Conclusions and Future Works
This thesis proposes a DRAM scheduling policy that significantly reduces the power consumption of DRAM. The read-write aware throttling mechanism allows the DRAM ranks to stay in the low power mode for more cycles. It improves the power saving by 10% to 15% on average. The rank level read-write reordering forces the DRAM to handle read requests, which are critical to the system performance, as soon as it can. It reduces the system performance degradation caused by the power management without sacrificing much power saving. In the experiments, our work reduces the DRAM power consumption by around 75%, which is better than the previous work and the oracle solution. Meanwhile, it causes only 1% to 3% system performance degradation, which is smaller than that of the existing power management policy.
As for future work, we will add a controller that dynamically adjusts the throttle delay at run time to relieve the system performance degradation. We will also explore the potential of combining the techniques proposed in this thesis with other related works such as the automatic data migration, which creates empty ranks that can be shut off until the throttle delay is reached.
It is possible to implement our work on hybrid main memories such as the cached DRAM, which decreases the number of DRAM accesses and therefore saves more power. Our work can also be improved by working with the write combining technique, which combines write requests at the last level cache. Furthermore, we will extend this work by utilizing the multiple low power modes provided by the most recent DRAM circuits.
References
[1] G. Zhang, H. Wang, X. Chen, S. Huang, and P. Li, “Heterogeneous multi-channel: Fine-grained DRAM control for both system performance and power efficiency,” in Proceedings of the 49th Design Automation Conference, 2012.
[2] A. N. Udipi, N. Muralimanohar, N. Chatterjee, R. Balasubramonian, A. Davis, and N. P. Jouppi, “Rethinking DRAM design and organization for energy-constrained multi-cores,” in Proceedings of the 37th International Symposium on Computer Architecture, 2010.
[3] V. Delaluz, M. Kandemir, N. Vijaykrishnan, A. Sivasubramaniam, and M. J. Irwin, “DRAM energy management using software and hardware directed power mode control,” in Proceedings of the 7th International Symposium on High-Performance Computer Architecture, 2001.
[4] K. Chandrasekar, B. Akesson, and K. G. W. Goossens, “Run-time power-down strategies for real-time SDRAM memory controllers,” in Proceedings of the 49th Design Automation Conference, 2012.
[5] K. Lim, J. Chang, T. Mudge, P. Ranganathan, S. K. Reinhardt, and T. F. Wenisch, “Disaggregated memory for expansion and sharing in blade servers,” in Proceedings of the 36th International Symposium on Computer Architecture, 2009.
[6] D. Meisner, B. Gold, and T. Wenisch, “PowerNap: Eliminating server idle power,” in Proceedings of the 36th International Symposium on Computer Architecture, 2009.
[7] K. T. Malladi, I. Shaeffer, L. Gopalakrishnan, D. Lo, B. C. Lee, and M. Horowitz, “Rethinking DRAM power modes for energy proportionality,” in Proceedings of the 45th International Symposium on Microarchitecture, 2012.
[8] JEDEC Standard: DDR2 SDRAM Specification, JEDEC Solid State Technology Association, 2009.
[9] E. Ipek, O. Mutlu, J. F. Martinez, and R. Caruana, “Self-optimizing memory controllers: A reinforcement learning approach,” in Proceedings of the 35th International Symposium on Computer Architecture, 2008.
[10] J. Janzen, “Calculating memory system power for DDR SDRAM,” Designline, vol. 10, no. 2, 2001.
[11] Calculating memory system power for DDR3, Micron Technology, Inc., 2007.
[12] 512Mb: x4, x8, x16 DDR2 SDRAM, Micron Technology, Inc., 2012.
[13] H. Park, S. Yoo, and S. Lee, “Power management of hybrid DRAM/PRAM-based main memory,” in Proceedings of the 48th Design Automation Conference, 2011.
[14] G. Dhiman, R. Ayoub, and T. Rosing, “PDRAM: A hybrid PRAM and DRAM main memory system,” in Proceedings of the 46th Design Automation Conference, 2009.
[15] N. AbouGhazaleh, B. Childers, D. Mosse, and R. Melhem, “Energy conservation in memory hierarchies using power-aware cached-DRAM,” in Proceedings of the 23rd International Conference on Computer Design, 2005.
[16] H. Zheng, Z. Zhang, E. Gorbatov, and Z. Zhu, “Mini-rank: Adaptive DRAM architecture for improving memory power efficiency,” in Proceedings of the 40th International Symposium on Microarchitecture, 2008.
[17] J. Liu, B. Jaiyen, R. Veras, and O. Mutlu, “RAIDR: Retention-aware intelligent DRAM refresh,” in Proceedings of the 39th International Symposium on Computer Architecture, 2012.
[18] V. D. L. Luz, M. Kandemir, and I. Kolcu, “Automatic data migration for reducing energy consumption in multi-bank memory systems,” in Proceedings of the 39th Design Automation Conference, 2002.
[19] M. Pedram, “Power optimization and management in embedded systems,” in Proceedings of the Asia and South Pacific Design Automation Conference, 2001.
[20] G. Thomas, K. Chandrasekar, B. Akesson, B. Juurlink, and K. Goossens, “A predictor-based power-saving policy for DRAM memories,” in Proceedings of the 15th Conference on Digital System Design, 2012.
[21] I. Hur and C. Lin, “A comprehensive approach to DRAM power management,” in Proceedings of the 14th International Symposium on High-Performance Computer Architecture, 2008.
[22] K. Chandrasekar, B. Akesson, and K. Goossens, “Run-time power-down strategies for real-time SDRAM memory controllers,” in Proceedings of the 49th Design Automation Conference, 2012.
[23] S. Rixner, W. J. Dally, U. J. Kapasi, P. Mattson, and J. D. Owens, “Memory access scheduling,” in Proceedings of the 27th International Symposium on Computer Architecture, 2000.
[24] J. Mukundan and J. F. Martinez, “MORSE: Multi-objective reconfigurable self-optimizing memory scheduler,” in Proceedings of the 18th International Symposium on High-Performance Computer Architecture, 2012.
[25] H. Hanson and K. Rajamani, “What computer architects need to know about memory throttling,” in Workshop on Energy-Efficient Design, 2010.
[26] C. J. Lee, V. Narasiman, E. Ebrahimi, O. Mutlu, and Y. N. Patt, “DRAM-aware last-level cache writeback: Reducing write-caused interference in memory system,” Tech. Rep., Apr. 2010.
[27] M. K. Qureshi, M. M. Franceschini, and L. A. Lastras-Montano, “Improving read performance of phase change memories via write cancellation and write pausing,” in Proceedings of the 16th International Symposium on High-Performance Computer Architecture, 2010.
[28] R. Ubal, B. Jang, P. Mistry, D. Schaa, and D. Kaeli, “Multi2Sim: A simulation framework for CPU-GPU computing,” in Proceedings of the 21st International Conference on Parallel Architectures and Compilation Techniques, 2012.
[29] P. Rosenfeld, E. Cooper-Balis, and B. Jacob, “DRAMSim2: A cycle accurate memory system simulator,” Computer Architecture Letters, vol. 10, no. 1, pp. 16–19, Jan.–Jun. 2011.
[30] Cortex-A9 MPCore technical reference manual, ARM, 2012.
[31] J. L. Henning, “SPEC CPU2006 benchmark descriptions,” SIGARCH Computer Architecture News, vol. 34, no. 4, pp. 1–17, Sep. 2006.
[32] S. C. Woo, M. Ohara, E. Torrie, J. P. Singh, and A. Gupta, “The SPLASH-2 programs: Characterization and methodological considerations,” in Proceedings of the 22nd International Symposium on Computer Architecture, 1995.