Conclusions and Future Works - Reducing Branch Target Buffer Memory Usage and Power Consumption by Early Branch Target Generation

The first part of this chapter draws conclusions, and the second part proposes possible future work.

5.1 Conclusions

In this thesis, we presented a Branch Handling Unit (BHU) that dynamically identifies branches and generates target addresses for both PC-relative and indirect branches. The BHU is intended to ease the information load on the Branch Target Buffer (BTB). Branches whose target addresses the BHU can generate need not register entries in the BTB, so the storage requirement is lower. Because of some constraints, however, the BTB cannot be eliminated from the system completely; with the help of the BHU it can only be downsized. In a nutshell, this research offers a trade-off: by exercising the BHU, which is composed of a number of logic units and small buffers, dynamic power is spent in exchange for leakage power. The overall outcome is worthwhile in modern manufacturing processes, where leakage power consumption dominates dynamic power consumption.
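The storage argument can be illustrated with a minimal sketch. It assumes fixed 4-byte instructions, a signed displacement counted in instructions, and an indirect branch whose register value may or may not be available in time. The names `Branch`, `generate_target`, and `needs_btb_entry`, and the encoding itself, are hypothetical and do not describe the actual BHU logic.

```python
from dataclasses import dataclass
from typing import Optional

INSTR_BYTES = 4  # assumption: fixed-width 4-byte instructions


@dataclass
class Branch:
    pc: int                          # address of the branch instruction
    kind: str                        # "pc_relative" or "indirect"
    disp: int = 0                    # signed displacement in instructions (PC-relative only)
    reg_value: Optional[int] = None  # target register value if ready, else None


def generate_target(br: Branch) -> Optional[int]:
    """Return the target address if it can be generated early, else None."""
    if br.kind == "pc_relative":
        # Target = address of the next instruction + scaled displacement.
        return br.pc + INSTR_BYTES + br.disp * INSTR_BYTES
    if br.kind == "indirect" and br.reg_value is not None:
        # Assumption: the register holding the target is already available.
        return br.reg_value
    return None  # cannot be generated early


def needs_btb_entry(br: Branch) -> bool:
    """Only branches whose target cannot be generated need a buffer entry."""
    return generate_target(br) is None


branches = [
    Branch(pc=0x1000, kind="pc_relative", disp=-2),
    Branch(pc=0x2000, kind="indirect", reg_value=0x3000),
    Branch(pc=0x2004, kind="indirect"),  # register not ready: falls back to the buffer
]
entries_needed = sum(needs_btb_entry(b) for b in branches)
```

In this toy stream only the last branch would occupy a buffer entry, which is the sense in which generated targets relieve the BTB.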

Aside from reducing power consumption, the method in this thesis significantly lowers the area requirement of the branch predictor. From a manufacturing standpoint, this can be a great advantage in terms of price and yield. Reducing the information load also decreases the number of updates to the BTB, leading to fewer evictions and replacements. Branches stay in the BTB longer, which enlarges the abstract window of branches for which predictions can be made. This offers a good opportunity to improve overall system performance.

The BHU is designed to be a low-latency, lightweight unit in the system. That such a simple unit can remove so many entries from the BTB suggests that the conventional BTB design spends storage inefficiently on tracking branch behavior. Our experimental results show that up to 85% of the BTB storage can be unnecessary and replaced by the BHU. This fact pushes us to review the nature of history-based predictors. In the chase for higher computational power and process scaling, is every piece of hardware put in the right place and utilized well, down to every gate? Or are we simply forcing more and more logic gates onto a chip and gaining less and less efficiency than they should deliver?

5.2 Future Works

Future work on our BHU design falls into three aspects, all aimed at the same goal: further power reduction for the BTB. The three aspects are further downsizing of the RBTB, filtering of unnecessary BHU + RBTB accesses, and compiler co-design.

First, the opportunity to shrink the RBTB further lies in shortening the width of each entry. Since the BHU has, in theory, already removed all the information that can be spared from the RBTB, the number of bits stored per entry is the only part left to attack. The two related works introduced in Chapter 2 offer suitable solutions. The RBTB in our system operates exactly like a conventional BTB, so the two independent methods should coexist with our design without difficulty. With shortened tag and address fields, the RBTB may become a buffer so lightweight and so fully utilized that every part of it earns its cost.

The second aspect is dynamic power reduction. At present the BHU and RBTB are exercised every cycle, which is excessive given that not every instruction is a branch. Filtering of unnecessary accesses can be done separately for the RBTB and the BHU or as a single task. Much research has been proposed on reducing the BTB access count, and, like the two related works mentioned above, it can also be applied to our design. Access filtering can also come from the design itself. In a system with loose timing constraints, partial decoding can be serialized with the subsequent target generation or lookup. An alternative is to use pre-decoding, so that branches are identified even before they enter the pipeline.
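The effect of pre-decode filtering on access count can be sketched as follows. The sketch assumes one pre-decode bit per fetched instruction that marks it as a branch; the function name `count_predictor_accesses` and the toy instruction stream are hypothetical and are not measured results.

```python
from typing import Sequence


def count_predictor_accesses(is_branch_bits: Sequence[bool], filtered: bool) -> int:
    """Count BHU/RBTB accesses over a fetched instruction stream.

    is_branch_bits: one pre-decode bit per instruction (True = branch).
    filtered: False models access on every instruction (no filtering);
              True models access only when the pre-decode bit is set.
    """
    if not filtered:
        return len(is_branch_bits)
    return sum(1 for bit in is_branch_bits if bit)


# Toy stream: two branches among five instructions.
stream = [False, True, False, False, True]
unfiltered = count_predictor_accesses(stream, filtered=False)
with_filter = count_predictor_accesses(stream, filtered=True)
```

The actual saving depends on the branch density of the workload and on the cost of storing and reading the pre-decode bits, neither of which this sketch models.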

Finally, compiler co-design can improve the efficiency of our current BHU + RBTB design. Data dependency has always been an issue that compilers fight against. By rearranging instruction placement in the instruction cache so that branches stay away from the starting position of a cache line, the number of unhandleable branches can be decreased. Avoiding consecutive branches is another way to eliminate unhandleable branches. Setting constraints so that every indirect branch target is kept in an instruction-set-defined register can also improve prediction accuracy in our design.

