In recent years, the pursuit of higher computational power has focused general-purpose processor design on performance exploitation, and many methods and ideas have been introduced to that end. One of the major factors that affect execution smoothness is the branch instruction. A branch instruction has two possible outcomes: execution proceeds sequentially to the next instruction, or it jumps to a target address. The Branch Predictor was proposed to decide the next move when a branch instruction is encountered.
A Branch Predictor is composed of two major parts: a Direction Predictor and a Target Address Predictor. The Direction Predictor judges whether a given branch instruction will jump to a distant address or fall through to the next sequential instruction. The Target Address Predictor, on the other hand, can be viewed as a companion to the Direction Predictor, since the target is required only when a branch is predicted to jump. In implementation, both predictors are history-based storage units made of SRAM. Based on the history recorded during program execution, the predictor can make accurate predictions of future behavior for branches that are encountered again. In current processor designs, a Branch Predictor usually has two separate physical components: a Direction Predictor (for direction prediction) and a Branch Target Buffer, or BTB (for target address prediction).
Aside from computational ability, the power consumption of electronics draws more and more attention today. Portable devices such as cell phones and laptop computers depend strongly on longer battery life for a better user experience. Even for plugged-in devices, lower power consumption is expected in order to achieve more energy-efficient solutions. The power consumption of electronics can be divided into two parts: dynamic power and leakage power.
While dynamic power comes from every exercise of or access to a circuit component, leakage power is consumed statically whenever the device is turned on. It has been shown that under a 70nm process, more than 60% of the total system power is consumed statically [1]. In the storage units of a system, such as the SRAM-based Branch Target Buffer (BTB), power leakage is particularly serious, and this power consumption increases linearly with the storage size.
The BTB serves many purposes in a system, one of which is branch identification. The BTB is accessed in a way similar to a cache: it is indexed by the program counter (PC), and a tag comparison then determines a hit. A hit in the BTB implies that the current instruction is a previously executed branch. In order to identify branch instructions, the BTB is accessed with the PC every cycle. To improve the hit rate, designers tend to increase the number of BTB entries, and to determine a hit, the BTB must maintain a tag field in its structure along with a target address field. Being a large and frequently accessed structure, the BTB is destined to be a power-hungry component in the system. Statistics show that the BTB is the second-largest on-chip memory, second only to the cache system, and consumes approximately 5%~10% of total power in processor systems such as the Pentium Pro and Alpha 21264 [2] [3].
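To make the access pattern described above concrete, the following is a minimal sketch of a direct-mapped BTB in Python. It is not the structure evaluated in this thesis: the class name, the entry count, the 4-byte instruction alignment, and the use of all remaining PC bits as the tag are assumptions made only for illustration.

```python
# Minimal direct-mapped BTB model (illustrative sketch only).
# Assumptions: 4-byte aligned instructions, power-of-two entry count,
# and the full remaining PC bits used as the tag.

class BranchTargetBuffer:
    def __init__(self, num_entries=512):
        # The index is taken from low PC bits, so the size must be a power of two.
        assert num_entries > 0 and num_entries & (num_entries - 1) == 0
        self.index_bits = num_entries.bit_length() - 1
        self.index_mask = num_entries - 1
        # Each entry is either None (invalid) or a (tag, target) pair.
        self.entries = [None] * num_entries

    def _split(self, pc):
        word = pc >> 2  # drop the byte offset within a 4-byte instruction
        return word & self.index_mask, word >> self.index_bits

    def lookup(self, pc):
        """Return the stored target on a hit, or None on a miss."""
        index, tag = self._split(pc)
        entry = self.entries[index]
        if entry is not None and entry[0] == tag:
            return entry[1]  # hit: this PC was previously recorded as a branch
        return None

    def update(self, pc, target):
        """Record (or overwrite) the target of an executed branch."""
        index, tag = self._split(pc)
        self.entries[index] = (tag, target)
```

In this sketch, `lookup` would be called with the fetch PC every cycle, and a non-`None` result both identifies the instruction as a branch and supplies its target. Two PCs that map to the same index evict each other, which is one reason designers tend to increase the number of entries.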
Overall, the branch predictor is an irreplaceable unit when it comes to efficient program execution. Meanwhile, it comes at a price of not only chip area but also power consumption. The BTB, a key member of the branch predictor, perfectly demonstrates the age-old dilemma of architecture design: the compromise between cost and performance.
1.1 Branch and Target Generation Method
The branch predictor handles branches, so we first describe what we are dealing with. Branches can be put into two categories according to how their target addresses are generated:
PC-relative Branch - Also known as a direct branch. PC-relative branches may be conditional or unconditional. Regardless of how the condition that determines direction is set, all these branches share the same method of target address generation: target address = [PC + 4 + branch offset]. While the PC is known to the system, the branch offset can be extracted from the branch instruction body according to the defined instruction format. Once the PC and the instruction are in hand, the target address can be generated with just the latency of an adder. In a conventional five-stage pipeline, after an instruction is decoded and identified as a branch, its instruction address (PC) and the extracted offset are added together during the EXE stage to determine the target address.
Indirect Branch - Indirect branches usually have their target addresses kept in register file entries. Often, the instruction set defines certain fixed register entries for this purpose. However, this category of branch may differ from instruction set to instruction set. For example, the Alpha instruction set shows a very strong characteristic: all indirect branches in Alpha are unconditional (and so are also referred to as indirect jumps), and their target addresses are provided by certain predefined register file entries (i.e. $ra, $pv, $AT). In other instruction sets, such as the ARM instruction set, indirect branches can be trickier to identify, since any MOV instruction with a destination of $15 (i.e. the PC) should be treated as an indirect jump instruction. Nevertheless, the ARM instruction set still defines a register file entry as the Link Register, into which an address is loaded on a CALL and from which it is later read on a RETURN.
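As an illustration of the PC-relative target generation described above, the following Python sketch computes target address = PC + 4 + branch offset. The 32-bit address space, the 16-bit offset field, and the treatment of the offset as an already scaled signed byte offset are assumptions for this example; real instruction sets differ in field width and scaling.

```python
# Sketch of PC-relative target generation: target = PC + 4 + branch offset.
# Assumptions: 32-bit addresses, 4-byte instructions, and an offset field
# that is a signed (two's complement) byte offset of `offset_bits` bits.

ADDRESS_MASK = 0xFFFFFFFF
INSTRUCTION_BYTES = 4


def sign_extend(value, bits):
    """Interpret the low `bits` bits of `value` as a two's complement number."""
    sign_bit = 1 << (bits - 1)
    return (value & (sign_bit - 1)) - (value & sign_bit)


def pc_relative_target(pc, offset_field, offset_bits=16):
    """Return the branch target for a PC-relative branch located at `pc`."""
    offset = sign_extend(offset_field, offset_bits)
    # A single addition, matching the adder latency mentioned in the text.
    return (pc + INSTRUCTION_BYTES + offset) & ADDRESS_MASK
```

For example, under these assumptions a branch at 0x1000 with offset field 0x0010 targets 0x1014, and the same branch with offset field 0xFFF8 (that is, -8) targets 0x0FFC.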
According to our experimental results, most branches in common programs are PC-relative branches. The direct branch category accounts for approximately 90% of total branches on average, whereas indirect branches account for the remaining 10%. Since branches in the same category have similar or even identical target generation methods, we suppose it is possible to design a simple mechanism that provides targets for branches with reasonable hardware overhead. By doing so, branches would have an alternative, lightweight handler besides the conventional large, power-hungry BTB.
1.2 Research Motivation
So far we have introduced the purpose of a BTB (branch identification and target provision), its size (one of the largest on-chip memories in the system, second only to the cache system), and its power consumption (5%~10% of total power in a general-purpose system). In addition, we have observed that the targets stored in the BTB are rather easy to obtain: simply by exercising an integer arithmetic unit and accessing register entries, we can generate the target addresses for all branches.
We have found that the BTB uses a disproportionate amount of storage for the information it contains. If we can find a way to generate target addresses without disrupting the overall branch prediction timing, the information load on the BTB can be greatly lessened, and thus the number of BTB entries can also be greatly reduced.
Storage reduction of the BTB not only lowers static power consumption linearly, but also lowers dynamic power, because a smaller structure consumes less power on every access.
Through both static and dynamic power reduction, BTB storage reduction contributes significantly to overall power saving.
1.3 Research Objective
We intend to design a mechanism that logically serves the same purposes as the BTB: branch identification and target address provision. We name our design a “Branch Handling Unit” (BHU), since its original goal is to “handle” branches and ease the information load on the BTB. The BHU design is composed of two parts, a Branch Identifier and an Early Branch Target Generator, which provide the above-mentioned functions. Implementing the BHU in the system has three main goals:
1. To reduce the BTB storage requirement:
With branches handled by the BHU, fewer BTB entries are required. Less storage benefits both power consumption and chip size. However, this does not mean the BTB is no longer necessary in the system. On one hand, due to the unpredictable nature of some branches (e.g. indirect jumps), storing the last target address for certain branches is still an irreplaceable method. On the other hand, the BHU design faces some physical constraints, so a BTB is still required in the system to function as its counterpart. We name the remaining BTB a Reduced BTB (RBTB), since it has far fewer entries than the original one. This will be discussed in a later chapter.
2. To reduce BTB power consumption:
The power reduction can be evaluated in terms of its static and dynamic parts.
a. Leakage power:
The leakage power of the BTB scales linearly with the size of the storage used.
Under today's advanced processes, where static power accounts for more than half of total system power, focusing on static power reduction is effective for overall power saving.
b. Dynamic power:
The dynamic power of the BTB is determined by the power consumed per access and the number of accesses. A BTB with fewer entries offers lower access power, and because branches are processed by the BHU, fewer BTB updates need to be done. Both facts noticeably reduce the dynamic power of the BTB.
3. To maintain or improve system performance:
Branch penalty can dominate the performance of a system. In branch predictor design, performance degradation is undesired and, under most circumstances, unacceptable. Since the BTB accounts for 5%~10% of system power, any performance degradation is penalized 10X~20X as heavily relative to the BTB power saved; by the same reasoning, any performance improvement is rewarded 10X~20X as much. Our goal is to at least maintain system performance, or even to improve it at reasonable cost, while achieving the above-mentioned storage and power reductions.
1.4 Organization of this thesis
The remaining chapters of this thesis are organized as follows. Chapter 2 provides background knowledge on the BTB, introduces related work, and gives a brief comparison indicating the opportunity we find worth pursuing. Chapter 3 presents two different approaches to our design and a simple evaluation of the two methods; the challenges encountered in implementation are also discussed, together with practical methods or conceptual solutions. Chapter 4 describes the simulation technique and results of this work, and lists the environmental assumptions. Finally, Chapter 5 gives a summary and proposes some future work.