Chapter 3 Branch Handling Unit Design
3.3 Design of BHU
3.3.1 Early Branch Target Generator (EBTG)
We first examine the Early Branch Target Generator (EBTG). As mentioned before, the example uses the Alpha instruction set. Recall from Table 3-1 that there are four ways to generate the target address of a branch in the Alpha instruction set:
1. Perform (PC + 4 + offset) for all PC-relative branches.
2. Access $26 for RETN.
3. Access $27 for JSR.
4. Access $28 for JMP.
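The four cases above can be summarised in a short sketch. It is illustrative only: the names `BranchKind`, `TARGET_REGISTER`, and `generate_target` are hypothetical, and the register file is modelled as a plain dictionary.

```python
from enum import Enum


class BranchKind(Enum):
    PC_RELATIVE = "pc_relative"
    RETN = "retn"
    JSR = "jsr"
    JMP = "jmp"


# Register holding the target of each indirect branch, as listed above.
TARGET_REGISTER = {
    BranchKind.RETN: 26,
    BranchKind.JSR: 27,
    BranchKind.JMP: 28,
}


def generate_target(kind, pc, offset, regs):
    """Return the target address for one branch.

    pc and offset are byte quantities; regs maps register number to value.
    """
    if kind is BranchKind.PC_RELATIVE:
        return pc + 4 + offset
    return regs[TARGET_REGISTER[kind]]
```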
The branch targets defined by the instruction set are as described above. Note that the other indirect branch, JSR_COROUTINE, is rarely used in practice, and its register is left undefined by the instruction set; in fact, it does not appear at all in our benchmark set. Assume a single-issue, five-stage pipeline. Since our goal is early branch target generation, we intend to move the above operations, which would normally be performed in the EXE stage, to the earlier IF stage. This is where fast instruction access from the instruction buffer (IB) comes in handy.
Target Address Generation for PC-relative Branches
To generate the target address of a PC-relative branch, the offset must first be extracted from the instruction so that the addition can be performed. Since the instruction buffer (IB) holds only a single line, the instructions in it are available within a short access time. We propose to give the BHU an extra, dedicated adder to perform PC + 4 + offset, instead of sharing a functional unit with the main pipeline. The key property of PC-relative branches is that once the (PC + 4 + offset) addition has completed correctly, the target address cannot be wrong. The only thing that can still go wrong is the direction prediction, which is not our focus here.
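A minimal sketch of the offset extraction and addition follows. It assumes the Alpha branch format (a 21-bit signed displacement in bits 20..0, counted in instructions, so the byte offset is 4 times the displacement) and a 32-bit address width, matching the 32-bit buffers used in this design. The function names are hypothetical.

```python
ADDR_BITS = 32   # assumed address width for this sketch
DISP_BITS = 21   # width of the displacement field in an Alpha branch


def branch_displacement(instr: int) -> int:
    """Return the sign-extended displacement field, counted in instructions."""
    disp = instr & ((1 << DISP_BITS) - 1)
    if disp & (1 << (DISP_BITS - 1)):      # sign bit set: negative value
        disp -= 1 << DISP_BITS
    return disp


def pc_relative_target(pc: int, instr: int) -> int:
    """Byte address of the taken target: PC + 4 + offset."""
    offset = 4 * branch_displacement(instr)
    return (pc + 4 + offset) & ((1 << ADDR_BITS) - 1)
```

For example, a branch at byte address 0x1000 with a displacement of 3 yields the target 0x1010, and a displacement of -1 yields 0x1000, the branch itself.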
In a traditional five-stage pipeline, EXE takes one cycle, so we have reason to believe that a simple integer addition can be completed within the timing constraint of branch prediction, that is, within the first pipeline stage. Even in deeply pipelined systems, integer addition is still considered a short-latency operation compared to integer multiplication, integer division, floating-point operations, or memory operations. In addition, many existing techniques show that an integer adder can be customized for timing or area and can easily be tuned for very short latency. To sum up, an integer adder is likely to fit into one cycle even in machines with a high clock rate. More timing details are presented in Section 3.3.5.
Target Address Generation for Indirect Branches
As for the three specific indirect branches, JMP, JSR, and RETN, we propose to provide their target addresses from three Register Buffers. Reading the register file would be a straightforward way to obtain the target addresses, but it may require another dedicated set of read ports on the register file to avoid hardware conflicts. This drawback is unacceptable in a low-power design such as this work. Buffering the register values is therefore a reasonable alternative, with the advantages of short access latency and low power consumption. Every time a write is destined for one of the three specific register file entries, $26, $27, or $28, a copy of the written value is placed in the corresponding register buffer. When one of these indirect branch instructions is encountered, the target address is therefore ready within one buffer access time. Note that the original register read was supposed to be done in the ID stage. Because the operation is now done earlier, in the IF stage, by reading the register buffer, it may suffer from data dependencies. In addition, the conventional register usage may not always be followed: the compiler can apply its own strategy for optimizing register usage and violate the conventional rules in practice. Either of these can result in an incorrect target address. However, according to our experiments, this simple method provides a correct target address for indirect branches more than 70% of the time. As for the branches given incorrect target addresses, we propose to store them in the RBTB so that they are handled correctly in the future.
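The buffering scheme can be sketched as follows. This is a behavioural model only, under the assumption of 32-bit buffers; the class and method names are hypothetical.

```python
BUFFERED_REGS = (26, 27, 28)   # $26, $27, $28, as named in the text
VALUE_MASK = 0xFFFFFFFF        # assumed 32-bit buffer width


class RegisterBuffers:
    """Shadow copies of the three registers used as indirect branch targets."""

    def __init__(self):
        self.buf = {reg: 0 for reg in BUFFERED_REGS}

    def on_register_write(self, reg: int, value: int) -> None:
        """Observe every register-file write; copy only the buffered ones."""
        if reg in self.buf:
            self.buf[reg] = value & VALUE_MASK

    def early_target(self, reg: int) -> int:
        """Read the buffered value in the IF stage, without a register-file port."""
        return self.buf[reg]
```

The sketch does not model the data-dependency case described above, in which a write to a buffered register is still in flight when the buffer is read; in that case `early_target` would return the old value.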
Some may argue that a Return Address Stack (RAS) is another way to provide target addresses. Though a RAS is indeed a useful design for RETURN instructions, it does not work for other indirect branches, in our case the JMP and JSR instructions. In addition, a typical RAS needs 16 or 32 entries for reasonable prediction accuracy, while our proposal needs only 3 entries for all indirect branches. Recall that this research focuses on storage reduction, so it is easy to see why we chose to leave out the RAS and use register buffers.
Figure 3-5 shows our BHU design for the Alpha instruction set. The extra dedicated adder takes two inputs, the PC and the offset, the latter being available from the instruction buffer. Note that for a 32-bit system, target addresses can be provided as word addresses instead of byte addresses. Since adding 4 to a byte address is the same as adding 1 to a word address, (PC + 4 + offset) can be computed by feeding PC and offset to the adder as inputs and setting the carry-in bit to 1. This way, no latency is incurred beyond that of the adder itself.
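The carry-in arrangement can be checked with a small model. This is a sketch under the assumption of 32-bit byte addresses (hence 30-bit word addresses) and an offset already expressed in words; the function names are hypothetical.

```python
ADDR_BITS = 32
WORD_BITS = ADDR_BITS - 2   # a word address is the byte address shifted right by 2


def adder(a: int, b: int, carry_in: int, width: int) -> int:
    """Model a width-bit adder with a carry-in; the carry-out is discarded."""
    return (a + b + carry_in) & ((1 << width) - 1)


def target_word_address(pc_byte: int, offset_words: int) -> int:
    """Compute the target as a word address with a single addition."""
    pc_word = pc_byte >> 2
    offset = offset_words & ((1 << WORD_BITS) - 1)   # two's-complement encoding
    return adder(pc_word, offset, 1, WORD_BITS)       # carry-in of 1 supplies the +4
```

With the PC at byte address 0x1000 and an offset of 3 words, the result is word address 0x404, that is, byte address 0x1010, the same value as 0x1000 + 4 + 12.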
The three register buffers are simple to implement. They share the write-enable signals and the write data bus with the physical register entries $26, $27, and $28. The only overhead is the extra bus lines and the storage for the three 32-bit buffers.
Figure 3-5: Overview of the implementation of a BHU for the Alpha instruction set.
Target addresses are generated by the adder and the register buffers. The final branch target decision is made by the Branch Identifier (BI), for which two different proposals are introduced later in this chapter.