Chapter 3 Branch Handling Unit Design
3.5 BHU in More Complicated Pipeline
The BHU design targets general-purpose processors. So far we have described it in a conventional single-issue, five-stage pipeline, where it works smoothly and efficiently. By today's performance-hungry standards, however, such a pipeline configuration is dated. The design can nevertheless be applied to more sophisticated systems in a way that exploits most, if not all, of the BHU's functionality. This section shows how the BHU design adapts to more complicated pipelines.
A quick glance at modern architectures shows that the pursuit of higher performance has brought major improvements, which in turn lead to more complicated pipeline designs. Among the many ideas realized in today's processors, the two mainstream directions are increasing the issue width and increasing the clock rate. With multiple issue, the theoretical throughput of the pipeline is multiplied by the issue width; a higher clock rate shortens the interval at which the pipeline delivers finished instructions. Both are common strategies for raising system performance, and both have delivered major gains over the past decades. Our design can be modified to fit a system with a wide issue width and a high clock rate. Note that out-of-order execution does not complicate the BHU design at all. Even though instructions may be executed out of order, the program itself is still sequential, so the pipeline frontend, where instructions are fetched and branch predictions are made, still operates in order. This section therefore focuses on the forms the BHU takes in systems with wider issue width and higher clock rate.
3.5.1 Multiple Issue Systems
We first look at multiple-issue architectures. In an n-issue system, n instructions are fetched from the cache every cycle. As the guide of the fetcher, the branch predictor must identify the branch instructions among these n instructions and provide possible target addresses and credible taken or not-taken forecasts. The only difference from the single-issue pipeline described earlier is quantity: more predictions must be made per cycle, which is not hard to handle. A straightforward idea is to duplicate the BHU for each pipe in such a system. Figure 3-16 shows the case of a 2-issue pipeline.
Figure 3-16: BHU adaption in a 2-issue pipeline.
The arrangement in Figure 3-16 works, but some of the BHU hardware in this scenario is redundant. Recall that the BHU contains two parts, one for PC-relative branches and one for indirect branches. By our earlier assumption, the PC-relative route, which relies on an integer adder, is on the critical path. Given the tight timing constraint, it would be risky for the two instructions to use one integer adder in sequence (the adder would run on its own clock and compute two target addresses in one system cycle, and buffers would be needed to latch the outputs). For indirect branches, however, we can avoid duplicating the register buffers. The register buffers hold the values of the corresponding register entries, which should already be available as target addresses at this point. Hence one copy of the register buffers should be sufficient. Figure 3-17 shows the version with this redundancy eliminated. In addition, the two Partial Decoders (PDs) may be simplified further. Since partial decoding incurs very short latency (shown to equal the delay of three AND gates in series), instructions in different pipes may be able to use a common PD in turn (in the same way as the shared adder described above) without violating any timing constraint. Of course, this decision should be based on the issue width, cycle time, and PD latency.
Figure 3-17: BHU adaption after eliminating the redundant parts.
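The sharing decisions discussed for Figure 3-17 can be sketched as a small resource-planning calculation. The following Python sketch is illustrative only and is not part of the BHU design itself: the names `BhuResources` and `plan_bhu`, the notion of a fixed time budget reserved for partial decoding, and the picosecond figures in the usage lines are all assumptions introduced here. It keeps one adder per pipe, keeps a single copy of the register buffers, and lets as many pipes share one PD in turn as fit within the assumed budget.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BhuResources:
    """Hardware counts for an n-issue BHU (hypothetical model)."""
    adders: int
    register_buffer_sets: int
    partial_decoders: int


def plan_bhu(issue_width: int, pd_latency_ps: float,
             pd_budget_ps: float) -> BhuResources:
    """Count BHU components for a given issue width.

    pd_budget_ps is an assumed slice of the cycle reserved for partial
    decoding; the text does not define such a budget.
    """
    if issue_width < 1:
        raise ValueError("issue_width must be at least 1")
    if pd_latency_ps <= 0 or pd_budget_ps <= 0:
        raise ValueError("latencies must be positive")

    # Number of instructions that can use one PD back to back within the
    # budget. Clamp to at least 1: if even a single use does not fit, the
    # design misses timing anyway, and we simply count one PD per pipe.
    users_per_pd = max(1, int(pd_budget_ps // pd_latency_ps))
    users_per_pd = min(users_per_pd, issue_width)

    partial_decoders = -(-issue_width // users_per_pd)  # ceiling division

    return BhuResources(
        adders=issue_width,        # critical path: never shared
        register_buffer_sets=1,    # one copy serves every pipe
        partial_decoders=partial_decoders,
    )


if __name__ == "__main__":
    # Hypothetical timings: a 30 ps PD with a 60 ps budget can be shared.
    print(plan_bhu(2, pd_latency_ps=30.0, pd_budget_ps=60.0))
    # With a 50 ps budget the second use no longer fits, so each pipe
    # keeps its own PD.
    print(plan_bhu(2, pd_latency_ps=30.0, pd_budget_ps=50.0))
```

In this model the adder count always equals the issue width, which reflects the argument that the PC-relative route is too timing-critical to share, while the PD count falls only when the assumed budget allows sequential reuse.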
3.5.2 High Clock Rate Systems
We now turn to high clock rates. A high clock rate significantly shortens the time within which the BHU must complete its work. This may seem alarming, given that branch prediction is a now-or-never job at the very front of the pipeline, but in practice a short cycle time is less of a problem than it appears. A closer look at the components of the BHU shows it to be a lightweight, short-latency unit.
Figure 3-18 shows every part of the BHU design. As the figure shows, the instruction buffer can be accessed with very short latency. The Partial Decoder (PD) consists of four parallel AND gates, one with a two-bit input and the other three with six-bit inputs. The critical path of this part is clearly dominated by the six-input AND, which in the worst case can be built from three levels of two-input AND gates. The integer adder can first be shortened in length and, as mentioned earlier in this thesis, can be customized and clocked at up to 10 GHz in practice [8][9][10], so it is likely to meet the timing requirement. The set-to-zero and cascade operations that follow the integer addition cause barely any delay. The three register buffers can be accessed with even shorter latency, given that each is just 32 bits in size. Finally, the MUX can be implemented with pass transistors to lower its pass-through latency.
Figure 3-18: Components in BHU.
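The three-level figure quoted for the six-input AND can be checked with a short calculation. The sketch below is a minimal illustration, assuming a balanced tree of identical gates; the function name `and_tree_depth` and the tuple `PD_GATE_INPUTS` are introduced here for illustration, with the input widths taken from the PD description above.

```python
def and_tree_depth(num_inputs: int, fan_in: int = 2) -> int:
    """Levels of fan_in-input AND gates needed to AND num_inputs signals."""
    if num_inputs < 1:
        raise ValueError("num_inputs must be at least 1")
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2")
    depth = 0
    remaining = num_inputs
    while remaining > 1:
        # Each level groups the remaining signals into gates of fan_in inputs.
        remaining = -(-remaining // fan_in)  # ceiling division
        depth += 1
    return depth


# Input widths of the four parallel AND gates in the PD, as described in
# the text: one two-input gate and three six-input gates.
PD_GATE_INPUTS = (2, 6, 6, 6)

# The gates operate in parallel, so the PD delay is set by the deepest tree.
pd_depth = max(and_tree_depth(n) for n in PD_GATE_INPUTS)

if __name__ == "__main__":
    print(and_tree_depth(6))  # 3 levels: 6 -> 3 -> 2 -> 1 signals
    print(pd_depth)           # 3
```

Because the four gates work in parallel, the two-input gate (one level) never limits the PD; only the six-input trees do.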
In fact, the true challenge of applying the BHU to a high clock rate system lies in instruction buffer (IB) refilling. Pipelines in such systems tend to be deeper than the conventional five stages, with each task divided into finer-grained subtasks spread over more pipeline stages. In this case, the assumption that an instruction can be fetched from the instruction cache in one cycle may no longer hold, so the number of cycles required to complete an IB refill may increase as well. Since every cycle needed to complete an IB refill leaves one instruction that the BHU cannot handle, the refill latency directly affects our design.
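The relation between refill latency and unhandleable instructions can be made concrete with a rough back-of-the-envelope model. The sketch below is not the simulation used in this thesis. It assumes a single-issue pipeline and straight-line code, so that a cache line change happens exactly once per line and taken branches that jump to a new line are ignored; the function names and the choice of eight instructions per line are hypothetical.

```python
def unhandleable_per_line_change(refill_cycles: int) -> int:
    """Instructions the BHU misses after one cache line change.

    Follows the statement that each cycle of an IB refill leaves one
    unhandleable instruction (single-issue assumption).
    """
    if refill_cycles < 0:
        raise ValueError("refill_cycles must be non-negative")
    return refill_cycles


def unhandleable_fraction(refill_cycles: int,
                          instructions_per_line: int) -> float:
    """Fraction of instructions missed under the straight-line assumption.

    Assumes one line change per instructions_per_line instructions and
    ignores line changes caused by taken branches, so real code will
    generally behave differently.
    """
    if instructions_per_line < 1:
        raise ValueError("instructions_per_line must be at least 1")
    missed = min(unhandleable_per_line_change(refill_cycles),
                 instructions_per_line)
    return missed / instructions_per_line


if __name__ == "__main__":
    # Hypothetical line size: 8 instructions per cache line.
    for cycles in (1, 2, 3):
        print(cycles, unhandleable_fraction(cycles, 8))
    # 1 0.125
    # 2 0.25
    # 3 0.375
```

Even in this simplified setting the share of instructions that bypass the BHU grows linearly with the refill latency, which is why the latency is worth treating as an experimental parameter.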
The number of cycles required to access the instruction cache varies from processor to processor. Most well-known state-of-the-art architectures take no more than three cycles to access the instruction cache. For example, the Intel Itanium processor, clocked at around 1 GHz, needs only one cycle for an instruction cache access, while ARM9-based processors often require two to three cycles to fetch an instruction. We treat this latency as an experimental parameter, simulating different settings to determine the overall applicability of the BHU design in modern pipelined systems. The results are presented in the next chapter.
3.5.3 Summary of BHU Adaption
To sum up, the BHU faces two major challenges when adapting to sophisticated pipelines. The issue width problem can be handled by brute force, by providing more hardware resources to the BHU. The high clock rate problem is trickier because it involves instruction buffer refilling, which is the most critical weakness of the BHU design. Once the number of cycles required for an instruction cache access is taken into account, the problem becomes one of how many instructions are unhandleable after an instruction cache line change. The Reduced BTB (RBTB) serves as the counterpart that completes the BHU's functionality as originally intended, and in systems where unhandleable cases increase, a heavier workload is expected for the RBTB.