Reducing Branch Costs with Dynamic Hardware Prediction


3.4 Reducing Branch Costs with Dynamic Hardware Prediction

pipeline, the type of predictor, and the strategies used for recovering from misprediction.

Basic Branch Prediction and Branch-Prediction Buffers

The simplest dynamic branch-prediction scheme is a branch-prediction buffer or branch history table. A branch-prediction buffer is a small memory indexed by the lower portion of the address of the branch instruction. The memory contains a bit that says whether the branch was recently taken or not. This scheme is the simplest sort of buffer; it has no tags and is useful only to reduce the branch delay when it is longer than the time to compute the possible target PCs. We don’t know, in fact, if the prediction is correct: it may have been put there by another branch that has the same low-order address bits. But this doesn’t matter. The prediction is a hint that is assumed to be correct, and fetching begins in the predicted direction. If the hint turns out to be wrong, the prediction bit is inverted and stored back. Of course, this buffer is effectively a cache where every access is a hit, and, as we will see, the performance of the buffer depends on both how often the prediction is for the branch of interest and how accurate the prediction is when it matches. Before we analyze the performance, it is useful to make a small, but important, improvement in the accuracy of the branch-prediction scheme.
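The tagless buffer described above can be sketched in a few lines of Python. This is only an illustration: the class name `OneBitBuffer`, the 16-entry size, and the assumption of 4-byte instructions (so the two lowest address bits are dropped before indexing) are ours, not part of any particular machine.

```python
class OneBitBuffer:
    """Tagless one-bit branch-prediction buffer (illustrative sketch)."""

    def __init__(self, entries=16):
        # A power-of-two size lets a mask select the low-order index bits.
        assert entries > 0 and entries & (entries - 1) == 0
        self.mask = entries - 1
        self.bits = [False] * entries  # False means "predict not taken"

    def _index(self, branch_addr):
        # Assumption: 4-byte instructions, so skip the 2 byte-offset bits.
        return (branch_addr >> 2) & self.mask

    def predict(self, branch_addr):
        # There are no tags, so every lookup "hits", possibly on another
        # branch's bit.
        return self.bits[self._index(branch_addr)]

    def update(self, branch_addr, taken):
        # Storing the actual outcome has the same effect as inverting the
        # bit whenever the prediction was wrong.
        self.bits[self._index(branch_addr)] = taken


buf = OneBitBuffer(entries=16)
buf.update(0x1000, True)         # the branch at 0x1000 was taken
aliased = buf.predict(0x1040)    # a different branch with the same index
print(aliased)
```

Because the buffer has no tags, the branch at 0x1040 simply inherits the bit written by the branch at 0x1000; the prediction is still used as a hint.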

This simple one-bit prediction scheme has a performance shortcoming: Even if a branch is almost always taken, we will likely predict incorrectly twice, rather than once, when it is not taken. The following example shows this.

EXAMPLE Consider a loop branch that is taken nine times in a row, then not taken once. What is the prediction accuracy for this branch, assuming the prediction bit for this branch remains in the prediction buffer?

ANSWER The steady-state prediction behavior will mispredict on the first and last loop iterations. Mispredicting the last iteration is inevitable since the prediction bit will say taken (the branch has been taken nine times in a row at that point). The misprediction on the first iteration happens because the bit is flipped on prior execution of the last iteration of the loop, since the branch was not taken on that iteration. Thus, the prediction accuracy for this branch that is taken 90% of the time is only 80% (two incorrect predictions and eight correct ones). In general, for branches used to form loops (a branch is taken many times in a row and then not taken once), a one-bit predictor will mispredict at twice the rate that the branch is not taken. It seems that we should expect that the accuracy of the predictor would at least match the taken branch frequency for these highly regular branches.
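The 80% figure in the answer can be checked with a small simulation. The sketch below is a minimal model of a single one-bit entry; the helper name `run_one_bit` is hypothetical, and the first pass through the loop serves only to reach steady state.

```python
def run_one_bit(outcomes, bit=False):
    """Run a one-bit predictor over a list of outcomes (True = taken).

    Returns the number of correct predictions and the final prediction bit.
    """
    correct = 0
    for taken in outcomes:
        if bit == taken:
            correct += 1
        bit = taken  # the bit always records the most recent outcome
    return correct, bit


loop = [True] * 9 + [False]           # taken nine times, then not taken once
_, bit = run_one_bit(loop)            # warm-up pass to reach steady state
correct, _ = run_one_bit(loop, bit)   # measured pass
accuracy = correct / len(loop)
print(accuracy)
```

In the measured pass the predictor misses on the first iteration (the bit still says not taken from the previous loop exit) and on the last one, which gives eight correct predictions out of ten.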

To remedy this, two-bit prediction schemes are often used. In a two-bit scheme, a prediction must miss twice before it is changed. Figure 3.7 shows the finite-state machine for a two-bit prediction scheme.

The two-bit scheme is actually a specialization of a more general scheme that has an n-bit saturating counter for each entry in the prediction buffer. With an n-bit counter, the counter can take on values between 0 and 2^n - 1: when the counter is greater than or equal to 2^(n-1), roughly one half of its maximum value, the branch is predicted as taken; otherwise, it is predicted untaken. As in the two-bit scheme, the counter is incremented on a taken branch and decremented on an untaken branch. Studies of n-bit predictors have shown that the two-bit predictors do almost as well, and thus most systems rely on two-bit branch predictors rather than the more general n-bit predictors.
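A minimal sketch of the n-bit saturating counter, applied to the loop branch from the earlier example, makes the threshold rule concrete. The class and function names are hypothetical, and the counter models a single buffer entry with no aliasing.

```python
class SaturatingCounter:
    """n-bit saturating counter for one prediction-buffer entry (sketch)."""

    def __init__(self, n_bits=2, value=0):
        self.maximum = (1 << n_bits) - 1     # largest value, 2^n - 1
        self.threshold = 1 << (n_bits - 1)   # predict taken at 2^(n-1) or more
        self.value = value

    def predict(self):
        return self.value >= self.threshold

    def update(self, taken):
        # Increment on taken, decrement on not taken, saturating at both ends.
        if taken:
            self.value = min(self.value + 1, self.maximum)
        else:
            self.value = max(self.value - 1, 0)


def accuracy_after_warmup(counter, outcomes):
    """Train on one pass of outcomes, then measure accuracy on a second pass."""
    for taken in outcomes:
        counter.update(taken)
    correct = 0
    for taken in outcomes:
        if counter.predict() == taken:
            correct += 1
        counter.update(taken)
    return correct / len(outcomes)


loop = [True] * 9 + [False]   # taken nine times, then not taken once
two_bit = accuracy_after_warmup(SaturatingCounter(n_bits=2), loop)
one_bit = accuracy_after_warmup(SaturatingCounter(n_bits=1), loop)
print(two_bit, one_bit)
```

With two bits, the single not-taken outcome at the loop exit only moves the counter from 3 to 2, so the next execution of the loop still starts with a taken prediction and only the exit is mispredicted. With n = 1 the counter degenerates to the one-bit scheme.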

A branch-prediction buffer can be implemented as a small, special “cache” accessed with the instruction address during the IF pipe stage, or as a pair of bits attached to each block in the instruction cache and fetched with the instruction. If the instruction is decoded as a branch and if the branch is predicted as taken, fetching begins from the target as soon as the PC is known. Otherwise, sequential fetching and executing continue. If the prediction turns out to be wrong, the prediction bits are changed as shown in Figure 3.7.

FIGURE 3.7 The states in a two-bit prediction scheme. By using two bits rather than one, a branch that strongly favors taken or not taken—as many branches do—will be mispredicted less often than with a one-bit predictor. The two bits are used to encode the four states in the system. In a counter implementation, the counters are incremented when a branch is taken and decremented when it is not taken; the counters saturate at 00 or 11. One complication of the two-bit scheme is that it updates the prediction bits more often than a one-bit predictor, which only updates the prediction bit on a mispredict. Since we typically read the prediction bits on every cycle, a two-bit predictor will typically need both a read and a write access port.

[State diagram omitted: four states, Predict taken (11), Predict taken (10), Predict not taken (01), and Predict not taken (00), connected by transitions labeled Taken and Not taken.]

Although this scheme is useful for most pipelines, the five-stage, classic pipeline finds out both whether the branch is taken and what the target of the branch is at roughly the same time, assuming no hazard in accessing the register specified in the conditional branch. (Remember that this is true for the five-stage pipeline because the branch does a compare of a register against zero during the ID stage, which is when the effective address is also computed.) Thus, this scheme does not help for the five-stage pipeline; we will explore a scheme that can work for such pipelines, and for machines issuing multiple instructions per clock, a little later. First, let’s see how well branch prediction works in general.

What kind of accuracy can be expected from a branch-prediction buffer using two bits per entry on real applications? For the SPEC89 benchmarks a branch-prediction buffer with 4096 entries results in a branch-prediction accuracy ranging from over 99% to 82%, or a misprediction rate of 1% to 18%, as shown in Figure 3.8.

To show the differences more clearly, we plot misprediction frequency rather than prediction frequency. A 4K-entry buffer, like that used for these results, is considered large; smaller buffers would have worse results.

FIGURE 3.8 Prediction accuracy of a 4096-entry two-bit prediction buffer for the SPEC89 benchmarks. The misprediction rate for the integer benchmarks (gcc, espresso, eqntott, and li) is substantially higher (average of 11%) than that for the FP programs (average of 4%). Even omitting the FP kernels (nasa7, matrix300, and tomcatv) still yields a higher accuracy for the FP benchmarks than for the integer benchmarks. These data, as well as the rest of the data in this section, are taken from a branch prediction study done using the IBM Power architecture and optimized code for that system. See Pan et al. [1992].

[Bar chart omitted: frequency of mispredictions (0% to 18%) for each SPEC89 benchmark.]

Knowing just the prediction accuracy, as shown in Figure 3.8, is not enough to determine the performance impact of branches, even given the branch costs and penalties for misprediction. We also need to take into account the branch frequency, since the importance of accurate prediction is larger in programs with higher branch frequency. For example, the integer programs (li, eqntott, espresso, and gcc) have higher branch frequencies than those of the more easily predicted FP programs.
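To see why branch frequency matters, consider a rough estimate of the stall cycles per instruction caused by mispredictions: branch frequency times misprediction rate times misprediction penalty. The sketch below uses the average misprediction rates quoted for Figure 3.8 (11% and 4%); the branch frequencies (20% and 5%) and the 3-cycle penalty are hypothetical values chosen only for illustration, and the model ignores any cost of correctly predicted branches.

```python
def misprediction_stall_cpi(branch_frequency, misprediction_rate, penalty_cycles):
    """Stall cycles per instruction due to mispredicted branches (simplified)."""
    return branch_frequency * misprediction_rate * penalty_cycles


# Hypothetical integer-like program: frequent, harder-to-predict branches.
integer_like = misprediction_stall_cpi(0.20, 0.11, 3)
# Hypothetical FP-like program: fewer, easier-to-predict branches.
fp_like = misprediction_stall_cpi(0.05, 0.04, 3)

print(integer_like, fp_like)
```

Under these assumed numbers the integer-like program loses roughly ten times as many cycles per instruction to mispredictions, even though its misprediction rate is less than three times higher.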

As we try to exploit more ILP, the accuracy of our branch prediction becomes critical. As we can see in Figure 3.8, the accuracy of the predictors for integer programs, which typically also have higher branch frequencies, is lower than for the loop-intensive scientific programs. We can attack this problem in two ways: by increasing the size of the buffer and by increasing the accuracy of the scheme we use for each prediction. A buffer with 4K entries is already large and, as Figure 3.9 shows, performs quite comparably to an infinite buffer. The data in Figure 3.9 make it clear that the hit rate of the buffer is not the limiting factor. As we mentioned above, simply increasing the number of bits per predictor without changing the predictor structure also has little impact. Instead, we need to look at how we might increase the accuracy of each predictor.

Correlating Branch Predictors

These two-bit predictor schemes use only the recent behavior of a single branch to predict the future behavior of that branch. It may be possible to improve the prediction accuracy if we also look at the recent behavior of other branches rather than just the branch we are trying to predict. Consider a small code fragment from the SPEC92 benchmark eqntott (the worst case for the two-bit predictor):

    if (aa==2) aa=0;
    if (bb==2) bb=0;
    if (aa!=bb) {

Here is the MIPS code that we would typically generate for this code fragment assuming that aa and bb are assigned to registers R1 and R2:

        DSUBUI R3,R1,#2
        BNEZ   R3,L1      ;branch b1 (aa!=2)
        DADD   R1,R0,R0   ;aa=0
L1:     DSUBUI R3,R2,#2
        BNEZ   R3,L2      ;branch b2 (bb!=2)
        DADD   R2,R0,R0   ;bb=0
L2:     DSUBU  R3,R1,R2   ;R3=aa-bb
        BEQZ   R3,L3      ;branch b3 (aa==bb)

Let’s label these branches b1, b2, and b3. The key observation is that the behavior of branch b3 is correlated with the behavior of branches b1 and b2. Clearly, if branches b1 and b2 are both not taken (i.e., the if conditions both evaluate to true and aa and bb are both assigned 0), then b3 will be taken, since aa and bb are clearly equal. A predictor that uses only the behavior of a single branch to predict the outcome of that branch can never capture this behavior.
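The correlation among b1, b2, and b3 can be checked by enumerating a few input values. The sketch below models the three branches directly from the MIPS comments (b1 taken when aa != 2, b2 taken when bb != 2, b3 taken when aa == bb); the function name and the choice of test values 0 through 3 are ours.

```python
def branch_outcomes(aa, bb):
    """Return (b1, b2, b3) outcomes for the eqntott fragment; True = taken."""
    b1 = aa != 2          # BNEZ R3,L1 is taken when aa != 2
    if not b1:
        aa = 0            # fall through: aa = 0
    b2 = bb != 2          # BNEZ R3,L2 is taken when bb != 2
    if not b2:
        bb = 0            # fall through: bb = 0
    b3 = aa == bb         # BEQZ R3,L3 is taken when aa == bb
    return b1, b2, b3


results = [branch_outcomes(aa, bb) for aa in range(4) for bb in range(4)]

# Outcomes of b3 when b1 and b2 were both not taken.
b3_given_both_not_taken = {b3 for b1, b2, b3 in results if not b1 and not b2}
# Outcomes of b3 in every other case.
b3_otherwise = {b3 for b1, b2, b3 in results if b1 or b2}

print(b3_given_both_not_taken, b3_otherwise)
```

When both earlier branches fall through, b3 is always taken; in the other cases it goes either way, so a predictor that looks only at the history of b3 has nothing to distinguish the two situations.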

FIGURE 3.9 Prediction accuracy of a 4096-entry two-bit prediction buffer versus an infinite buffer for the SPEC89 benchmarks.

[Bar chart omitted: frequency of mispredictions (0% to 18%) for each SPEC89 benchmark, with one bar for 4096 entries at 2 bits per entry and one for unlimited entries at 2 bits per entry.]

Branch predictors that use the behavior of other branches to make a prediction are called correlating predictors or two-level predictors. To see how such predictors work, let’s choose a simple hypothetical case. Consider the following simplified code fragment (chosen for illustrative purposes):

    if (d==0) d=1;
    if (d==1)

Here is the typical code sequence generated for this fragment, assuming that d is assigned to R1:

The branches corresponding to the two if statements are labeled b1 and b2. The possible sequences for an execution of this fragment, assuming d has values 0, 1, and 2, are shown in Figure 3.10. To illustrate how a correlating predictor works, assume the sequence above is executed repeatedly and ignore other branches in the program (including any branch needed to cause the above sequence to repeat).

From Figure 3.10, we see that if b1 is not taken, then b2 will be not taken. A correlating predictor can take advantage of this, but our standard predictor cannot. Rather than consider all possible branch paths, consider a sequence where d alternates between 2 and 0. A one-bit predictor initialized to not taken has the behavior shown in Figure 3.11. As the figure shows, all the branches are mispredicted!

Initial value of d   d==0?   b1          Value of d before b2   d==1?   b2
0                    yes     not taken   1                      yes     not taken
1                    no      taken       1                      yes     not taken
2                    no      taken       2                      no      taken

FIGURE 3.10 Possible execution sequences for a code fragment.
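The rows of Figure 3.10 can be reproduced with a short sketch. The branch polarities are read off the figure itself (b1 is taken when d != 0, b2 is taken when d != 1); the function name `fragment` is hypothetical.

```python
def fragment(d):
    """Return (b1, value of d before b2, b2) for one execution; True = taken."""
    b1 = d != 0       # b1 is taken when d != 0, skipping the assignment
    if not b1:
        d = 1         # fall through: d = 1
    b2 = d != 1       # b2 is taken when d != 1
    return b1, d, b2


rows = {d: fragment(d) for d in (0, 1, 2)}
for d, (b1, d_before_b2, b2) in rows.items():
    print(d, "taken" if b1 else "not taken", d_before_b2,
          "taken" if b2 else "not taken")

# Whenever b1 is not taken, b2 is not taken either.
b1_not_taken_implies_b2_not_taken = all(
    not b2 for b1, _, b2 in rows.values() if not b1
)
```

The final check confirms the implication discussed in the text: a not-taken b1 forces d to 1, so b2 must also be not taken.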


FIGURE 3.11 Behavior of a one-bit predictor initialized to not taken. T stands for taken, NT for not taken.
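The behavior summarized by Figure 3.11 can be reproduced with a small simulation in which each branch has its own one-bit predictor, initialized to not taken, and d alternates between 2 and 0. This is a sketch under those assumptions; the function names are hypothetical and aliasing between the two branches is ignored.

```python
def outcomes(d):
    """Return (b1, b2) for one execution of the fragment; True = taken."""
    b1 = d != 0
    if not b1:
        d = 1
    b2 = d != 1
    return b1, b2


def count_correct(d_values):
    """Count correct predictions with one one-bit predictor per branch."""
    bits = {"b1": False, "b2": False}   # both initialized to not taken
    correct = 0
    total = 0
    for d in d_values:
        b1, b2 = outcomes(d)
        for name, taken in (("b1", b1), ("b2", b2)):
            if bits[name] == taken:
                correct += 1
            bits[name] = taken          # one-bit predictor: remember last outcome
            total += 1
    return correct, total


correct, total = count_correct([2, 0] * 4)   # d alternates 2, 0, 2, 0, ...
print(correct, total)
```

Because both branches flip direction on every execution, each one-bit predictor always holds the opposite of the next outcome, so none of the predictions is correct.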
