CHAP 6 REDUCING BTB ENTRY SIZE THROUGH I-CACHE BASED BTB
6.2 DESIGN FOR I-CACHE BASED BTB
6.2.2 ICBTB management
Figure 6-1: ICBTB and its integration with a typical pipeline front-end
Figure 6-1 shows an ICBTB and its integration with a typical pipeline front-end, which is composed of instruction fetch (IF), instruction decode (ID), and execution (EX) stages. During the IF stage, the ICBTB lookup and the instruction fetch are performed simultaneously. The dynamic branch predictor (not shown in Figure 6-1), which is either integrated in the BTB entry or implemented separately, predicts the branch direction at the same time. If the ICBTB lookup is a hit, the fetched instruction is a branch. If, in addition, the predicted direction is taken, the output of the ICBTB, which is the target address of the branch, is used as the next PC. Otherwise, the sequential address is used.
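The next-PC selection described above can be sketched as follows. This is only an illustrative model: the function name `next_pc` and the fixed 4-byte instruction size are assumptions made for the example, not details taken from the design.

```python
def next_pc(pc, icbtb_hit, predicted_taken, icbtb_target, instr_bytes=4):
    """Select the next fetch address at the end of the IF stage.

    instr_bytes=4 is an assumed fixed instruction size (not from the text).
    """
    # A hit means the fetched instruction is a recorded branch; its target
    # is used only when the direction predictor says "taken".
    if icbtb_hit and predicted_taken:
        return icbtb_target
    # ICBTB miss, or predicted not taken: continue sequentially.
    return pc + instr_bytes
```

For instance, `next_pc(0x100, True, True, 0x200)` selects the recorded target, while a miss or a not-taken prediction selects `0x104`.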
As shown in Figure 6-1, n ICBTB entries are attached to each set of the I-cache (n-grouping ICBTB). In other words, the n ICBTB entries can be used by all branches of the corresponding I-cache set. This implies that the optimal value of n for low power consumption is highly dependent on the I-cache configuration. We determine n through simulations. In addition, instead of recording the full tag, each ICBTB entry uses an index field to indicate which instruction in the corresponding I-cache set is using the entry. The index field is composed of a way index (w_idx) and a line offset, where the way index identifies the cache line (way) within the set and the line offset indicates the position of the instruction within the line. For a typical i-way set-associative I-cache with x instructions per cache line, the width of the index field is log2(x) + log2(i) bits.
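As a quick check of the index field width, the formula log2(x) + log2(i) can be evaluated with a small helper. The function name `index_field_bits` is hypothetical, and the sketch assumes that both the associativity and the number of instructions per line are powers of two.

```python
def index_field_bits(ways, instrs_per_line):
    """Width of the ICBTB index field: log2(instrs_per_line) + log2(ways).

    Assumes both arguments are powers of two (an assumption of this sketch).
    """
    for value in (ways, instrs_per_line):
        if value < 1 or value & (value - 1):
            raise ValueError("expected a power of two")
    # For a power of two v, (v - 1).bit_length() equals log2(v).
    line_offset_bits = (instrs_per_line - 1).bit_length()
    way_index_bits = (ways - 1).bit_length()
    return line_offset_bits + way_index_bits
```

Under these assumptions, a 2-way I-cache with 8 instructions per line needs 3 + 1 = 4 index bits per ICBTB entry in place of a full tag.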
During the EX stage, both the target address and the direction of a branch are available. If the branch is taken and its target address has not been recorded in the ICBTB, the target address, way index, and line offset are recorded in the ICBTB. The line offset is extracted directly from the address of the branch, whereas the way index is derived from the I-cache circuit. If the way index of an instruction were discarded after the I-cache access, it would no longer be available when the instruction (assumed to be a taken branch) enters the ICBTB. Therefore, the way index of each instruction is preserved in the instruction pipeline until the EX stage.
Figure 6-2 shows the block diagram of a 2-grouping ICBTB attached to a 2-way I-cache. In this case, the 2-grouping ICBTB is composed of two BTB arrays, and the number of BTB entries in each BTB array is equal to the number of I-cache sets. The PC is divided into three parts to access the I-cache and the ICBTB simultaneously: the index selects the set, the tag determines whether there is a match in the I-cache, and the line offset extracts the current instruction from the accessed cache lines. Note that this block diagram is a functional description, not a physical implementation.
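The three-way split of the PC can be illustrated with a short sketch. The function name `split_pc`, the 4-byte instruction size, and the cache geometry used in the example call are assumptions chosen for illustration only.

```python
def split_pc(pc, num_sets, instrs_per_line, instr_bytes=4):
    """Split a PC into (tag, index, line_offset) for I-cache/ICBTB access.

    instr_bytes, num_sets and instrs_per_line are illustrative parameters.
    """
    word = pc // instr_bytes              # drop the byte-within-instruction bits
    line_offset = word % instrs_per_line  # instruction position within the line
    line_addr = word // instrs_per_line
    index = line_addr % num_sets          # selects the I-cache set and ICBTB entries
    tag = line_addr // num_sets           # compared against the I-cache tags
    return tag, index, line_offset
```

With 64 sets and 8 instructions per line, `split_pc(0x1234, 64, 8)` yields tag 2, index 17, and line offset 5.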
Figure 6-2: 2-grouping ICBTB attached to a 2-way I-cache
As shown in Figure 6-2, circles indicate the micro-operations of an ICBTB and I-cache access. These operations are described as follows:
• RAM accesses: The index part of the PC is sent to the I-cache RAMs and the ICBTB RAMs. Circles 1a and 1b indicate the RAM accesses of the I-cache and the ICBTB, respectively.
• Comparisons: The comparators perform the match operations. Circle 2a indicates the tag comparisons of each I-cache way. The way index is then generated by encoding the comparison results. Circle 2b indicates the way index comparisons and line offset comparisons of each ICBTB array. Whether the access to each ICBTB array is a hit is then determined by ANDing the comparison results and the valid bit.
• Data output: Based on the comparison results, the output instruction (circle 3b) and target address (circle 3d) are selected, and the I-cache hit signal (circle 3a) and ICBTB hit signal (circle 3c) are then produced.
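The micro-operations for one accessed set can be modelled functionally as below. This is a sketch under stated assumptions, not the circuit itself: the dictionary layout of an entry, the function names, and the sequential loop (hardware would compare all arrays in parallel) are choices made for the example. It also assumes that an ICBTB array hits only when its valid bit is set and both the way index and the line offset match.

```python
def encode_way(icache_tags, pc_tag):
    """Circle 2a: compare the PC tag with each way's tag and encode the match."""
    for way, tag in enumerate(icache_tags):
        if tag is not None and tag == pc_tag:
            return way
    return None  # I-cache miss


def icbtb_lookup(icache_tags, icbtb_entries, pc_tag, line_offset):
    """Return (icache_hit, icbtb_hit, target) for one accessed set.

    icache_tags:   one tag per way of the set (None if the way is empty).
    icbtb_entries: one entry per ICBTB array for this set (assumed layout).
    """
    w_idx = encode_way(icache_tags, pc_tag)
    if w_idx is None:
        return False, False, None
    for entry in icbtb_entries:
        # Circle 2b: valid bit AND way-index match AND line-offset match.
        if (entry["valid"] and entry["w_idx"] == w_idx
                and entry["line_offset"] == line_offset):
            return True, True, entry["target"]  # circles 3c and 3d
    return True, False, None
```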
When an I-cache line is replaced, the corresponding ICBTB entries are also invalidated. If a taken branch attempts to enter ICBTB and all the attached BTB entries are used, the LRU replacement operation is performed.
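A minimal behavioural sketch of these management rules is given below. The class and method names are hypothetical, and the timestamp-based LRU bookkeeping is an assumption of the sketch; the text only states that LRU replacement is used, not how it is tracked.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    valid: bool = False
    w_idx: int = 0
    line_offset: int = 0
    target: int = 0
    last_used: int = 0  # assumed LRU timestamp, not part of the described entry


class ICBTB:
    def __init__(self, num_sets, n):
        # n entries are attached to each I-cache set (n-grouping ICBTB).
        self.sets = [[Entry() for _ in range(n)] for _ in range(num_sets)]
        self.clock = 0

    def _tick(self):
        self.clock += 1
        return self.clock

    def _find(self, set_idx, w_idx, line_offset):
        for e in self.sets[set_idx]:
            if e.valid and e.w_idx == w_idx and e.line_offset == line_offset:
                return e
        return None

    def lookup(self, set_idx, w_idx, line_offset):
        e = self._find(set_idx, w_idx, line_offset)
        if e is None:
            return None
        e.last_used = self._tick()
        return e.target

    def record_taken_branch(self, set_idx, w_idx, line_offset, target):
        """EX stage: record a taken branch, evicting the LRU entry if needed."""
        e = self._find(set_idx, w_idx, line_offset)
        if e is None:
            entries = self.sets[set_idx]
            # Prefer a free entry; otherwise replace the least recently used.
            e = next((x for x in entries if not x.valid), None)
            if e is None:
                e = min(entries, key=lambda x: x.last_used)
            e.valid, e.w_idx, e.line_offset = True, w_idx, line_offset
        e.target = target
        e.last_used = self._tick()

    def invalidate_line(self, set_idx, w_idx):
        """I-cache line replacement: drop the entries that refer to that way."""
        for e in self.sets[set_idx]:
            if e.w_idx == w_idx:
                e.valid = False
```

In this model, recording a third taken branch in a set with n = 2 evicts whichever of the two entries was touched least recently, and replacing the cache line in a given way clears every entry whose `w_idx` points to it.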
To illustrate the ICBTB more clearly, we trace the operations in a 2-way I-cache and its attached 2-grouping ICBTB using a code example. Figure 6-3 shows the execution paths for two code segments and the corresponding operations in the I-cache and the ICBTB during the execution of these two code segments. The first code segment (I0~I15) is executed first, followed by the second segment (J0~J15). There are four branch instructions: I7, J2, J7, and J15, and their targets are the addresses of I11, J5, J1, and J0, respectively. The stamps from t0 to t8 indicate the time order, and the arrows between the time stamps represent the execution paths: the arrow between t0 and t1 indicates the execution path on the first code segment, and all the remaining arrows are for the second code segment. The operations in each time interval are as follows:
• t0->t1: Initially, both the I-cache and the ICBTB are empty. The first code segment is executed completely at t1, the corresponding code blocks are allocated in way 0 of the I-cache, and the first entry of array 0 in the ICBTB is assigned to I7.
• t2->t3: The program continues from J0 to J7. When J0 is fetched, the code block J0 to J7 is allocated in way 1 of the I-cache. After branch J7 is executed at t3, the first entry of array 1 in the ICBTB is assigned to J7.
• t4->t6: The program continues from J1 to J15. When J8 is fetched, the code block J8 to J15 is allocated in way 1 of the I-cache. After branch J15 is executed at t6, the second entry of array 0 in the ICBTB is assigned to J15.
• t6->t7: The program continues in the order of J0, J1, J2, J5, J6, and J7. Since branch J2 is taken and all the ICBTB entries are full, one of the entries should be replaced. In this example, the first entry of array 0 in the ICBTB is discarded.
• t7->t8: The program continues in the order of J1, J2, J5, J6, and J7. Both ICBTB lookups for branches J2 and J7 are hits at this interval.
Figure 6-3: Code examples to illustrate ICBTB
[Figure labels: I-cache way 0, ICBTB array 0, ICBTB array 1. ICBTB entry fields: valid, w_idx, line offset, target address.]
6.3 Experiments
The objective of the experiments is to evaluate the impact of the proposed ICBTB on energy, performance, and area. The energy model used in this design is the same as that described in Chap 2-3. Therefore, we report only the results of the related metrics for the various BTBs here.
6.3.1 Results and analysis
We now examine the interaction among BTB configuration, performance, energy, and storage characteristics for SPEC 2000. In the descriptions below, the term “average” means the arithmetic mean of a metric across the simulated benchmark programs. Except for BTB accuracy and IPC, all metrics reported in this section are normalized with respect to those of a 256-entry 4-way conventional BTB. The term “Base” means the conventional BTB, “Johnson” means Johnson’s BTB, and “ICBTB_k” means the k-grouping ICBTB.
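The normalization and averaging convention can be expressed as a small helper. This sketch assumes that each benchmark's value is first normalized to the baseline and the normalized values are then averaged; the text does not state the order explicitly. The function name and the benchmark values in the usage note are hypothetical and are not measured results.

```python
from statistics import mean


def normalized_average(metric, baseline):
    """Arithmetic mean of per-benchmark values normalized to the baseline.

    metric and baseline map benchmark name -> raw value (assumed layout).
    """
    return mean(metric[name] / baseline[name] for name in baseline)
```

For example, with made-up values `{"a": 1.0, "b": 3.0}` against a baseline of `{"a": 2.0, "b": 4.0}`, the normalized values are 0.5 and 0.75, and their average is 0.625.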
6.3.1.1 BTB accuracy and IPC
Figure 6-4: NPC accuracy comparisons
[Figure: y-axis NPC accuracy; x-axis bzip2, crafty, gap, gcc, gzip, eon, mcf, parser, perlbmk, twolf, vpr, avg; series Base, Johnson, ICBTB_1, ICBTB_2, ICBTB_4.]
Figure 6-5: IPC comparisons
[Figure: y-axis IPC; x-axis bzip2, crafty, gap, gcc, gzip, eon, mcf, parser, perlbmk, twolf, vpr, avg; series Base, Johnson, ICBTB_1, ICBTB_2, ICBTB_4.]
Figure 6-4 presents the average NPC accuracy, and Figure 6-5 presents the corresponding IPC. The important observations and analyses of the simulation results are listed as follows:
• Johnson’s BTB has 512 entries, while the baseline BTB has only 256 entries. However, Johnson’s BTB achieves relatively low BTB accuracy and low IPC because of contention for the BTB entry of each I-cache line.
• ICBTB_2, which has the same number of entries as the baseline BTB, significantly improves NPC accuracy and IPC relative to Johnson’s BTB. ICBTB_2 achieves NPC accuracy and IPC similar to those of the baseline BTB, with only 2.6% performance degradation compared with the baseline.
6.3.1.2 Energy results
Figure 6-6 shows the BTB dynamic energy, Figure 6-7 shows the BTB leakage energy, and Figure 6-8 shows EBTB&LP. The important observations and analyses of the simulation results are listed as follows:
• Compared with the baseline BTB, Johnson’s BTB has relatively high BTB dynamic and leakage energy because it has more entries.
• ICBTB_2 saves 55.44% and 34.19% of BTB dynamic and leakage energy, respectively. With the overhead energy (Estall) included, ICBTB_2 saves 16.54% of BTB energy, with only 2.6% performance degradation.
Figure 6-6: BTB dynamic energy comparisons
[Figure: y-axis BTB dynamic energy; x-axis bzip2, crafty, gap, gcc, gzip, eon, mcf, parser, perlbmk, twolf, vpr, avg; series Base, Johnson, ICBTB_1, ICBTB_2, ICBTB_4.]
Figure 6-7: BTB leakage energy comparisons
[Figure: y-axis BTB leakage energy; x-axis bzip2, crafty, gap, gcc, gzip, eon, mcf, parser, perlbmk, twolf, vpr, avg; series Base, Johnson, ICBTB_1, ICBTB_2, ICBTB_4.]
Figure 6-8: EBTB&LP comparisons
[Figure: y-axis BTB energy with LP; x-axis bzip2, crafty, gap, gcc, gzip, eon, mcf, parser, perlbmk, twolf, vpr, avg; series Base, Johnson, ICBTB_1, ICBTB_2, ICBTB_4.]
6.4 Summary
This chapter addresses the reduction of BTB tag memory cost. By sharing tag memory with the I-cache, both dynamic and leakage power in the BTB are reduced. Since the BTB entries are shared among the branches in each I-cache set, the performance degradation is insignificant. This method outperforms Johnson’s design.