Operations of Two-direction Based NBET

Chapter 3 Proposed Design

3.2 Two-direction Pre-active Policy

3.2.2 Operations of Two-direction Based NBET

Figure 3-3: an example

Here we show how NBET works. Branch instruction 2 (see figure 3-3) is executed after branch instruction 1, so for branch instruction 1 we need to record the branch target buffer (BTB) location of the branch that follows it. Assume that execution of branch instruction 1 finishes in the EXE stage of the processor pipeline. If branch instruction 1 is taken, or if it is already in the branch target buffer, its information needs to be updated into the branch target buffer. At that time we need to record two things: 1) where the entry holding branch instruction 1 is located in the branch target buffer and 2) whether branch instruction 1 is taken or not. Item 1) helps to find the corresponding entry later, and item 2) records whether the next accessed entry lies on the taken or the not-taken path. We use a register called the “Location Register” (LR) to record this information (see figure 3-4).

Figure 3-4: while updating branch instruction 1 into BTB, we use a register to record the necessary information

After branch instruction 2 executes in the EXE stage of the pipeline, if it is taken or is already in the branch target buffer, its information needs to be updated into the branch target buffer. At this point we know where branch instruction 2 is located in the branch target buffer. We can then use the information saved in LR to find the NBET entry corresponding to branch instruction 1 and record in it the branch target buffer location of branch instruction 2 (see figure 3-5).

Figure 3-5: while updating branch instruction 2 into BTB, we can write the location information of branch instruction 2 into the NBET entry corresponding to branch instruction 1.

Figure 3-6 gives an overview of the BTB location collection circuit. Figure 3-7 shows how information is written into the location register. The SET of a branch target buffer entry is obtained from the index part of the PC. An encoder identifies which way the entry is in. We also record whether the current branch is taken or not in the DIR field of the location register. Figure 3-8 shows how the information in the location register is used to find the corresponding NBET entry. A demultiplexer then decides which field of the NBET entry is written.
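The collection step described above can be sketched in software. This is a minimal behavioral model, not the circuit of figures 3-6 to 3-8: the class names `NbetEntry`, `LocationRegister` and `TwoDirectionNbet`, the method `on_btb_update`, and the use of a (set, way) pair as a BTB location are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

TAKEN, NOT_TAKEN = 1, 0
Location = Tuple[int, int]  # (set index, way) of a BTB entry


@dataclass
class NbetEntry:
    # One field per direction; None means no location has been recorded yet.
    next_loc: List[Optional[Location]] = field(default_factory=lambda: [None, None])


@dataclass
class LocationRegister:
    set_index: int  # SET: taken from the index part of the PC
    way: int        # WAY: produced by the encoder
    direction: int  # DIR: TAKEN or NOT_TAKEN


class TwoDirectionNbet:
    """Hypothetical model of the next-BTB-location collection step."""

    def __init__(self, num_sets: int, num_ways: int) -> None:
        self.entries = [[NbetEntry() for _ in range(num_ways)]
                        for _ in range(num_sets)]
        self.lr: Optional[LocationRegister] = None

    def on_btb_update(self, set_index: int, way: int, taken: bool) -> None:
        """Call when a branch is updated into BTB entry (set_index, way)."""
        if self.lr is not None:
            previous = self.entries[self.lr.set_index][self.lr.way]
            # DIR acts as the demultiplexer select: it picks the field to write.
            previous.next_loc[self.lr.direction] = (set_index, way)
        # Remember this branch so the next update can fill in its NBET entry.
        self.lr = LocationRegister(set_index, way, TAKEN if taken else NOT_TAKEN)
```

For example, updating branch instruction 1 at location (1, 0) as taken and then branch instruction 2 at location (3, 1) leaves (3, 1) in the taken field of the NBET entry at (1, 0), while the not-taken field stays empty.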

Figure 3-6: Next BTB location collection circuit

Figure 3-7: how to record NBET location in LR

Figure 3-8: how to find corresponding NBET entry and write

Figure 3-9: Lookup-circuit of two direction based NBET

Figure 3-10: Pre-activation circuit

Figure 3-9 shows the lookup circuit of the two-direction based NBET. The BTB is looked up every cycle for target address prediction. A BTB hit means that the current instruction is a branch instruction, and the outputs of the BTB row decoder are latched in the BTB location register. In the next cycle, the NBET lookup is performed to get the next BTB locations. Note that this operation is performed only when the BTB lookup hit in the previous cycle. After the NBET lookup, the valid entries among the two corresponding NBET entries are latched in the pre-activation registers. Figure 3-10 presents the pre-activation circuit. It decodes the contents of the two pre-activation registers to generate the pre-activation signals for the power mode controller of the drowsy BTB.
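The two-cycle lookup sequence can be illustrated with a small behavioral sketch. The class `LookupModel`, the dictionary representation of the NBET, and the (set, way) location tuples are assumptions made for this example; the real design uses the latches and decoders of figures 3-9 and 3-10.

```python
from typing import Dict, Optional, Set, Tuple

Location = Tuple[int, int]  # (set index, way) of a BTB entry
NbetFields = Tuple[Optional[Location], Optional[Location]]  # (taken, not-taken)


class LookupModel:
    """Hypothetical cycle-level model of NBET lookup and pre-activation."""

    def __init__(self, nbet: Dict[Location, NbetFields]) -> None:
        self.nbet = nbet
        self.btb_location_register: Optional[Location] = None
        self.active: Set[Location] = set()

    def cycle(self, btb_hit: Optional[Location]) -> Set[Location]:
        """Advance one cycle; btb_hit is the hit location, or None on a miss."""
        woken: Set[Location] = set()
        # The NBET lookup runs only if the BTB lookup hit in the previous cycle.
        if self.btb_location_register is not None:
            fields = self.nbet.get(self.btb_location_register, (None, None))
            # Only valid fields generate pre-activation signals.
            woken = {loc for loc in fields if loc is not None}
            self.active |= woken
        # Latch this cycle's BTB hit for use in the next cycle.
        self.btb_location_register = btb_hit
        return woken
```

In this model a hit at cycle n produces pre-activation signals at cycle n + 1, which is the ordering the text describes.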

3.3 One-direction Pre-activation Policy

3.3.1: Why One-direction Pre-activation

Two-direction pre-activation has two drawbacks. The first is that one field is sometimes wasted. For example, an unconditional branch instruction is always taken, and a highly biased branch instruction is almost always taken or almost always not-taken, so the field for the other direction is rarely or never used.

The second drawback is that if we always pre-activate the next possibly accessed BTB locations along both the taken and not-taken paths of a branch, at least one pre-activated BTB entry is unnecessary.

If each NBET entry has only one field to store the BTB location of the next branch instruction, half of the NBET size and of the pre-activation circuits can be saved, and the above problems are avoided. The problem then becomes how to record the most probable next BTB location in a one-field NBET.

Here we introduce a one-direction pre-active policy guided by the branch predictor. Because the branch predictor indicates the direction of a branch instruction, we pre-activate the entry that is likely to be accessed along the path indicated by the branch predictor. Since the predicted direction can be gathered at run time, we need not record the next accessed entries along both directions; we only need to record the next likely accessed entry along the path indicated by the branch predictor.

3.3.2: Design of One-direction Pre-activation

Figure 3-11: NBET architecture of one-direction pre-active policy

Figure 3-11 gives an architecture overview of the one-direction pre-active policy guided by the branch predictor. After execution of branch instruction I, we will encounter branch instruction J or K. The NBET entry corresponding to branch instruction I selectively records where branch instruction J or K is located in the branch target buffer. The selection is decided by the branch predictor. Because the NBET record should be consistent with the predicted direction, the NBET record changes only when the predicted direction changes.

Figure 5 shows the state transition diagram of a typical 2-bit branch predictor.

Initially, a taken branch is placed in the BTB in the weakly-taken (WT) state, and the BTB location of the next branch instruction along the taken path is recorded in NBET. In the following executions of the branch instruction, the NBET contents are updated only when the next predicted direction changes. In a 2-bit branch predictor, the NBET needs to be updated only when the predictor state changes from WT to SNT or from WNT to ST.
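The update rule can be made concrete with a short sketch. The transition table below is an assumption: the state diagram itself is not reproduced in this text, so the table is chosen such that the only transitions that flip the predicted direction are WT to SNT and WNT to ST, as stated above. The names `NEXT_STATE`, `predicts_taken` and `step` are hypothetical.

```python
ST, WT, WNT, SNT = "ST", "WT", "WNT", "SNT"

# Assumed 2-bit predictor: (current state, branch taken?) -> next state.
NEXT_STATE = {
    (ST, True): ST,   (ST, False): WT,
    (WT, True): ST,   (WT, False): SNT,
    (WNT, True): ST,  (WNT, False): SNT,
    (SNT, True): WNT, (SNT, False): SNT,
}


def predicts_taken(state: str) -> bool:
    """ST and WT predict taken; WNT and SNT predict not-taken."""
    return state in (ST, WT)


def step(state: str, taken: bool):
    """Return (next state, whether the one-field NBET entry must be rewritten)."""
    new_state = NEXT_STATE[(state, taken)]
    # The NBET record follows the predicted direction, so it is rewritten
    # only when the prediction flips.
    nbet_update = predicts_taken(state) != predicts_taken(new_state)
    return new_state, nbet_update
```

In both flipping transitions the actual outcome equals the new predicted direction, so the next branch that is observed lies on the newly predicted path and its BTB location is the one to record.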

Figure 3-12: NBET recording with predictor state

3.3.3 Circuit Modification of One-direction based NBET

Figure 3-13: Next BTB location collection circuit of one-direction based NBET

To implement the one-direction pre-active policy, we need to gather two kinds of information at run time: 1) whether the predicted direction changes or not and 2) where the entry holding a branch instruction is located in the branch target buffer (see figure 3-14). While updating a branch instruction into the branch target buffer, we can determine the direction that will be predicted next time from the state of the branch predictor and whether the current execution is taken or not. Because one NBET entry has only one field to record the next likely accessed branch target buffer entry, the NBET size is halved compared with the two-direction pre-active policy. When we look up the branch target buffer and find that the current PC value is a branch instruction, we use the value of the corresponding NBET entry to pre-activate the next likely accessed branch target buffer entry.

Figure 3-14: write information into location register in one-direction pre-active policy

Figure 3-15: writing information into NBET entry in one-direction pre-active policy

The modifications for the NBET lookup and pre-activation circuits are trivial. Therefore, we omit their description here.

3.4 BTB Entry Deactivation

We adopt the decay strategy proposed in [5] to deactivate BTB entries. In this method, a BTB entry is put into drowsy mode if it has not been accessed for a period of time (the decay interval). Implementing decay requires a global counter and a set of local counters. The global counter resets itself after a period of time (the global interval). The local counter attached to each BTB entry resets itself when the corresponding BTB entry is accessed and increments each time the global interval is reached. If a local counter reaches its maximum value, the corresponding BTB and NBET entries are put into drowsy mode. Note that the power modes of the BTB and NBET entries are managed together to further reduce the NBET power overhead.
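The counter scheme can be sketched as follows. This is a behavioral illustration only; the class `DecayController`, its parameter names, and the choice to treat any access as waking the entry are assumptions, not details taken from [5].

```python
class DecayController:
    """Hypothetical model of decay with one global and many local counters."""

    def __init__(self, num_entries: int, global_interval: int, local_max: int) -> None:
        self.global_interval = global_interval  # cycles per global tick
        self.local_max = local_max              # saturation value of local counters
        self.global_counter = 0
        self.local = [0] * num_entries
        self.drowsy = [False] * num_entries

    def access(self, index: int) -> None:
        """An access resets the local counter and leaves the entry active."""
        self.local[index] = 0
        self.drowsy[index] = False

    def tick(self) -> None:
        """Advance one processor cycle."""
        self.global_counter += 1
        if self.global_counter < self.global_interval:
            return
        self.global_counter = 0  # the global counter resets itself
        for i in range(len(self.local)):
            if self.drowsy[i]:
                continue
            if self.local[i] < self.local_max:
                self.local[i] += 1
            if self.local[i] == self.local_max:
                # The BTB entry and its NBET entry go drowsy together.
                self.drowsy[i] = True
```

Under these assumptions an idle entry goes drowsy after at most `global_interval * local_max` cycles, which plays the role of the decay interval.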

Figure 3-16: Gating the deactivation signals for most recently accessed BTB and NBET entry

With accurate pre-activation, the decay interval can be as short as a few hundred cycles, since the energy overhead due to power mode changes is very small.

Unfortunately, when program execution enters a large basic block, the previously accessed NBET entry may be deactivated before the next BTB location is written into it. Therefore, the most recently accessed NBET entry should be prevented from being deactivated. Figure 3-16 shows the implementation circuit, which gates the deactivation signals for the most recently accessed BTB and NBET entry.
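The gating idea can be expressed in a few lines. This is a sketch under the assumption that the decay logic produces one deactivation signal per entry and that the index of the most recently accessed entry is available; the function name `gate_deactivation` is hypothetical.

```python
from typing import List, Optional


def gate_deactivation(decay_signals: List[bool],
                      most_recent_index: Optional[int]) -> List[bool]:
    """Mask the deactivation signal of the most recently accessed entry.

    decay_signals[i] is True when the decay logic wants entry i to go drowsy.
    most_recent_index is None when no entry has been accessed yet.
    """
    return [signal and i != most_recent_index
            for i, signal in enumerate(decay_signals)]
```

With signals `[True, True, False]` and entry 1 as the most recently accessed one, only entry 0 is deactivated, so the NBET entry that still awaits its next-location update stays active.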

3.5 Discussion

There are some situations in which our proposed pre-active policy does not work well, that is, situations in which we access a branch target buffer entry while it is still in drowsy mode. We discuss these situations below.

First, we define perfect pre-activation. If the next branch instruction is already in the branch target buffer, its entry is switched to active mode before it is accessed. If the next branch instruction is not in the branch target buffer, no branch target buffer entry is switched to active mode. A pre-activation that violates these rules is counted as a failure.

The NBET entry corresponding to a branch instruction has no information until the next branch instruction finishes execution. Sometimes the next branch instruction was put into the branch target buffer earlier than the current branch instruction, so the NBET entry of the current branch has no information with which to pre-activate.

Figure 3-17: code segment of a loop

Figure 3-17 is an example of such a situation. When we enter the loop for the first time, we encounter branch instruction 1 first; if it is taken, it is put into the branch target buffer. At the end of the loop, we encounter branch instruction 3. If it is taken, we jump to the label “Start” and encounter branch instruction 1 next. Since this is the first time we execute branch instruction 3, we have no information with which to pre-activate the branch target buffer entry holding branch instruction 1.

Switching from drowsy mode to active mode takes a period of time called the wake-up latency. We assume the wake-up latency is a one-cycle penalty. When we encounter back-to-back branch instructions, even though we have accurate information, we still cannot pre-activate the entry in time.

In our proposed design, we assume that a branch has at most two possible directions and that we encounter the next branch instruction along one of these two paths. An indirect jump instruction, however, breaks this sequence and causes our pre-activation to fail.

Figure 3-18: indirect jump destroys the branch instruction execution sequence

In figure 3-18, when we execute the upper Call instruction, the processor enters the subroutine and executes branch instruction 3 and then branch instruction 1. When we execute the second Call instruction, the processor enters the subroutine and executes branch instruction 3 again, but it pre-activates the branch target buffer entry holding branch instruction 1, not 2.

Since the branch target buffer is a cache-like data structure, conflicts sometimes happen. Several NBET entries may point to the same branch target buffer entry. If a conflict happens and the contents of that branch target buffer entry are replaced, those NBET entries hold invalid information. Nevertheless, when we access these NBET entries, we still use the information to pre-activate.

The branch target buffer is a cache of branch instructions. As in an instruction cache or data cache, conflict misses happen in the branch target buffer. When such a miss happens, a branch instruction is replaced and the corresponding NBET value is lost, which may make pre-activation fail.

In the one-direction pre-active policy, we pre-activate the next likely accessed entry according to the result of the branch predictor. If the branch predictor mispredicts, our pre-activation fails as well. In two-direction pre-activation, we do not depend on the prediction result, so this situation does not arise; however, when the NBET holds valid values for both the taken and not-taken paths, it pre-activates one unnecessary branch target buffer entry.

Chapter 4 Evaluation

4.1 Method

It is very difficult and time-consuming to re-design a processor and implement my proposed design in it. An alternative is to use a simulator to simulate the behavior of a processor. This is a very commonly used approach in architecture design, because it is more economical than re-designing a processor and is still accurate. By modifying the simulator, we can observe the results of my proposed design.

I evaluate my design with an execution-driven simulator. The simulator models the behavior of the components of a real processor. I gather the execution results of my proposed design by simulating its behavior.

4.2 Evaluation Metrics

In this research, I use the following metrics to evaluate my proposed design:

‧BTB leakage energy consumption

‧Performance loss

These two metrics are meaningful to users.

4.2.1 BTB Leakage Energy Consumption

The purpose of my proposed design is to save energy, and I focus on branch target buffer leakage energy. The BTB leakage energy consumption also includes the extra energy caused by additional hardware and by performance loss. Therefore, it is composed of the following terms:

‧BTB leakage energy

‧NBET energy

‧System leakage energy due to performance loss

‧Energy of extra control logics

We define the terms as follows:

‧ BTB leakage energy

Leakage energy consumption of the BTB.

‧ NBET energy

Leakage energy consumption of the NBET.

‧ System leakage energy due to performance loss

Performance loss adds extra execution cycles, and the system leakage energy increases because of the longer execution time.

‧ Energy of extra control logics

Some hardware is added to implement my design, and this extra hardware consumes dynamic energy. It includes reading and writing the NBET, the global and local counters that implement decay, and some additional logic circuits.

The leakage energy of BTB and NBET is calculated by the following equation:

( )

“Active cycles” is the number of cycles that a branch target buffer entry is in active mode, and “Active energy per cycle” is the leakage energy consumed by a branch target buffer entry during one processor cycle. The sum of the leakage energy consumption of every branch target buffer entry is the total leakage energy consumption of the branch target buffer.
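The per-entry accounting described above can be sketched numerically. The text names only the active-mode term; the drowsy-mode and transition terms in this sketch are assumptions based on the parameters in table 4-2, and the function names are hypothetical.

```python
from typing import Iterable, Tuple

# Values from table 4-2.
ACTIVE_PJ_PER_CYCLE = 0.33
DROWSY_PJ_PER_CYCLE = 0.0495
TRANSITION_PJ = 11.0


def entry_leakage_pj(active_cycles: int, drowsy_cycles: int = 0,
                     transitions: int = 0) -> float:
    """Leakage energy of one entry in pJ.

    The active term follows the text; the drowsy and transition terms
    are assumed additions.
    """
    return (active_cycles * ACTIVE_PJ_PER_CYCLE
            + drowsy_cycles * DROWSY_PJ_PER_CYCLE
            + transitions * TRANSITION_PJ)


def total_leakage_pj(entries: Iterable[Tuple[int, int, int]]) -> float:
    """Sum over entries given as (active_cycles, drowsy_cycles, transitions)."""
    return sum(entry_leakage_pj(a, d, t) for a, d, t in entries)
```

An entry that is active for 100 cycles contributes 33 pJ under this model, while an entry that is drowsy for the same 100 cycles and pays one transition contributes 4.95 pJ plus 11 pJ.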

4.2.2 Performance Loss

Performance loss means the increase in execution cycles. Switching from drowsy mode to active mode takes a finite period of time, the so-called “wake-up latency”. Here we assume the wake-up latency is one cycle. If we activate a drowsy branch target buffer entry on demand, the whole system must wait until the entry has switched to active mode. Because the execution time increases, the system leakage energy increases too. The purpose of my design is to hide the wake-up latency, so performance loss is another important metric of my design.

4.3 Environment

The architectural simulators used in this research are Simplescalar/Alpha 3.0 and xtrem1.0. Simplescalar/Alpha is a commonly used execution-driven simulator in the architecture design domain. It is a suite of tools for the Alpha ISA. Xtrem1.0 is a simulator derived from Simplescalar, but it targets the ARM ISA. Table 4-1 lists the main parameters of my simulation environment:

Most of the energy numbers are obtained from the power libraries in XTREM [12] tool set. The SRAM energy parameters of different power modes and the mode transition are listed in table 4-2 [13]. The number of execution cycles is obtained from Simplescalar/Alpha 3.0.

Table 4-1: Simulation parameters

Table 4-2: Leakage energy parameters

Parameter                              Value
Active leakage energy per BTB entry    0.33 pJ/cycle
Drowsy leakage energy per BTB entry    0.0495 pJ/cycle
Transition energy                      11 pJ

The benchmarks I selected in this research are Mibench and SPEC2000. Mibench is a free, commercially representative embedded benchmark suite consisting of six categories:

‧ Automotive and Industrial Control

‧ Consumer Device

‧ Network

‧ Office

‧ Security

‧ Telecommunications

SPEC2000 is another commonly used benchmark suite, aimed at high-end processors. I use these two kinds of benchmarks to examine the effectiveness of my design in different application domains.

4.4 Experimental Results

Figure 4-1 shows the BTB leakage energy consumption with the two-direction pre-activation policy. The Y-axis is the branch target buffer leakage energy consumption of my design normalized to the original branch target buffer leakage energy consumption. The X-axis shows my proposed design with different decay intervals.

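The leftmost bar of the figure is the ideal case. From the leakage energy parameters in table 4-2, putting an entry into drowsy mode for t cycles saves energy only if the active leakage over those cycles exceeds the drowsy leakage plus the transition energy:

$$0.33\,t > 0.0495\,t + 11 \quad\Longrightarrow\quad t > \frac{11}{0.33 - 0.0495} \approx 39.2 \text{ cycles}$$

For a branch target buffer entry, if the time between two successive accesses is more than 40 cycles, we gain leakage reduction by putting the entry into drowsy mode. If the time is less than 40 cycles, the entry should stay in active mode. The ideal case obeys this rule, always pre-activates accurately, and therefore has the best energy saving.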
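The 40-cycle rule can be checked with a small calculation. This sketch assumes that the 11 pJ transition energy of table 4-2 is paid once per drowsy period; the function names are hypothetical.

```python
# Values from table 4-2.
ACTIVE_PJ_PER_CYCLE = 0.33
DROWSY_PJ_PER_CYCLE = 0.0495
TRANSITION_PJ = 11.0  # assumed to cover one full drowsy period


def break_even_cycles() -> float:
    """Idle time above which going drowsy costs less than staying active."""
    return TRANSITION_PJ / (ACTIVE_PJ_PER_CYCLE - DROWSY_PJ_PER_CYCLE)


def ideal_gap_energy_pj(gap_cycles: int) -> float:
    """Leakage energy of one entry over an idle gap under the ideal policy."""
    stay_active = ACTIVE_PJ_PER_CYCLE * gap_cycles
    go_drowsy = DROWSY_PJ_PER_CYCLE * gap_cycles + TRANSITION_PJ
    # The ideal policy picks whichever mode is cheaper for this gap.
    return min(stay_active, go_drowsy)
```

Under these assumptions `break_even_cycles()` falls between 39 and 40, which agrees with the threshold used in the text: a 10-cycle gap is cheaper in active mode, and a 100-cycle gap is cheaper in drowsy mode.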

The rightmost bar is the simulation of the related work [5]. I compare my design with these two policies.

Figure 4-1: BTB leakage energy components with two-direction pre-activation of Mibench (X-axis: decay interval, with bars Ideal, 32, 64, 128, 256, 512, 1024, 2048, and Decay only 2K; Y-axis: BTB leakage energy; components: BTB_S_energy, trans_energy, S_overhead, D_overhead)

Figure 4-2: BTB leakage energy components with one-direction pre-activation of Mibench (X-axis: decay interval, with bars Ideal, 32, 64, 128, 256, 512, 1024, 2048, and Decay only 2K; Y-axis: BTB leakage energy; components: BTB_S_energy, trans_energy, S_overhead, D_overhead)

Figure 4-3: comparison of two-direction and one-direction policy in Mibench (X-axis: decay interval, 32, 64, 128, 256, 512, 1024, 2048; Y-axis: BTB leakage energy; series: Two-direction, One-direction)

Figure 4-4: BTB leakage energy components with two-direction pre-activation of SPEC2000 (X-axis: decay interval, with bars Ideal, 32, 64, 128, 256, 512, 1024, 2048, and Decay only 8K; Y-axis: BTB leakage energy; components: BTB_S_energy, trans_energy, S_overhead, D_overhead)

Figure 4-5: BTB leakage energy components with one-direction pre-activation of SPEC2000 (X-axis: decay interval, with bars Ideal, 32, 64, 128, 256, 512, 1024, 2048, and Decay only 8K; Y-axis: BTB leakage energy; components: BTB_S_energy, trans_energy, S_overhead, D_overhead)

Figure 4-6: comparison of two-direction and one-direction policy in SPEC2000 (X-axis: decay interval, 32, 64, 128, 256, 512, 1024, 2048; Y-axis: BTB leakage energy; series: Two-direction, One-direction)

4.5 Discussion

From figure 4-1 and figure 4-2, my design is about 5% better than the related work in Mibench. From figure 4-4 and figure 4-5, my design is about 14% better than the related work in SPEC2000. The difference comes from the characteristics of the benchmarks.

Mibench has smaller loops than SPEC2000, and its ratio of branch instructions is also smaller. The decay-only strategy already saves leakage energy well in Mibench, so although we switch the mode of branch target buffer entries more aggressively, the improvement is not obvious. In SPEC2000, my proposed design has a larger effect. Table 4-3 shows my best case compared with the related work on these two benchmarks.

Table 4-3: best situation in my design

Benchmark Strategy Energy saving Decay only

Mibench One-direction with

From figure 4-3 and figure 4-6, we also find that the one-direction pre-active policy is better than the two-direction pre-active policy. Although one-direction pre-activation loses some performance because of branch prediction errors, it halves the NBET size.

Putting branch target buffer entries into drowsy mode more aggressively may save more branch target buffer leakage energy, but it causes more mode switching. We find that as the decay interval decreases, the leakage energy of the branch target buffer decreases while the system leakage energy increases. A smaller decay interval is therefore not always better. In my experiments, the best decay interval is 128 cycles.

Performance loss is an important metric of my design. Table 4-4 lists the performance loss in my best case. My proposed design still maintains performance well.
