Chapter 1
Introduction
1.3 Organization of this Thesis
The rest of this thesis is organized as follows. Chapter 2 describes the research background. Chapter 3 explains the design details of iAIM with a line buffer, iAIM with a loop buffer, iAIM with a modified loop buffer, and the multiplexed bus under both single-cycle and multi-cycle instruction fetch architectures. Chapter 4 presents the evaluation methodology, experimental results, and discussion. Chapter 5 gives the conclusion and future work.
Chapter 2
Background and Related Works
In this chapter, we first describe the design idea and architecture of iAIM. We then present buffering mechanisms for saving instruction memory access power.
2.1 Intelligent Autonomous Instruction Memory
The author of [1] organizes the TLIM and the BTB as iAIM. Because program execution is largely sequential and the BTB can predict branch instructions, iAIM can generate instruction addresses by itself and then send the instructions to the CPU core. In addition, the author proposes two enhanced designs. The first equips iAIM with a partial instruction decoder that can calculate the branch target address by decoding the branch instruction. The second equips iAIM with a partial instruction decoder and a return stack.
The experimental results show that the three proposed designs reduce instruction address transmission to 97.71%, 98.49% and 99.99%, and reduce total bit transitions to 84.99%, 86.54% and 92.01%, respectively, compared with the conventional architecture. All three designs greatly outperform the T0 encoding technique. The third design slightly outperforms the T0 DAT technique with 128 entries.
2.1.1 Mechanism of iAIM
Program execution can be divided into sequential execution and taken-branch execution. Sequential execution accounts for about 85-90% of program execution, while taken-branch execution accounts for about 10-15%. When the program executes sequentially, the next instruction address is the current instruction address plus the instruction size. When the program does not execute sequentially, the next instruction address comes from a more complicated mechanism and has little relationship with the current instruction address. Taken branches can be divided into two categories: fixed-target branches and changing-target branches. Most taken branches are fixed-target branches, and these can be handled by the BTB. Therefore, the two basic ideas that allow iAIM to generate the next instruction address by itself are:
(1) For sequentially executed instructions, the next instruction address can be generated with the help of an adder.
(2) For fixed-target taken branches, the branch target address can be obtained with the help of the BTB.
Under the iAIM mechanism, the addresses of changing-target branches, besides procedure returns, are the remaining instruction addresses that the CPU core needs to transfer to iAIM.
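The two ideas above can be illustrated with a minimal Python sketch. It is not the implementation of [1]: the fixed 4-byte instruction size, the dictionary used as a BTB, and the names next_address and count_cpu_transfers are assumptions made only for illustration, and a BTB hit is simply treated as a predicted-taken branch.

```python
INSTR_SIZE = 4  # assumed fixed instruction size in bytes


def next_address(pc, btb):
    """Return (next_pc, predicted_taken) as iAIM would generate it.

    btb is a plain dict mapping a branch address to its fixed target;
    a hit is treated as "predicted taken" (a simplifying assumption).
    """
    target = btb.get(pc)
    if target is not None:
        return target, True          # idea (2): target comes from the BTB
    return pc + INSTR_SIZE, False    # idea (1): target comes from the adder


def count_cpu_transfers(trace, btb):
    """Count the addresses in trace that the CPU core would have to send."""
    transfers = 1  # the very first address always comes from the CPU core
    for prev, actual in zip(trace, trace[1:]):
        predicted, _ = next_address(prev, btb)
        if predicted != actual:
            transfers += 1           # iAIM guessed wrong or could not guess
    return transfers


# A small loop executed twice, then left through the fall-through path.
example_trace = [0x100, 0x104, 0x108, 0x100, 0x104, 0x108, 0x10C]
example_btb = {0x108: 0x100}
example_transfers = count_cpu_transfers(example_trace, example_btb)
```

In this toy trace only the initial address and the loop exit have to be supplied by the CPU core; every other address is produced locally by the adder or the BTB.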
2.1.2 Control Signal between CPU core and iAIM
Although iAIM can generate most instruction addresses by itself, a few instruction addresses still have to be transferred by the CPU core. In addition, because the BTB is moved from the CPU core into the instruction memory, the execution results of branch instructions need to be communicated between the CPU core and iAIM. We therefore introduce some additional control signals for transferring these messages. The additional control signals for the iAIM design are shown in Figure 2.1.
Figure 2.1: Additional control signals between CPU core and iAIM
P-Taken: (control signal from iAIM to CPU core)
P-Taken=1 : a predicted-taken branch was found in the last clock cycle
P-Taken=0 : otherwise
Because the BTB is moved into iAIM, iAIM needs to tell the CPU core the branch direction predicted by the BTB when a taken branch occurs.
S-Indicate: (control signals from CPU core to iAIM)
The messages that the CPU core needs to send to iAIM include:
(1) Which instruction address iAIM should use: the instruction address generated by iAIM itself or the instruction address transferred by the CPU core.
(2) When a branch misprediction happens, the CPU core needs to transfer the corrected instruction address to iAIM.
(3) A pipeline stall happens inside the CPU core.
After appropriate encoding, two additional control signals suffice to encode the four states that carry the messages mentioned above.
Autonomous (00):
The CPU does not transfer an instruction address to iAIM, and iAIM generates the instruction address by itself.
Pipeline Stall (01):
The CPU stalls for one clock cycle, and iAIM sends the same instruction as in the previous clock cycle.
Wrong Prediction (10):
The CPU detects a branch misprediction and sends the corrected instruction address to iAIM.
Compulsory (11):
Exceptional situations (including system initialization and procedure return); the CPU sends the instruction address to iAIM.
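The four S-Indicate states can be sketched as a 2-bit encoding together with the address selection it implies on the iAIM side. This is only an illustration under assumptions: the names SIndicate and select_fetch_address and the function arguments are hypothetical, and the sketch models the basic design, in which the CPU core sends the corrected address on a wrong prediction.

```python
from enum import IntEnum


class SIndicate(IntEnum):
    """2-bit S-Indicate encoding as listed in the text."""
    AUTONOMOUS = 0b00
    PIPELINE_STALL = 0b01
    WRONG_PREDICTION = 0b10
    COMPULSORY = 0b11


def select_fetch_address(s_indicate, self_generated, last_pc, cpu_address):
    """Pick the address iAIM fetches from in the current cycle.

    self_generated: address produced by the PC incrementer or the BTB
    last_pc:        address used in the previous cycle
    cpu_address:    address driven by the CPU core on the address bus
    """
    state = SIndicate(s_indicate)
    if state is SIndicate.AUTONOMOUS:
        return self_generated      # no address transfer is needed
    if state is SIndicate.PIPELINE_STALL:
        return last_pc             # resend the same instruction
    # WRONG_PREDICTION and COMPULSORY both take the CPU-supplied address.
    return cpu_address
```

For example, select_fetch_address(0b01, 0x104, 0x100, 0x200) selects the previous address 0x100, because the pipeline is stalled.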
2.1.3 Block Diagram of iAIM
The block diagram of iAIM is shown in Figure 2.2.
Figure 2.2: Block Diagram of iAIM
The PC Incrementer generates the instruction address when the program executes sequentially, and the BTB provides the branch target address when a taken branch occurs. The two 34-bit registers (PC-1(PCt-1, InBTBt-1, PTaken-1), PC-2(PCt-2, InBTBt-2, PTaken-2)) are designed for updating the branch prediction result, because the actual branch execution result becomes available at the third pipeline stage. The Last PC input (purple line), which comes from PC-1, is used when a pipeline stall happens in the CPU core. The Address from CPU input (red line), which comes from the CPU core, is used to transfer the instruction address when it cannot be generated by iAIM itself.
The partial decoder is designed to recognize some special instructions, such as branch instructions, procedure calls and procedure returns. With the help of the partial decoder, iAIM is able to generate both branch outcomes (the branch target address and the fall-through address) in time. As a result, when a branch misprediction happens, the CPU core does not need to transfer the corrected instruction address; it only needs to inform iAIM that the misprediction has happened. The Return Stack is used to save the return addresses of procedure calls.
When the program enters a new procedure, the return address is pushed onto the return stack. When the program leaves the procedure, the return address is popped from the return stack. With the help of the partial decoder, iAIM is able to detect procedure calls and procedure returns. With the return stack, the return address also does not need to be transferred from the CPU core, so the traffic from the CPU core is reduced further.
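The push and pop behaviour of the return stack can be sketched as follows. The stack depth, the overflow policy (dropping the oldest entry), the 4-byte instruction size and the class name ReturnStack are assumptions for illustration; the source does not specify them.

```python
class ReturnStack:
    """Minimal model of a return stack kept inside iAIM (illustrative only)."""

    def __init__(self, depth=8):
        self.depth = depth   # assumed capacity
        self.entries = []

    def on_call(self, call_pc, instr_size=4):
        """Push the return address when the partial decoder sees a call."""
        if len(self.entries) == self.depth:
            self.entries.pop(0)          # assumed policy: drop the oldest entry
        self.entries.append(call_pc + instr_size)

    def on_return(self):
        """Pop the return address when the partial decoder sees a return.

        Returns None when the stack is empty, in which case the CPU core
        would have to send the address itself.
        """
        return self.entries.pop() if self.entries else None
```

A nested call at 0x100 followed by a call at 0x200 returns first to 0x204 and then to 0x104, without any address being sent by the CPU core.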
2.2 Line Buffer
Several methods can be used for saving IM power, such as drowsy cache, cache decay and buffering mechanisms. However, drowsy cache [6] and cache decay [7] are accompanied by performance degradation.
In [6], drowsy cache saves instruction memory power by lowering the supply voltage of cache lines. The contents of a cache line are not destroyed by the low supply voltage, but accessing them takes 1-2 additional clock cycles because the line must first be restored to the high-power mode. This incurs a system performance loss.
In [7], cache decay saves instruction memory power by gating the supply voltage of cache lines. The contents of a cache line are destroyed because the supply voltage is gated.
The basic idea of cache decay is that cache lines go through a period of "dead time" before they are evicted from instruction memory. The author tries to turn off these cache lines when they enter their dead time. The drawback of this approach is that the state of a cache line is lost when it is turned off, and reloading it from the level 2 cache has the potential to negate any energy savings and to have a significant impact on performance. Therefore, we adopt the buffering mechanism for saving IM access power.
In [2], the author proposes the line buffer design. A line buffer is a buffer of cache-line size placed in front of the instruction cache to capture temporal and spatial locality of reference. The block diagram of the line buffer is shown in Figure 2.3.
Figure 2.3: Block Diagram of Line Buffer
The mechanism of the line buffer is as follows:
(1) Once an instruction is accessed, the line containing that instruction is transferred to the line buffer.
(2) The next access, if sequential, is served from the line buffer instead of the cache.
With the mechanism described above, instruction cache access power can be saved.
The advantage of the line buffer design is that instruction reuse can start quickly if the subsequent instruction is on the same cache line.
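The saving can be made concrete with a small counting model. It is a sketch under assumptions, not the design of [2]: the 16-byte line size and the function name count_cache_accesses are hypothetical, and a fetch is treated as a line buffer hit whenever it falls on the currently buffered line.

```python
LINE_SIZE = 16  # assumed cache line size in bytes


def count_cache_accesses(addresses, line_size=LINE_SIZE):
    """Count instruction cache accesses when a single line buffer is used."""
    buffered_line = None
    cache_accesses = 0
    for addr in addresses:
        line = addr // line_size
        if line != buffered_line:
            cache_accesses += 1      # miss: the whole line is transferred
            buffered_line = line
        # otherwise the fetch is served from the line buffer
    return cache_accesses


# Eight sequential 4-byte instructions span two 16-byte lines.
sequential_fetches = list(range(0, 32, 4))
accesses_with_buffer = count_cache_accesses(sequential_fetches)
```

Without the buffer every one of the eight fetches would access the cache; in this model only the first fetch on each line does.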
However, the line buffer may not be appropriate for our design. A line buffer needs to be placed close to the instruction cache, because transferring a whole cache line implies a high communication bus cost between the line buffer and the instruction cache. This cost can be ignored in the original line buffer design because the line buffer is placed close to the instruction cache. When we introduce the line buffer into our design, however, the line buffer and the instruction cache are placed in two separate blocks, which contravenes the design idea of the line buffer.
2.3 Loop Buffer
2.3.1 Basic Idea of Loop Buffer
The loop buffer [3] uses a tagless buffer to store the instructions of an innermost loop. The design exploits the observation that programs spend most of their execution time in loops. With a loop buffer, the CPU core can fetch instructions from the loop buffer instead of the instruction cache, which saves instruction cache access power. The author of [3] introduces a three-phase buffer management mechanism, illustrated in Figure 2.4:
Figure 2.4: Memory Accessing in Different States
The operation consists of three states: IDLE, FILL and ACTIVE. In Figure 2.4, the gray rectangle and the solid black line mark the block and the bus that are currently being accessed in each state. When the CPU core initializes or resets, the Loop Buffer Controller (LBC) first enters the IDLE state. In the IDLE state, the LBC tries to detect an innermost loop. When one is detected, the LBC enters the ACTIVE state if the same innermost loop has already been filled into the loop buffer; otherwise it enters the FILL state. The state diagram of the LBC’s finite state machine (FSM) is shown in Figure 2.5.
Figure 2.5: State Diagram of Loop Buffer Controller
The transitions between states basically depend on loop detection. In addition, when a branch misprediction happens, the state goes back to IDLE (J, G). The LBC can also handle the case where the loop is larger than the loop buffer (F, K). Under this mechanism, the loop buffer can be controlled with only a few control signals. Since only three states are needed to manage loop buffer access, two extra control signals are sufficient (2^2 = 4 > 3).
To implement the above state diagram, additional hardware is proposed. An extra register, L_addr, records the start or end address of the innermost loop stored in the loop buffer.
Another register, L_leng, records how many instructions are stored in the loop buffer as instructions are filled in.
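The three-state controller and its two registers can be sketched as a small state machine. This is a simplified reading of the text, not the state diagram of Figure 2.5: the event names, the choice to return to IDLE when the loop does not fit, and the invalidation of a partially filled buffer are assumptions.

```python
from enum import Enum


class State(Enum):
    IDLE = 0
    FILL = 1
    ACTIVE = 2


class LoopBufferController:
    """Simplified LBC model with the L_addr and L_leng registers."""

    def __init__(self, capacity):
        self.capacity = capacity     # loop buffer size in instructions
        self.state = State.IDLE
        self.l_addr = None           # identifies the buffered innermost loop
        self.l_leng = 0              # number of instructions buffered so far

    def on_loop_detected(self, loop_addr):
        """A taken backward branch closing an innermost loop was seen."""
        if self.l_addr == loop_addr and self.l_leng > 0:
            self.state = State.ACTIVE            # loop is already buffered
        else:
            self.l_addr, self.l_leng = loop_addr, 0
            self.state = State.FILL

    def on_fetch(self):
        """One instruction is fetched from instruction memory."""
        if self.state is State.FILL:
            if self.l_leng == self.capacity:
                self._invalidate()               # loop larger than the buffer
            else:
                self.l_leng += 1

    def on_misprediction(self):
        """A branch misprediction sends the controller back to IDLE."""
        if self.state is State.FILL:
            self._invalidate()                   # partial contents are unusable
        self.state = State.IDLE

    def _invalidate(self):
        self.l_addr, self.l_leng = None, 0
        self.state = State.IDLE
```

With three states, two control bits are enough, which matches the count given above.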
2.3.2 Loop Buffer with Forward Branch
In the previous loop buffer design [3], the loop buffer can only handle innermost loops. For innermost loops containing a forward branch or a procedure call, the loop buffer can only buffer instructions up to the forward branch or procedure call. The instructions after the forward branch or procedure call must be fetched from instruction memory without reuse. This greatly reduces the efficiency of the loop buffer.
Figure 2.6: Profile the Execution Time of MiBench
As Figure 2.6 shows, only 32.90% of the whole program can be completely buffered in the loop buffer, and about 38.32% of the whole program cannot be completely filled into the loop buffer. If this problem can be solved, the efficiency of the loop buffer will be greatly improved.
Many ideas have been proposed to solve this problem. Nevertheless, design complexity dictates that most loop buffer designs store only innermost loops without forward branches, or only the instructions within an innermost loop that precede a forward branch or procedure call. [4] proposes a simple and effective way to cope with this complexity. The author assumes that a BTB is the norm in most designs; if an extra bit is added to the BTB to indicate whether the loop buffer stores the fall-through trace or the target trace after a forward branch within the innermost loop, then much of the complexity can be avoided.
To store a subroutine that contains no loop in the loop buffer, the loop buffer must handle procedure calls and procedure returns. The basic idea for handling a subroutine return is to fill, but disregard, the invalid instructions that follow the subroutine return. Since a subroutine return is an always-taken branch, the CPU core will fetch G invalid instructions.
Depending on when the subroutine return is detected by the CPU core, G takes one of two values:
(1) G is the number of pipeline stages between IF and EXE; or
(2) G is the number of pipeline stages between IF and the instruction decoder (ID).
These invalid instructions are automatically flushed by the CPU core and do not affect the correctness of the program. If the G invalid instructions are also stored in the loop buffer, the instruction fetch sequence is preserved. This allows the loop buffer to store forward branches or procedure calls inside an innermost loop.
2.4 Nested Loop Buffer
A closer look at Figure 2.6 shows that about 8.56% of the whole program can still be improved. The problem is handling nested loops inside the loop buffer. The author of [5] observes that the backward branch (BB) instructions of a nested loop are used in a "first in, last out" manner, just like a stack.
The BBs of a nested loop are used as follows:
After the program executes the BB of the outer loop once, it must completely execute the BBs inside the outer loop before it executes the outer BB again.
The mechanism for mapping a nested loop onto a stack is as follows.
When a loop is entered, its BB instruction is pushed onto the top of the stack. When the loop exits, this BB instruction is popped from the stack.
Each stack entry contains five components:
BB instruction address, fill bit, up index, down index, and up instruction no.
These are illustrated in Figure 2.7-1 and Figure 2.7-2:
Figure 2.7-1: Code Sequence of a Nested Loop
Figure 2.7-2: Loop Buffer Storing Nested Loop
Fill bit:
Records whether the segment between the exit of the inner loop and the exit of the adjacent outer loop has been filled into the buffer.
Up/Down index:
Points to where the upper/lower part of the loop body is stored inside the loop buffer.
Up instruction no:
Records the instruction count of the upper part of the loop body stored inside the loop buffer.
An example illustrating how a stack and its entries are used to control the buffering of a nested loop is shown in Figure 2.8.
Figure 2.8: An Example of Buffering Nested Loop
Before presenting this example, we first define some notation. S(L) means that the processor is executing the backward branch instruction of the closest loop L and that the next state of the Loop Buffer Controller is S, where S is one of I (IDLE), F (FILL) and A (ACTIVE). top points to the top of the stack. X and Y stand for the nested loops shown in Figure 2.7-1.
The example begins in the IDLE state with top equal to 0, as shown in Figure 2.8(a). When the backward branch instruction BB_X of loop X is detected for the first time, the controller enters the FILL state and pushes the address of BB_X onto the stack, as shown in Figure 2.8(b). The value of fill is set to 0 because the instructions of loop X have not been filled into the buffer yet. In the FILL state, loop X executes its second iteration and its instructions are buffered. When BB_X is executed the second time, the address on top of the stack is compared with the PC. Since they are identical, this BB_X instruction becomes a triggering backward branch instruction. The state of the controller is set to ACTIVE and the fill bit is set to 1, as shown in Figure 2.8(c). In the subsequent iterations, the controller remains in the ACTIVE state until BB_X is executed and found to be not taken, i.e. loop X exits. When the processor exits from loop X, the stack is popped. The result is shown in Figure 2.8(d).
When the backward branch instruction BB_Y at the end of loop Y is executed for the first time, the address of BB_Y is pushed onto the top of the stack and the controller moves to the FILL state, as shown in Figure 2.8(e). Since loop Y has not been filled into the buffer yet, the value of fill is set to 0. When loop X inside loop Y is executed the second time and BB_X in loop X is executed the second time, we know that loop X is already in the buffer.
Hence, the value of fill is set to 1 and the controller moves to the ACTIVE state, as shown in Figure 2.8(f). When loop X exits, the stack is popped. At the same time, the controller checks whether the value of top-1 equals 0. If not, there is an enclosing loop, and the controller checks whether the fill field of entry top-1 of the stack is 0 (not filled) or 1 (already filled).
Since the value is 0, the controller enters the FILL state, as shown in Figure 2.8(g), and the segment between BB_X and BB_Y is filled into the buffer. When BB_Y is executed the second time, its address is compared with the top of the stack and found to be equal. This backward branch instruction becomes the triggering backward branch and the controller goes to the ACTIVE state. Since loop Y is now in the buffer, the fill field is set to 1, as shown in Figure 2.8(h).
When loop Y exits, the top of the stack is popped, as shown in Figure 2.8(i).
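The walkthrough of Figure 2.8 can be replayed with a short Python sketch. It is a simplified model under assumptions, not the controller of [5]: each stack entry keeps only the BB address and the fill bit (the up index, down index and up instruction no are omitted), returning to IDLE when the stack becomes empty is an assumption, and cases outside the walkthrough, such as re-entering an inner loop that is already buffered, are not modelled. The class and method names are hypothetical.

```python
class NestedLoopController:
    """Stack-based controller that follows the steps of Figure 2.8."""

    def __init__(self):
        self.stack = []          # entries are [bb_address, fill_bit]
        self.state = "IDLE"

    def on_backward_branch(self, bb_addr, taken):
        """Process one executed backward branch and return the next state."""
        top = self.stack[-1] if self.stack else None
        if taken:
            if top is not None and top[0] == bb_addr:
                top[1] = 1                       # triggering backward branch
                self.state = "ACTIVE"
            else:
                self.stack.append([bb_addr, 0])  # first time this loop is seen
                self.state = "FILL"
        elif top is not None and top[0] == bb_addr:
            self.stack.pop()                     # the loop exits
            if not self.stack:
                self.state = "IDLE"              # assumed: no enclosing loop
            elif self.stack[-1][1] == 0:
                self.state = "FILL"              # enclosing loop not yet filled
            else:
                self.state = "ACTIVE"
        return self.state


# Event sequence of Figure 2.8(b)-(i): inner loop X, outer loop Y.
events = [("X", True), ("X", True), ("X", False), ("Y", True),
          ("X", True), ("X", True), ("X", False), ("Y", True), ("Y", False)]
controller = NestedLoopController()
states = [controller.on_backward_branch(bb, taken) for bb, taken in events]
```

The list states then holds the controller state after each backward branch, which can be compared step by step with panels (b) to (i) of the figure.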
Chapter 3
Design of Proposed Architecture
In this chapter, we propose our architecture. Section 3.1 presents the architecture of AIM with a loop buffer. Section 3.2 presents the architecture of AIM with a modified loop buffer that handles the nested loop problem. Section 3.3 describes how the instruction address bus and the instruction content bus are merged into a unified multiplexed bus under the single-cycle and multi-cycle instruction fetch architectures.
3.1 Proposed Design of AIM with Loop Buffer
3.1.1 Challenges in Design
When a loop buffer is added to the CPU core for saving IM power, the problems of buffer maintenance and instruction content source selection are introduced and need to be solved.
These problems arise because the loop buffer and the loop buffer controller are separated. To integrate the loop buffer into a system with AIM and save instruction content bus power, the loop buffer has to be placed inside the CPU core. The loop buffer controller, in contrast, needs to be placed inside AIM, because it needs branch information from the BTB and must control IM access and instruction content bus activity. The buffer maintenance problem means that the buffer controller needs to enter and update buffer entries inside the CPU core. The instruction content source selection problem means that the CPU core needs to know when to use a previously saved instruction inside the buffer and when to use the value placed on the Instruction Content Bus by AIM.
3.1.2 Key Ideas in Design
To overcome the design challenges described in the previous section, our basic design idea is to add additional control buses and to design an efficient communication protocol, as shown in Figure 3.1.
Figure 3.1: Primitive Design Idea of AIM with Loop Buffer
3.1.3 My Proposed Design
Based on the ideas proposed in previous section and the ideas proposed in loop buffer [3], to integrate loop buffer into system with AIM, what we need to do is to increase 2 additional