A DVANCED B RANCH H ANDLING - DESIGN FOR ASYNCHRONOUS VLIW PROCESSOR

CHAPTER 4 DESIGN FOR ASYNCHRONOUS VLIW PROCESSOR

4.3 A DVANCED B RANCH H ANDLING

The general control hazard handling grouped into two types, software solution and hardware solution. The software solution is simple way that inserts enough NOP instructions after branch instruction in compile time. Stalling pipeline is the hardware

solution. As we mentioned before, Muller pipeline is 50% utilization. However, we need to pass one instruction in our processor architecture. To insert NOP or stall pipeline will take 2 steps to pass a useless instruction. We design an efficient method that takes characteristic of 50% utilization.

4.3.1 Flush and Stall Methods

In figure 4.9, we assumed that instruction sequence is BEQ, ADDI, SUB……. In flush method, the ADDI instruction will be flushed as NOP instruction. It takes 2 steps (NOP + bubble) to flush it. The improved method changed the original ADDI instruction into bubble, thus finished the action of return to zero. Since we do not have another instruction between BEQ and SUB, there is no more bubble. We save the original bubble, so we make it efficiently.

Figure 4.9：Branch handling methods

In figure 4.10, we show the pc handling part. Since we have buffer to store three instructions, the “pc” in the PF-latch is not the address of current instruction. The pc

BEQ BUBBLE ADDI BUBBLE SUB BUBBLE

BEQ BUBBLE ADDI BUBBLE SUB BUBBLE NOP

X

(a) Flush method

BEQ BUBBLE ADDI BUBBLE SUB BUBBLE

BEQ BUBBLE ADDI BUBBLE

SUB BUBBLE

X

(b) Improved method

value represents the current program counter. We need to calculate the corresponding address therefore and pass to the next stage. To avoid potential problem, we do not process the branch instruction in decode stage but in the next stage, exe1. If we process branch instruction in decode stage, the core may process another instruction in prefetch stage. That means we read/write pc register in the same time. It violated the principle that we said before. Another problem is that the dispatch stage is definitely bubble token when the decode stage is valid token. So we can only flush instruction in prefetch stage. It makes the branch handling much harder.

Figure 4.10：PC selection architecture

However, we send branch target address and stall control signal to dispatch stage from exe1 stage. The stall block is made by AND gate and connect with stall ctl and the dispatch output (instructions before DP-latch). Therefore, the stall ctl can make any instructions here be empty token. We also write the target address to pc register instead of original pc which

Prefetch

from dispatch stage.

4.3.2 Special Case

For example, our BEQ instruction format is BEQ, rd1, rd2, address, L-bit, P-bit. P-bit is always 0 for branch relative instructions. If rd1 = rd2 then change pc into pc + address.

Since 2-way VLIW architecture, each pc address points to 2 instructions. L-bit (1-bit) is used to indicate position. The normal case is L-bit = 1. When L-bit = 0, we select second instruction of target instruction packet, and we should abandon first one. So we actually write the valid bits into 3’b000 (the target instruction is single-handed in this case, so the buffer will all be empty). When L-bit is 0, the compiler should avoid the situation that the target instruction can be parallelly executed. If not, then it needs 2 times of memory access to load these instructions. Therefore, it loses the benefits of parallel executing. In our approach, we assumed that compiler will not let L-bit = 0 and P-bit = 1 in the same time.

There are two solutions for compiler. We can force the P-bit of target instruction be 0. Or we can insert NOP to the original target address. However, the problem results from VLIW architecture. The special case needs 2 times memory access basically. So it is unnecessary to handle it by hardware if the compiler can solve the problem.

4.3.3 Different Timing Cases

No matter the branch handling method is, there is a critical issue must be guaranteed, the timing issue. Because of the branch handling is certain to communicate between different stages in pipeline architecture. For our branch handling method, the possible combination of valid token and empty token are shown as figure 4.11. We send branch address and stall control signal in exe1 stage so the exe1 is valid token.There are 3 possible conditions.

Figure 4.11：Three possible valid token conditions

Case (a) is an ideal condition and we explained it already. (b) is the case that the instruction which is after branch run slower so that it is still in prefetch stage. In this case, the stall control signal does nothing because it changes empty to empty. Dispatch stage will not receive the signal in the next step obviously. It does not need to flush in the next step anyway. However, the target address still sends to dispatch stage which is empty token actually. The pc_mux will select valid token when the other is empty token so it will write target address to the pc register whatever. Only when all the inputs are valid token then it depends on target address. However, whether the instruction is branch or not, the branch target address should be valid token. Thus we keep the address zero of instruction memory to indicate branch taken or non-taken. If target address is 0, pc_mux will not select address from target address. Case (c) is another special case. It feeds one valid token (branch) first, and feeds next one after finishing handling. Hence it is a simple situation just works like single-cycle machine so that there is no problem.

However, it still has some timing constraints that we referred in chapter 4.6.

(a) (b) (c)

32 pc_mux2 will select the address from interrupt table. After pc changes into interrupt service routine, the routine will save pc and another registers first by push them into memory. After finishing routine, pop them from memory and write back pc.

4.5 Data Flow Chart

Figure 4.12：PC selection flow chart

We show the simple flow chart to expound our design in figure 4.12 and figure 4.13. In

Interrupt No

在文檔中非同步雙道超大指令字組處理器之指令壓縮設計 (頁 37-42)