Index Buffer Approach - C OMPRESSION I MPLEMENTATION

CHAPTER 4 DESIGN FOR ASYNCHRONOUS VLIW PROCESSOR

4.2 C OMPRESSION I MPLEMENTATION

4.2.4 Index Buffer Approach

Beside the FIFO approach, we use 3 buffer slots to store instructions and we use two stages to implement instruction decompression. The main idea is that if we can choice the source of output instructions, we do not need to move buffers. Thus we reduce the area cost and moving overhead. In figure 4.6, there are 2 stages before the decode stage. In the following, when we indicate something is x bits length, it means x data bits with 2x wires due to the dual-rail implementation.

The first stage (called PF stage) gets instructions (64bits) from instruction memory and puts it into proper buffer according to valid bits (3bits). Valid bits are used to indicate each buffer is empty or not. The second stage (called DP stage) dispatches instructions into the expected sequence.

We load instructions from memory into corresponding buffer with sequence:

buf0->buf1->buf2->buf0…. , we use index 2bits to indicate the positions. In fact, to put instructions into proper buffer does not reference index. We can make determination only by valid bits. We just read index in the first stage and pass to the next stage by latch. Index bits are reference by DP stage.

Figure 4.6：Index buffer approach

Figure 4.7：Initial status

We will give an example to show how the mechanism works as illustrated in figure 4.7 and figure 4.8. In figure 4.7, instruction 6 and instruction 7 can be parallelly executed and

Instruction Memory

the other can not. What we do is to dispatch the expected instruction sequence before decode stage. The buffer is written by PF stage and is read by PF stage. The tag bits are written by DP stage and are read by PF stage. We assume the valid token and empty token are interlaced between stages in the following figure 4.8(a). To illustrate the instance, we named each token handling step.

Figure 4.8：(a) Continuous single-handled instruction – part I Buf0

Figure 4.8：(b) Continuous single-handled instruction – part II

In step1, the main job is to load instruction memory with pc address. To choice the target position, it read valid bits to reference. Finally, valid bits, index bits and pc are put into the PF-latch. In step2, the value which is in the PF-latch passes to the next stage. The index = 0, so we load instruction from buf0 and buf1. Then we checked the parallel bit of the loaded instruction (buf0) is 0. So we do not take the value of buf1 but insert NOP (0x00000000) to instead. Because we take one instruction from buffer, the index plus one

Buf0

and write into index register. Since the original valid value is 3’b000, we know we would load new one in the previous step and change it into 3’b011 (v2v1v0). Furthermore, we load one instruction from buffer in this step, change it into 3’b010 and write back to the valid register. In step 4, there are only one empty buffer slot after taking one instruction. So the pc value will not be changed. It cause the buffer content will not be changed in the next

In figure 4.8 (b), the step 5 does not load data from instruction memory either put into the buffer since the buffer has no enough space. Therefore, this step is fast to pass than step 1 or step 3.

In figure 4.8 (c), the difference is in step 12. Instruction 6 and instruction 7 can be parallelly executed. Because the parallel bit of instruction 6 is 1, we also load the next instruction ( (index + 1) % 3 = 2’b00 ). To reduce unnecessary overhead, instruction 6 and instruction 7 only can be placed in sequence as figure. The result is not different between Ins.6 | Ins.7 or Ins.7 | Ins.6. In the example, the single-handed instruction is always placed in the left side. However, no matter the instructions are single-handed or not, 3 buffer slots is enough to store them without 2 times memory access.

As we mentioned before, the dispatch stage writes new valid bits and index bits into registers, and the value is referenced by next step. We renew the registers in two steps.

Table 4.1 shows the valid bits changing rule. For example, if the current instruction is single-handed (p-bit = 0) and the valid bit (from PF-latch) is 3’b100 the new valid bits will be 3’b011. Because we know we would load instructions into 2 buffers in the previous step and we will take one instruction from buffer. However, v2v1v0 will not be 3’b111 due to the dispatch will consume one instruction at least. By the way, to reference index bits here is unnecessary. Table 4.2 shows the index bits changing rule. Index = (index + 1 + p)%3.

However, when tag = 3’b000 it will be reset to 2’b00. The purpose of reset is to maintain the same rule especially the branch happened, we let the mechanism starts from buf0 when the slot are all empty. Furthermore, L-bits is modified by exe1 stage.

Table 4.1 : Valid bits changing rule p-bit

Table 4.2 : Index bits changing rule P

In fact, we have the instruction rules to decide the function unit in compile time. Not all the instructions can be executed in each function unit but almost. We can see the op code in table 4.3, the memory related instructions can only be executed in MEM function unit.

Since if these instructions can be parallelly executed, the memory read/write port will increase and we need to handle the extra dependency problem. This rule makes the circuit much simple.

When the instruction is single-handed, we only compare the first bit and second bit so that we can select the entry of function units. When the bits = 2’b11, the instruction will be put in the right side for MEM function unit. The others are put in the left side for MAC function unit even it also can be executed in the right side. Besides, there is another benefit.

For the single-handed instructions, they can be executed in each MAC or MEM function

在文檔中非同步雙道超大指令字組處理器之指令壓縮設計 (頁 30-36)