
Chapter 2 Background

2.1 Asynchronous Circuits

2.1.5 Memory Interface

We use synchronous memory as the instruction memory because asynchronous memory is expensive. We therefore need a memory interface between the synchronous and asynchronous circuits, as shown in Figure 2.6. The multi-input C-element is also replaced by the alternative one shown in Figure 2.5.



Figure 2.6: Instruction memory interface

2.1.6 Bypass Data Path

One of the advantages of the asynchronous design style is its low power requirement. We gave an example earlier, and Figure 2.7 shows how it actually happens. In this figure, we let data pass through just one function block at a time; the other function blocks only pass an empty token (all data bits are 0), so the irrelevant portions obviously waste no power. Moreover, if we receive a NOP instruction, we simply pass it to the next stage without unnecessary calculation, which is faster and consumes less power. This is a practical application of the asynchronous design concept.

[Figure 2.6 content: the instruction memory signals addr[0]..addr[12] and val[0]..val[63] enter the prefetch stage through a C-element and a matched delay, becoming the dual-rail pairs a'[i].t/a'[i].f and v'[i].t/v'[i].f.]


Figure 2.7: Bypass Data Path

[Figure 2.7 content: a DeMUX, driven by a control signal, steers the data input through exactly one of function blocks 1..N to the data output.]
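To make the token-flow idea concrete, the following is a minimal Python sketch of the bypass path; the function names and the integer token encoding are illustrative, not part of the processor.

    # Toy software model of the bypass data path in Figure 2.7.
    # A demux steers the input token to exactly one function block;
    # every other block receives the empty token (all data bits 0)
    # and therefore performs no computation.

    EMPTY = 0  # all-zero codeword stands for the empty token

    def demux(data: int, select: int, n_blocks: int) -> list[int]:
        """Route data to block `select`; all other blocks see the empty token."""
        return [data if i == select else EMPTY for i in range(n_blocks)]

    def function_block(token: int, op) -> int:
        """A block computes only on a valid token; an empty token passes through."""
        return op(token) if token != EMPTY else EMPTY

    # Example: three blocks, only block 1 (a shifter, say) actually computes.
    ops = [lambda x: x + 1, lambda x: x << 2, lambda x: x ^ 0xFF]
    outputs = [function_block(t, f) for t, f in zip(demux(0b1011, 1, 3), ops)]
    print(outputs)  # [0, 44, 0]: only the selected block produced a result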


Chapter 3 Related Work

Since a RISC VLIW processor has a fixed instruction length, the instruction packet is usually not fully utilized. This means many useless NOP instructions occupy space in the instruction packet, yet they still have to be stored in instruction memory, which is wasteful. Moreover, reducing the program code size implicitly reduces the number of transfers between the instruction memory and the CPU, so a program completes much more quickly. For these reasons, there are numerous studies dealing with this problem, which is called instruction compression. We briefly introduce some general methods in the following.

3.1 Common Instruction Compression

Although instruction compression reduces wasted memory, it increases hardware cost, so it is a trade-off between memory space and hardware cost. Academic research even employs data compression technology at expensive cost, whereas commercial processors usually take simpler approaches.

3.1.1 Data Encoding Scheme

Because a program is also composed of binary code (machine code), some people apply data encoding schemes to instruction compression. The basic idea is to replace the original code with fewer bits; the compression ratio is determined by how often codes repeat. Obviously, this requires creating a table that stores the correspondence. LZW [8] and Huffman coding are two common data compression methods, and instruction compression in many related papers is based on them [9].


Figure 3.1: (a) LZW compression algorithm

    STRING = get input character
    WHILE there are still input characters DO
        CHARACTER = get input character
        IF STRING + CHARACTER is in the string table THEN
            STRING = STRING + CHARACTER
        ELSE
            Output the code for STRING
            Add STRING + CHARACTER to the string table
            STRING = CHARACTER
        END of IF
    END of WHILE
    Output the code for STRING

Figure 3.1: (b) LZW decompression algorithm

    Read OLD_CODE
    Output OLD_CODE
    WHILE there are still input characters DO
        Read NEW_CODE
        IF NEW_CODE is not in the translation table THEN
            STRING = get translation of OLD_CODE
            STRING = STRING + CHARACTER
        ELSE
            STRING = get translation of NEW_CODE
        END of IF
        Output STRING
        CHARACTER = first character in STRING
        Add OLD_CODE + CHARACTER to the translation table
        OLD_CODE = NEW_CODE
    END of WHILE


LZW:

Lempel and Ziv proposed a data compression approach in 1977, and Terry Welch refined it in 1984. The LZW algorithm is shown in Figure 3.1 (a) and (b). Like many data compression methods, the LZW algorithm looks up a translation table as a dictionary, but compression and decompression each build their own translation table. This means decompression needs only the encoded data, without any table information; the table is recreated automatically. The translation table thus has dynamic contents, added step by step, which is why the LZW algorithm suits instruction compression.
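As a concrete illustration of the pseudocode in Figure 3.1, here is a minimal runnable Python sketch of LZW; it follows the figure directly, so the decompressor rebuilds the table from the encoded data alone.

    def lzw_compress(data: str) -> list[int]:
        # The string table starts with all single characters (256 byte values);
        # longer strings are added as they are seen, as in Figure 3.1(a).
        table = {chr(i): i for i in range(256)}
        string, out = "", []
        for ch in data:
            if string + ch in table:
                string += ch
            else:
                out.append(table[string])
                table[string + ch] = len(table)
                string = ch
        if string:
            out.append(table[string])
        return out

    def lzw_decompress(codes: list[int]) -> str:
        # The decompressor rebuilds the same table one step behind the
        # compressor, so only the encoded data is needed (Figure 3.1(b)).
        table = {i: chr(i) for i in range(256)}
        old, out = codes[0], [chr(codes[0])]
        for new in codes[1:]:
            # Special case: a code not yet in the table must be the previous
            # string extended by its own first character.
            entry = table[new] if new in table else table[old] + table[old][0]
            out.append(entry)
            table[len(table)] = table[old] + entry[0]
            old = new
        return "".join(out)

    codes = lzw_compress("ABABABABA")
    print(codes)                  # [65, 66, 256, 258, 257]
    print(lzw_decompress(codes))  # ABABABABA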

Huffman:

We can also apply the Huffman scheme to the instruction format. The main idea is to analyze the instruction mix of a specific application program using benchmarks. For example, if we want to compress the opcode, we give frequently used instruction types fewer bits and rarely used instruction types more bits.
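A minimal sketch of this idea, assuming made-up opcode frequencies obtained from profiling; it uses Python's heapq to build Huffman codes, giving the most frequent instruction type the shortest bit string.

    import heapq

    def huffman_codes(freqs: dict[str, int]) -> dict[str, str]:
        """Frequent symbols receive shorter bit strings."""
        # Heap entries: (frequency, tie-breaker, {symbol: code-so-far}).
        heap = [(f, i, {sym: ""}) for i, (sym, f) in enumerate(freqs.items())]
        heapq.heapify(heap)
        count = len(heap)
        while len(heap) > 1:
            f1, _, c1 = heapq.heappop(heap)
            f2, _, c2 = heapq.heappop(heap)
            # Prefix '0' over one subtree and '1' over the other, then merge.
            merged = {s: "0" + c for s, c in c1.items()}
            merged.update({s: "1" + c for s, c in c2.items()})
            heapq.heappush(heap, (f1 + f2, count, merged))
            count += 1
        return heap[0][2]

    # Hypothetical opcode mix measured from a benchmark.
    profile = {"ADDI": 40, "LW": 25, "SW": 20, "BEQ": 10, "XORI": 5}
    for op, code in sorted(huffman_codes(profile).items(), key=lambda kv: len(kv[1])):
        print(op, "->", code)  # ADDI gets the shortest code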

Most data compression methods make the instruction length variable, and in general they trade compression ratio against area overhead. Given the expensive cost of data compression methods, commercial products focus on balance. The following approach is a different idea that keeps a fixed instruction length; it is based on a characteristic of the VLIW architecture.

3.1.2 Commercial Products

The VelociTI architecture is a very long instruction word (VLIW) architecture [10] developed by Texas Instruments. Unlike academic research, commercial processors focus on both code size and hardware area cost.

The main idea of this instruction compression is to eliminate NOPs. The mechanism has to be coordinated with the instruction format: the last bit of each instruction, called the p-bit, indicates whether or not the instruction is the end of its instruction packet. The cost is one bit per instruction, which reduces the maximum number of encodable instructions. Figure 3.2 (a) shows how efficiently it works in the best case.

In this case, the parallelism of the VLIW architecture is completely unused because each instruction packet uses only one function unit; the VLIW processor works like a general-purpose processor. There are seven useless NOPs for every valid instruction in each instruction packet. These NOPs are not stored in the VelociTI instruction packet, so a great deal of code size is saved.

Figure 3.2 (b) shows the worst case, where nothing changes: the new instruction packet format is the same as the original one. In a general program, the code cannot consist entirely of fully parallel fetch packets, so this mechanism saves space in proportion to the number of NOPs.
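The following is a hedged Python sketch of the NOP-elimination idea, assuming the p-bit is the last (least significant) bit of each 32-bit word and that p = 1 means the next instruction belongs to the same packet, as the text describes; the real VelociTI encoding details differ.

    NOP = 0x00000000

    def compress(instruction_packets: list[list[int]]) -> list[int]:
        """Drop NOP padding; set the p-bit (LSB) on every instruction except
        the last of its packet, meaning 'the next one runs with me'."""
        out = []
        for packet in instruction_packets:
            real = [ins for ins in packet if ins != NOP] or [NOP]
            for k, ins in enumerate(real):
                p = 1 if k < len(real) - 1 else 0
                out.append((ins & ~1) | p)
        return out

    def decompress(stream: list[int]) -> list[list[int]]:
        """Rebuild instruction packets by scanning p-bits: p = 0 ends a packet."""
        packets, cur = [], []
        for word in stream:
            cur.append(word & ~1)   # strip the p-bit
            if word & 1 == 0:       # last instruction of this packet
                packets.append(cur)
                cur = []
        return packets

    # Best case of Figure 3.2(a): eight packets of one instruction plus seven
    # NOPs each shrink from 64 stored words to 8.
    packets = [[0x1000 + 2 * i] + [NOP] * 7 for i in range(8)]
    stream = compress(packets)
    assert len(stream) == 8
    assert decompress(stream) == [[0x1000 + 2 * i] for i in range(8)]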

Figure 3.2: (a) VelociTI instruction packet (best case)

[Figure 3.2(a) content: an original packet holding Ins.1 and seven NOPs, and the compressed form in which eight instructions Ins.1..Ins.8 are stored with p-bits 0 0 0 0 0 0 0 0.]


Figure 3.2: (b) VelociTI instruction packet (worst case)

For a two-way VLIW architecture, this method achieves a 50% compression ratio in the ideal case. A different approach expands the instruction set with several 16-bit instructions. These instructions are copies of original instructions with shorter formats, so they can replace the originals whenever an operation does not really need the full 32-bit length. MIPS has a similar scheme called MIPS16. The shorter instruction format makes the processor waste time switching modes. ARC takes a similar approach with a mixed instruction set; its processors do not need to switch modes, since the instruction set is variable-length.

There are still many special ways to deal with VLIW compression [12], and variable-length instructions have become popular in recent years. In [13], the trade-offs between decompression overhead and compression ratio are discussed; using more bits improves the compression ratio only a little. We do not take a complex method, because of the doubled cost (dual-rail encoding scheme) in our asynchronous processor.

[Figure 3.2(b) content: a fully parallel fetch packet Ins.1 || Ins.2 || ... || Ins.8 stored as Ins.1..Ins.8 with p-bits 1 1 1 1 1 1 1 0, occupying one full 8-instruction packet.]


Chapter 4 Design of an Asynchronous VLIW Processor

In this chapter, we first introduce our asynchronous VLIW processor. Then we show how we implement the instruction compression mechanism in the processor's first two stages. We also give the instruction set some simple rules to reduce cost. Furthermore, we design a special approach to deal with PC-relative instructions; this branch handling method improves efficiency by about 50% compared with inserting NOPs.

4.1 Processor Architecture

This asynchronous processor is a two-way VLIW architecture. The handshaking protocol is four-phase signaling with dual-rail encoding, so we adopt the Muller pipeline expounded in Chapter 2. There are six pipeline stages, as shown in Figure 4.1.

Figure 4.1: Asynchronous two-way VLIW architecture

[Figure 4.1 content: six pipeline stages (Prefetch, Dispatch, MAC/MEM Decode, MAC/MEM Exe1, MAC/MEM Exe2, WriteBack), with sync/async interfaces to the instruction and data memories and a shared register bank.]


4.2 Compression Implementation

Since we design an asynchronous VLIW processor, we also need to consider saving instruction memory. In Chapter 3 we introduced several instruction compression methods; we adopt the VelociTI approach, for three reasons.

First, it is simpler than the others, especially in asynchronous circuits. It is hard to design complex asynchronous circuits at acceptable cost because of the handshaking protocol, especially the four-phase dual-rail protocol; if we adopted complex data compression, it would be very hard to design correct asynchronous circuits without timing constraints. Second, we can tolerate the disadvantage of losing one bit in the instruction format. Finally, the simple way lets the processor operate efficiently, and the complex one is not worth the effort.

Our processor is a 32-bit RISC processor. Each instruction reserves its last bit as the parallel bit. Parallel bit = 1 means the instruction can be executed in parallel with the next instruction in the two function units; otherwise it is executed in just one function unit. Although we eliminate NOP instructions, a NOP is still allowed in an instruction packet. For example, we can encode a NOP instruction packet as either 0x0000000100000000 or 0x00000000; there is no difference between them.
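A small sketch of how the dispatch could expand this encoding back into two-way issue packets; it assumes only what the text states (LSB p-bit, NOP = 0x00000000) and ignores the slot-routing rules of Section 4.2.4.

    NOP = 0x00000000

    def issue_pairs(stream: list[int]) -> list[tuple[int, int]]:
        """Expand a compressed stream (assumed well-formed) into two-way
        issue packets. p-bit (LSB) = 1: this instruction runs in parallel
        with the next; p-bit = 0: it issues alone, padded with a NOP."""
        pairs, i = [], 0
        while i < len(stream):
            if stream[i] & 1:  # parallel pair
                pairs.append((stream[i] & ~1, stream[i + 1] & ~1))
                i += 2
            else:              # single issue, pad with NOP
                pairs.append((stream[i], NOP))
                i += 1
        return pairs

    # The two NOP packet encodings from the text expand identically:
    # 0x0000000100000000 is a NOP with p = 1 followed by a NOP with p = 0.
    assert issue_pairs([0x00000001, 0x00000000]) == [(NOP, NOP)]
    assert issue_pairs([0x00000000]) == [(NOP, NOP)]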

4.2.1 Stage Timing Issues in Asynchronous Circuits

Because there is no clock signal coordinating the stages, the time to pass through a stage is not constant, and adjacent instructions may not be located in corresponding stages. Figure 4.2 shows all possible situations for the instruction sequence SLL, ADD, SUB. Case (a) is the ideal interlaced status, in which instructions are separated by exactly one bubble; cases (b) and (c) show the ADD instruction running fast. In other words, we have no way to know or even predict every stage's status, which must be taken care of when we deal with signals that cross multiple stages.

Figure 4.2: Unpredictable instruction execution time

4.2.2 PF and DP Stage

Instruction decompression requires some bookkeeping information and an instruction buffer. The buffers are made of registers, which raises a read/write timing problem. To ensure correct operation, our main idea is to separate register reads and writes into adjacent stages; we can then guarantee the instruction sequence, since there is at least one bubble between two instructions. We use two stages (Prefetch and Dispatch) to complete the instruction fetch mechanism. If we instead read and write the same register in one stage, we still need to consider the read/write timing, and the possible solution is a latch, which amounts to much the same thing as using two stages.


4.2.3 FIFO Approach

In a synchronous processor, it is common to use a FIFO to store tags and buffered instructions, as shown in figure 4.3. This mechanism needs at least three steps: load from memory, shift the FIFO, and send output. Figure 4.3 shows a 2-way VLIW example using 4 buffer slots to store instructions. Buf1 and buf2 load from memory and shift left into buf3 and buf4, where the instructions wait to be executed. There are two cases: if the instruction packet in buf3 and buf4 can be executed in parallel, it is sent to the dispatch unit and buf1 and buf2 move into buf3 and buf4; if the packet in buf3 and buf4 must be executed alone, one instruction is sent to the dispatch unit and the FIFO is rearranged.

Figure 4.3: FIFO approach

Unfortunately, some cases stall the pipeline, as illustrated in figure 4.4. Buf1 and buf2 always fetch 2 instructions from memory when both slots are empty. In (b), no load from memory happens, so the later step (d) is not ready to send output and the pipeline stalls. In this example we assumed the sequence: send output to clear slots first, then shift the FIFO, then load new instructions. If we exchange the sequence to 1. shift the FIFO, 2. load new instructions, 3. send output, the pipeline still stalls, as shown in figure 4.5.



Figure 4.4: Pipeline Stall - 1

Figure 4.5: Pipeline Stall - 2


The three steps are necessary, and however the sequence is changed, the stall still happens. The reason is that the FIFO is not deep enough, but increasing the FIFO depth means increasing the overhead. Furthermore, such a FIFO mechanism is not really simple, and it has yet another stall situation. Moving one or more slots can solve the problem [14], but the FIFO approach in [14] carries a lot of overhead, with MUXes everywhere. We therefore propose the index buffer approach to reduce cost and make it fast.

4.2.4 Index Buffer Approach

Instead of the FIFO approach, we use 3 buffer slots to store instructions and two stages to implement instruction decompression. The main idea is that if we can choose the source of the output instructions, we do not need to move the buffers at all, which saves both area cost and moving overhead. In figure 4.6, there are 2 stages before the decode stage. In the following, when we say something is x bits long, we mean x data bits carried on 2x wires, due to the dual-rail implementation.

The first stage (called the PF stage) fetches instructions (64 bits) from the instruction memory and puts them into the proper buffers according to the valid bits (3 bits), which indicate whether each buffer is empty. The second stage (called the DP stage) dispatches instructions in the expected sequence.

We load instructions from memory into the corresponding buffers in the sequence buf0 -> buf1 -> buf2 -> buf0 ..., and a 2-bit index indicates the position. In fact, putting instructions into the proper buffers does not reference the index; the determination can be made from the valid bits alone (see the sketch below). We just read the index in the first stage and pass it to the next stage through a latch; the index bits are referenced by the DP stage.
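A sketch of how the PF stage could pick its two target slots from the valid bits alone, as claimed above; it relies on the observation that the occupied slots always form a contiguous cyclic window, so the end of that window determines where the next instructions go.

    def pf_fill_targets(valid: list[bool]) -> tuple[int, int] | None:
        """Pick the two slots the PF stage will fill, from valid bits alone.
        A fetch happens only when at least two of the three slots are free."""
        if sum(valid) >= 2:
            return None  # not enough room, skip the fetch
        if not any(valid):
            return 0, 1  # all empty: restart from buf0
        # The occupied slots form a contiguous cyclic window; fill the two
        # slots after the window's end.
        end = next(i for i in range(3) if valid[i] and not valid[(i + 1) % 3])
        return (end + 1) % 3, (end + 2) % 3

    assert pf_fill_targets([False, False, False]) == (0, 1)
    assert pf_fill_targets([False, True, False]) == (2, 0)  # wraps past buf2
    assert pf_fill_targets([True, True, False]) is None     # only one slot free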


Figure 4.6: Index buffer approach

Figure 4.7: Initial status

We give an example of how the mechanism works, illustrated in figures 4.7 and 4.8. In figure 4.7, instruction 6 and instruction 7 can be executed in parallel and the others cannot. What we do is dispatch the expected instruction sequence before the decode stage. The buffers are written by the PF stage and read by the DP stage; the tag bits are written by the DP stage and read by the PF stage. We assume that valid tokens and empty tokens are interlaced between stages, as in figure 4.8(a). To illustrate the example, we name each token-handling step.

Figure 4.8: (a) Continuous single-handed instructions – part I


Figure 4.8: (b) Continuous single-handed instructions – part II

In step 1, the main job is to load from instruction memory at the pc address. The valid bits are read to choose the target positions. Finally, the valid bits, index bits and pc are put into the PF latch. In step 2, the value in the PF latch passes to the next stage. The index is 0, so we load instructions from buf0 and buf1; we then check the parallel bit of the loaded instruction (buf0), which is 0, so we do not take the value of buf1 but insert a NOP (0x00000000) instead. Because we take one instruction from the buffer, the index is incremented by one and written into the index register. Since the original valid value is 3'b000, we know that new instructions were loaded in the previous step, changing it to 3'b011 (v2v1v0); because we take one instruction from the buffer in this step, we change it to 3'b010 and write it back to the valid register. In step 4, only one buffer slot is empty after taking one instruction, so the pc value is not changed, and therefore the buffer contents will not change in the next step.


In figure 4.8 (b), step 5 neither loads data from the instruction memory nor puts anything into the buffers, since the buffers do not have enough space. Therefore, this step passes faster than step 1 or step 3.

In figure 4.8 (c), the difference is in step 12: instruction 6 and instruction 7 can be executed in parallel. Because the parallel bit of instruction 6 is 1, we also load the next instruction ((index + 1) % 3 = 2'b00). To reduce unnecessary overhead, instruction 6 and instruction 7 can only be placed in sequence as in the figure; the result is no different between Ins.6 | Ins.7 and Ins.7 | Ins.6. In this example, a single-handed instruction is always placed on the left side. However, whether or not the instructions are single-handed, 3 buffer slots are enough to store them without accessing memory twice.

As mentioned before, the dispatch stage writes new valid bits and index bits into the registers, and these values are referenced by the next step; the registers are renewed in two steps. Table 4.1 shows the valid-bits changing rule. For example, if the current instruction is single-handed (p-bit = 0) and the valid bits (from the PF latch) are 3'b100, the new valid bits will be 3'b011: we know that two buffers were loaded in the previous step, and we take one instruction from the buffer. Note that v2v1v0 can never be 3'b111, because the dispatch consumes at least one instruction. Referencing the index bits here is unnecessary. Table 4.2 shows the index-bits changing rule: index = (index + 1 + p) % 3; however, when the valid bits are 3'b000, the index is reset to 2'b00. The purpose of the reset is to maintain the same rule, especially when a branch happens: the mechanism restarts from buf0 when the slots are all empty. Furthermore, the L-bits are modified by the exe1 stage.


Table 4.1: Valid bits changing rule

Table 4.2: Index bits changing rule
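A Python sketch of the bookkeeping just tabulated, reusing pf_fill_targets from the PF sketch above; dp_update applies the Table 4.1/4.2 rules. The function names are ours, and the valid bits are modeled as a Python list (v0, v1, v2).

    def dp_update(valid: list[bool], index: int, p: int) -> tuple[list[bool], int]:
        """Apply the Table 4.1/4.2 bookkeeping rules for one DP step.
        `valid` is (v0, v1, v2) as latched by PF, `index` points at the next
        instruction, and p is that instruction's parallel bit."""
        v = list(valid)
        targets = pf_fill_targets(v)       # what PF loaded in the previous step
        if targets is not None:
            for t in targets:
                v[t] = True
        for k in range(1 + p):             # dispatch consumes 1 or 2 instructions
            v[(index + k) % 3] = False
        new_index = (index + 1 + p) % 3    # Table 4.2: index = (index + 1 + p) % 3
        if not any(v):
            new_index = 0                  # reset to buf0, e.g. after a branch
        return v, new_index

    # The worked example from the text: p = 0 and valid = 3'b100 (only v2 set)
    # gives new valid bits 3'b011; index = 2 (the oldest slot) is assumed here.
    assert dp_update([False, False, True], index=2, p=0) == ([True, True, False], 0)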

In fact, we have instruction rules that decide the function unit at compile time. Not all instructions can be executed in either function unit, but almost all can. As the opcodes in table 4.3 show, the memory-related instructions can only be executed in the MEM function unit: if these instructions could be executed in parallel, the number of memory read/write ports would grow and we would have to handle the extra dependency problems. This rule makes the circuit much simpler.

When the instruction is single-handed, we only compare its first and second bits to select the function-unit entry. When these bits are 2'b11, the instruction is put on the right side, for the MEM function unit; the others are put on the left side, for the MAC function unit, even though they could also be executed on the right side. There is another benefit: single-handed instructions can be executed in either the MAC or the MEM function unit, and most of them run faster in the MAC function unit thanks to the lower gate delay of its data path. Furthermore, we show the possible instruction input types in table 4.4.

Table 4.3: The op codes of the 2-way VLIW asynchronous processor

    31-30 \ 29-27   000    001     010     011     100   101    110    111
    00              NOP    R type  ADDI    ADDIU   SUBI  XORI   WAIT   INTERRUPT
    01              J      SLTI    RETURN  CALL    BEQ   BNEQ   REPB
    10              ANDI   ORI     ACCLDH  ACCLDL
    11              LW     LH      LL      LDW     SW    SH     SL     SDW

Table 4.4: Possible instruction input combinations

    Input from buffers    Dispatch output (MAC/MEM)
    Fn1                   Fn1 | NOP
    Fn1 | Fn1'            Fn1 | Fn1'
    Fn1' | Fn1            Fn1' | Fn1
    Fn1 | Fn2             Fn1 | Fn2
    Fn2                   NOP | Fn2

In table 4.4, Fn1 means the instruction can be executed in function unit 1 or in both function units, while Fn2 means it can only be executed in function unit 2.
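A sketch of the routing rule, assuming illustrative 32-bit encodings in which bits 31-30 = 2'b11 mark the memory class from table 4.3; the compiler is assumed to never pair two memory instructions, per the rule above.

    NOP = 0x00000000

    def is_mem_only(ins: int) -> bool:
        """Bits 31-30 = 2'b11 marks row 11 of Table 4.3 (memory instructions),
        which may only issue to the MEM function unit."""
        return (ins >> 30) & 0b11 == 0b11

    def dispatch(first: int, second: int | None) -> tuple[int, int]:
        """Route one or two instructions to the (MAC, MEM) slots per Table 4.4.
        Two memory instructions are assumed never to be paired."""
        if second is None:  # single-handed: prefer the faster MAC side
            return (NOP, first) if is_mem_only(first) else (first, NOP)
        if is_mem_only(first):  # a memory op must take the MEM (right) slot
            return second, first
        return first, second

    ADD, LW = 0x0100_0000, 0xC000_0000  # illustrative encodings only
    assert dispatch(ADD, None) == (ADD, NOP)  # Fn1       -> Fn1 | NOP
    assert dispatch(LW, None) == (NOP, LW)    # Fn2       -> NOP | Fn2
    assert dispatch(ADD, LW) == (ADD, LW)     # Fn1 | Fn2 -> Fn1 | Fn2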
