RELATED WORKS - Instruction Compression Design for an Asynchronous Two-Way VLIW Processor

A RISC VLIW processor has a fixed instruction length, and most of the time its instruction packets are not fully utilized. As a result, many useless NOP instructions occupy space in each instruction packet, yet they must still be stored in instruction memory, which wastes space. Moreover, reducing the program code size also reduces the number of transfers between instruction memory and the CPU, so a program can complete more quickly. For these reasons, numerous studies have addressed this problem, which is called instruction compression. We briefly introduce some general methods below.

3.1 Common Instruction Compression

Although instruction compression reduces wasted memory, it increases hardware cost. It is therefore a trade-off between memory space and hardware cost. Academic research even employs costly data compression techniques, whereas commercial processors usually take simpler approaches.

3.1.1 Data Encoding Scheme

Because a program is itself binary code (machine code), some researchers apply data encoding schemes to instruction compression. The basic idea is to replace the original code with fewer bits. The compression ratio depends on how often code patterns repeat. Naturally, a table that stores the mapping between original and encoded patterns is required. LZW [8] and Huffman coding are two common data compression methods, and the instruction compression in many related papers is based on them [9].

Figure 3.1: (a) LZW compression algorithm

    STRING = get input character
    WHILE there are still input characters DO
        CHARACTER = get input character
        IF STRING + CHARACTER is in the string table THEN
            STRING = STRING + CHARACTER
        ELSE
            Output the code for STRING
            Add STRING + CHARACTER to the string table
            STRING = CHARACTER
        END of IF
    END of WHILE
    Output the code for STRING

Figure 3.1: (b) LZW decompression algorithm

    Read OLD_CODE
    Output OLD_CODE
    CHARACTER = OLD_CODE
    WHILE there are still input characters DO
        Read NEW_CODE
        IF NEW_CODE is not in the translation table THEN
            STRING = get translation of OLD_CODE
            STRING = STRING + CHARACTER
        ELSE
            STRING = get translation of NEW_CODE
        END of IF
        Output STRING
        CHARACTER = first character in STRING
        Add OLD_CODE + CHARACTER to the translation table
        OLD_CODE = NEW_CODE
    END of WHILE

LZW:

Lempel and Ziv proposed a data compression approach in 1977, and Terry Welch refined it in 1984. The LZW algorithm is shown in Figure 3.1 (a) and (b). Like many data compression methods, LZW uses a translation table as its dictionary. However, compression and decompression each build the translation table on their own. Decompression therefore needs only the encoded data, without any table information, because the table is rebuilt automatically. The contents of the translation table are thus dynamic and are added step by step. This is why the LZW algorithm suits instruction compression.
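The two procedures of Figure 3.1 can be sketched in Python as follows. This is a minimal illustration, not the implementation used in the cited papers: the function names, the byte-oriented input and the unbounded table size are all assumptions made for the example.

```python
def lzw_compress(data):
    """Encode bytes as a list of integer codes (codes 0-255 are single bytes)."""
    # Assumption: the string table starts with all 256 single-byte strings.
    table = {bytes([i]): i for i in range(256)}
    if not data:
        return []
    codes = []
    string = data[:1]
    for i in range(1, len(data)):
        character = data[i:i + 1]
        if string + character in table:
            string = string + character
        else:
            codes.append(table[string])
            table[string + character] = len(table)  # next free code
            string = character
    codes.append(table[string])
    return codes


def lzw_decompress(codes):
    """Rebuild the original bytes; the table is recreated on the fly."""
    if not codes:
        return b""
    table = {i: bytes([i]) for i in range(256)}
    old_code = codes[0]
    output = [table[old_code]]
    character = table[old_code]
    for new_code in codes[1:]:
        if new_code not in table:
            # The code was defined by the encoder one step ago.
            string = table[old_code] + character
        else:
            string = table[new_code]
        output.append(string)
        character = string[:1]
        table[len(table)] = table[old_code] + character
        old_code = new_code
    return b"".join(output)


codes = lzw_compress(b"ABABABA")
restored = lzw_decompress(codes)
```

Note that the decoder receives only the code list. It never sees the encoder's table, which is the property the paragraph above describes.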

Huffman:

The Huffman scheme can also be applied to the instruction format. The main idea is to use benchmarks to analyze the instruction mix of a specific application program. For example, to compress the opcode, we assign fewer bits to frequently used instruction types and more bits to rarely used ones.
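As a rough illustration of the opcode example above, the sketch below builds a Huffman code from opcode counts. The opcode names and counts are invented for the example and do not come from any benchmark; `huffman_code` is a hypothetical helper name.

```python
import heapq


def huffman_code(frequencies):
    """Return {symbol: bitstring} for a {symbol: count} mapping."""
    if len(frequencies) == 1:
        return {symbol: "0" for symbol in frequencies}
    # Heap entries are (weight, unique id, {symbol: partial code}).
    # The unique id breaks ties so the dictionaries are never compared.
    heap = [(weight, index, {symbol: ""})
            for index, (symbol, weight) in enumerate(sorted(frequencies.items()))]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        w0, _, codes0 = heapq.heappop(heap)
        w1, _, codes1 = heapq.heappop(heap)
        # Merge the two lightest subtrees, prefixing one bit to each side.
        merged = {s: "0" + c for s, c in codes0.items()}
        merged.update({s: "1" + c for s, c in codes1.items()})
        heapq.heappush(heap, (w0 + w1, next_id, merged))
        next_id += 1
    return heap[0][2]


# Hypothetical opcode counts taken from an imaginary benchmark trace.
opcode_counts = {"add": 8, "load": 4, "store": 2, "mul": 1, "div": 1}
opcode_code = huffman_code(opcode_counts)
total_bits = sum(opcode_counts[op] * len(opcode_code[op]) for op in opcode_counts)
```

With these made-up counts, the frequent `add` opcode receives a 1-bit code and the rare `mul` and `div` opcodes receive 4-bit codes. The 16 opcodes of the trace take 30 bits in total, compared with 48 bits for a fixed 3-bit opcode field.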

Most data compression methods make the instruction length variable. In general, these methods trade compression ratio against area overhead. Because data compression methods are expensive, commercial products aim for a balance. The following approach is a different idea that keeps the instruction length fixed and is based on the characteristics of the VLIW architecture.

3.1.2 Commercial Products

The VelociTI architecture is a very long instruction word (VLIW) architecture [10] developed by Texas Instruments. Unlike academic research, commercial processors focus on both code size and hardware area cost.

The main idea of its instruction compression is to eliminate NOPs. The mechanism has to be coordinated with the instruction format: the last bit of each instruction, called the p-bit, indicates whether or not the instruction is the end of an instruction packet. Reserving this bit reduces the maximum number of instructions that can be encoded. Figure 3.2 (a) shows how efficiently it works in the best case.

In this case, the parallelism of the VLIW architecture is not utilized at all, because each instruction packet uses only one functional unit and the VLIW processor works like a general processor. There are seven useless NOPs for every valid instruction in each instruction packet. These NOPs are not stored in the VelociTI instruction packet, so the code size is compressed considerably.

Figure 3.2 (b) shows the worst case, in which nothing is gained: the new instruction packet format is the same as the original one. In general programs, however, the code almost never consists entirely of fully parallel fetch packets. Basically, the space this mechanism saves is determined by the number of NOPs.
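The NOP elimination described above can be sketched as follows. This is a simplified model rather than the actual VelociTI encoding: instructions are plain strings, the valid instructions of a packet are assumed to come before its NOPs, and the p-bit polarity (0 marks the last instruction of a packet) is inferred from the bit patterns in Figure 3.2. The function names are hypothetical.

```python
NOP = "nop"


def compress_packets(packets):
    """Drop NOPs and tag each kept instruction with a p-bit.

    p-bit 1: the next instruction belongs to the same packet.
    p-bit 0: this instruction is the last one of its packet.
    """
    stream = []
    for packet in packets:
        # An all-NOP packet keeps a single NOP so that it is not lost.
        useful = [ins for ins in packet if ins != NOP] or [NOP]
        for position, ins in enumerate(useful):
            p_bit = 0 if position == len(useful) - 1 else 1
            stream.append((ins, p_bit))
    return stream


def decompress_stream(stream, width):
    """Regroup instructions by p-bit and pad each packet with NOPs."""
    packets, current = [], []
    for ins, p_bit in stream:
        current.append(ins)
        if p_bit == 0:
            packets.append(current + [NOP] * (width - len(current)))
            current = []
    return packets


# Best case: eight packets, each holding one instruction and seven NOPs.
best_case = [[f"Ins.{k}"] + [NOP] * 7 for k in range(1, 9)]
best_stream = compress_packets(best_case)

# Worst case: one fully parallel packet, nothing can be removed.
worst_case = [[f"Ins.{k}" for k in range(1, 9)]]
worst_stream = compress_packets(worst_case)
```

In the best case the 64 stored slots shrink to 8 instructions, all with p-bit 0. In the worst case all 8 instructions remain, with p-bits 1 1 1 1 1 1 1 0.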

Original instruction packet (8-instruction length):

    Ins.1  nop    nop    nop    nop    nop    nop    nop

VelociTI instruction packet:

    Instruction  Ins.1  Ins.2  Ins.3  Ins.4  Ins.5  Ins.6  Ins.7  Ins.8
    P-bit        0      0      0      0      0      0      0      0

Figure 3.2: (a) VelociTI instruction packet (best case)

Figure 3.2: (b) VelociTI instruction packet (worst case)

This method achieves a 50% compression ratio for a two-way VLIW architecture in the ideal case. Another approach extends the original instruction set with a number of 16-bit instructions. These instructions are copies of original instructions but with shorter instruction formats, so they can replace the original instructions whenever an operation does not really need the full 32-bit length. MIPS has a similar scheme called MIPS16. The drawback of a separate short instruction format is that the processor spends time changing modes. ARC takes a similar approach with a mixed instruction set; its processors do not need to change modes because the instruction set has variable-length instructions.

There are still many other special ways to deal with VLIW compression [12], and variable-length instructions have become popular in recent years. The work in [13] discusses the trade-offs between decompression overhead and compression ratio and shows that using more bits improves the compression ratio only slightly. However, we do not adopt a complex method for our asynchronous processor, because the dual-rail encoding scheme doubles its cost.

Worst case of Figure 3.2 (b): fully parallel fetch packet (8-instruction length)

    Ins1 || Ins2 || Ins3 || Ins4 || Ins5 || Ins6 || Ins7 || Ins8

    Instruction  Ins.1  Ins.2  Ins.3  Ins.4  Ins.5  Ins.6  Ins.7  Ins.8
    P-bit        1      1      1      1      1      1      1      0

    Total: 1 instruction packet

Chapter 4 Design for Asynchronous VLIW
