Organization of the Thesis - Building Partitioned Dictionaries Based on Benefit Evaluation for VLIW Code Compression

Chapter 1 Introduction

1.4 Organization of the Thesis

This thesis is organized as follows. Chapter 2 discusses the background of VLIW and dictionary-based code compression. Chapter 3 describes our code compression algorithm, which builds on the compression methods reviewed in Chapter 2. Chapter 4 describes the experimental environment and benchmark suite, and presents our experimental results and their analysis. Chapter 5 summarizes our conclusions and future work.

Chapter 2 Background

This chapter introduces dictionary-based code compression. We first explain the dictionary-based approach, which achieves compression by replacing the most frequent sequences with shorter codewords. We then review the partitioned dictionary method, which works well for VLIW code compression: it divides the dictionary into two, an opcode dictionary and an operand dictionary. Finally, to take advantage of repeated VLIW line sequences, we adopt dictionary entries that can be accessed sequentially.

2.1 VLIW code compression

Variable-to-fixed (V2F) VLIW code compression [6] is a VLIW code compression scheme that uses variable-to-fixed length coding. It also proposes an instruction bus encoding scheme that can effectively reduce bus power consumption. The reported compression ratios using memoryless V2F coding are around 72.7% for IA-64 and 82.5% for TMS320C6x. Markov V2F coding achieves better compression ratios, up to 56% for IA-64 and 70% for TMS320C6x. However, the compressed size of a VLIW line still varies under V2F coding, so decompressing one VLIW line per cycle cannot be guaranteed. Speeding up decompression requires more ROM and more decompression logic.

Modern VLIW ISAs adopt a variable-length execution set (VLES) scheme to achieve high code density. However, the fetch bundle size on IA-64 and the TI series is fixed, which means dictionary-based methods remain applicable to modern VLIW code compression. Conventional dictionary-based code compression schemes for RISC machines still need some changes to suit VLIW compression, such as adding an extra dictionary output port.

In terms of both compression ratio and decompression architecture, V2F is not in an advantageous position. A fixed-length fetch bundle becomes variable in length after V2F coding. If V2F took the instruction group as its compression unit, bundles would no longer be needed, but the whole instruction set would have to be redesigned. Our research therefore applies a dictionary-based scheme to code compression.

2.2 Dictionary-based Method

Dictionary-based compression methods look for common sequences of characters and replace each with a single codeword. The codeword is the basic element of compressed code. It contains a tag (indicating whether the code is compressed or uncompressed), an index (indicating which dictionary entry is queried), and other fields. Increasing how often each entry is queried is the most important goal. Not all of the program is compressed: code that appears only once does not need to be compressed.

The reduction of code size is achieved if the code sequence in the dictionary appears more than once and can be replaced by a codeword that is smaller than the size of this code sequence.

[Figure: original program on the left, with repeated lines such as "Addi r1, r6, 1 Subi ..." and "Addi r7, r9, 4 Subi ..."; dictionary on the right.]

Figure 2-1: Dictionary-based compression example.

In Figure 2-1, the left part is the program before compression. If the dictionary on the right contains a bit pattern, the matching code is replaced by a codeword that holds the dictionary entry index. Code shown as (…..) cannot be found in the dictionary. Such code is not compressed: it stays in the program in its original form, or with a tag added to mark it as uncompressed.

Dictionary decompression uses a codeword as an index into the dictionary table and inserts the dictionary entry into the decompressed code sequence. If codewords are aligned with machine words, the dictionary lookup is a constant-time operation. Sometimes, to gain further compression, the index is encoded with Huffman coding or MPEG-2 VLC coding, which produces a variable-length index.
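The compression and decompression steps described above can be sketched as follows. This is a minimal illustration, not the implementation used in this thesis: the function names, the tag values "C" (compressed) and "U" (uncompressed), and the sample program are all assumptions made for the example, and instructions are modeled as strings rather than bit patterns.

```python
from collections import Counter

def build_dictionary(lines, max_entries):
    """Keep the most frequent lines that occur more than once (hypothetical policy)."""
    counts = Counter(lines)
    return [line for line, n in counts.most_common(max_entries) if n > 1]

def compress(lines, dictionary):
    """Replace lines found in the dictionary with ("C", index); keep others as ("U", line)."""
    index_of = {line: i for i, line in enumerate(dictionary)}
    return [("C", index_of[l]) if l in index_of else ("U", l) for l in lines]

def decompress(codewords, dictionary):
    """Use each compressed codeword as an index into the dictionary table."""
    return [dictionary[payload] if tag == "C" else payload for tag, payload in codewords]

program = ["addi r1,r6,1", "addi r7,r9,4", "addi r1,r6,1", "sub r4,r5,r6", "addi r7,r9,4"]
dictionary = build_dictionary(program, max_entries=4)
packed = compress(program, dictionary)
restored = decompress(packed, dictionary)
```

In this sketch the line that appears only once stays in the program with the uncompressed tag, and decompression is a single table lookup per codeword.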

The general design of a compressed-program processor is given in Figure 2-2.

[Figure labels: compressed instruction; dictionary index logic (converts codeword to dictionary offset and length); offset and length; uncompressed instruction.]

Figure 2-2 Compressed program processor

Lefurgy et al. [8] proposed a dictionary-based compression method that stores frequently occurring sequences of whole 32-bit instructions in the dictionary and replaces each occurrence with a shorter (fixed- or variable-length) codeword. Average compression ratios of 61%, 66%, and 74% were reported for the PowerPC, ARM, and i386 processors respectively.

Wolfe et al. proposed a Huffman-encoding compression method in Compressed Code RISC Processor (CCRP). Each 32-byte cache line is compressed into smaller aligned bytes or words.

The size of the compressed program has the following parts:

1. Compressed size: After compression, most of the original program is replaced by codewords, but not all code can be compressed. The compressed size therefore includes the codeword size and the size of the uncompressed code.

2. Dictionary size: The dictionary needed to decompress the code must also be stored.

From the above, the evaluation metric is:

Compression ratio = (Compressed Size + Dictionary Size) / Original Size

Compressed Size = Size of code that could not be compressed + Codeword size
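As a small worked instance of this metric, the sketch below evaluates the ratio for made-up sizes. The function name, the parameter names, and all the numbers are assumptions chosen only for illustration; any tag bits on uncompressed code are ignored.

```python
def compression_ratio(original_bits, codeword_bits, uncompressed_bits, dictionary_bits):
    """Compression ratio = (compressed size + dictionary size) / original size."""
    compressed_bits = uncompressed_bits + codeword_bits
    return (compressed_bits + dictionary_bits) / original_bits

# Hypothetical program: 10 lines of 32 bits each.
# 8 lines are replaced by 8-bit codewords, 2 lines stay uncompressed,
# and the dictionary holds 3 entries of 32 bits.
ratio = compression_ratio(
    original_bits=10 * 32,
    codeword_bits=8 * 8,
    uncompressed_bits=2 * 32,
    dictionary_bits=3 * 32,
)
```

Here the compressed program plus dictionary takes 224 bits against 320 original bits, a ratio of 0.7. A smaller ratio means better compression, and the dictionary itself counts against the saving.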

2.3 Partitioned Dictionary

"Improving Dictionary-Based Code Compression in VLIW Architectures" (Sang-Joon Nam, 1999) [7] divides one VLIW line into two groups, an opcode group and an operand group. Two indexes are used, one for each dictionary (see Figure 2-4). Frequently used VLIW lines are extracted from the original code and mapped into two dictionaries, an opcode dictionary and an operand dictionary. The reported average code compression ratio is 63%. Within a program, the OP (opcode) part or the OPD (operand) part repeats more often than a whole VLIW line. Other ways of splitting the dictionary may exist, but this paper shows that splitting into OP and OPD works well.

Their algorithm has 2 steps as follows.

1. Building entries of two dictionaries

Building a dictionary that achieves maximum compression is known to be an NP-complete problem. This code compression scheme replaces an instruction word by an opcode sequence and an operand sequence, and limits their total bit width to that of a normal operation. The maximum compression problem is thereby reduced from an NP-complete problem to a simple greedy one.

2. Replacing instruction words with the opcode dictionary index and operand dictionary index

Each occurrence of an opcode or operand entry in the two dictionaries is represented by a fixed-length opcode index and operand index. Specifically, the total bit width of the opcode index and operand index is made equal to that of an uncompressed operation, so that the compressed VLIW line aligns with the cache boundary. This can give worse compression than a variable-length index encoding, but it keeps the instruction fetch and decode mechanism simple and fast. In general, variable-length encodings such as Huffman coding are expensive to decode.

[Figure labels: operand dictionary; entry (size: OPD).]

Figure 2-4 Compression using operand factorization

In Figure 2-4, one VLIW line is replaced by one codeword that holds two indexes.

In partitioned dictionary code compression, each codeword needs one extra index. However, two completely identical VLIW lines rarely occur in a program, whereas the opcode parts or the operand parts of different lines are often similar. Splitting the lines therefore gives more chances to reuse dictionary entries.
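The effect of partitioning can be illustrated with a short sketch. It is only an illustration under simplifying assumptions: operations are written as strings of the form "opcode operands", every line is placed in the dictionaries regardless of frequency, and the names split_line, encode, and decode are hypothetical.

```python
def split_line(line):
    """Split a VLIW line into its opcode part (OP) and operand part (OPD)."""
    ops, opds = [], []
    for operation in line:
        op, _, opd = operation.partition(" ")
        ops.append(op)
        opds.append(opd)
    return tuple(ops), tuple(opds)

def encode(lines):
    """Build OP and OPD dictionaries and emit one (op_index, opd_index) codeword per line."""
    op_dict, opd_dict, codewords = [], [], []
    for line in lines:
        op, opd = split_line(line)
        if op not in op_dict:
            op_dict.append(op)
        if opd not in opd_dict:
            opd_dict.append(opd)
        codewords.append((op_dict.index(op), opd_dict.index(opd)))
    return op_dict, opd_dict, codewords

def decode(codewords, op_dict, opd_dict):
    """Rebuild each VLIW line from its two dictionary entries."""
    lines = []
    for op_index, opd_index in codewords:
        pairs = zip(op_dict[op_index], opd_dict[opd_index])
        lines.append(tuple(f"{op} {opd}" for op, opd in pairs))
    return lines

# Three different VLIW lines of two operations each (made-up example).
program = [
    ("addi r1,r6,1", "sub r3,r4,r6"),
    ("addi r7,r9,4", "sub r3,r4,r6"),
    ("subi r1,r6,1", "add r3,r4,r6"),
]
op_dict, opd_dict, codewords = encode(program)
restored = decode(codewords, op_dict, opd_dict)
```

No whole line repeats in this example, so a single dictionary would need three full-line entries. After splitting, two OP entries and two OPD entries are enough, at the cost of carrying two indexes in every codeword.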

Figure 2-5 Instruction fetch path in the proposed code compression scheme for VLIW processor-based systems

A compressed program contains both compressed and uncompressed code. During decompression, the dictionary is queried to decode each codeword, which is a shorter bit string that represents compressed code. If uncompressed code is fetched, it takes the bypass and enters the processor directly.

2.3 Codeword with length slot

In the figure, a length slot is added to the codeword. We use the dictionary to hold common sequences of multiple lengths: a sequence can start at any dictionary entry and end at any following entry. Originally, a codeword is used as an index to a single dictionary entry. With the length slot, a codeword can access an entry and the entries that follow it sequentially. A codeword thus takes two arguments: index and length.

During decompression, the decompressor jumps to the dictionary entry indicated by the index and fetches the number of opcode or operand entries given by the length. At each following cycle, the decompressor uses the length slot to decide whether to increment the index automatically or to move on to the next codeword.
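The following sketch models this behavior in software. It is an assumption-laden illustration, not the decompressor hardware: the name expand is hypothetical, a codeword is modeled as an (index, length) pair, and each dictionary entry is a single letter standing for the OP or OPD part of one VLIW line.

```python
def expand(codewords, dictionary):
    """Expand (index, length) codewords by reading consecutive dictionary entries."""
    output = []
    for index, length in codewords:
        # One entry is produced per step; the index increases automatically
        # until `length` entries have been read, then the next codeword is taken.
        for offset in range(length):
            output.append(dictionary[index + offset])
    return output

dictionary = ["A", "B", "C", "D", "F"]
# Sequence ABCD (index 0, length 4) followed by sequence CDF (index 2, length 3).
decoded = expand([(0, 4), (2, 3)], dictionary)
```

Both sequences share the entries C and D, so five dictionary entries serve seven decoded lines with only two codewords.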

A VLIW compiler usually generates RISC code first. Then, following the parallelization rules, it combines instruction sequences into sequences of VLIW lines. Some VLIW code optimizations, such as loop unrolling, repeat VLIW line sequences to improve performance. As a result, VLIW line sequences repeat often. To exploit this, we use dictionary entries that can be accessed sequentially. Several consecutive VLIW lines can then be compressed into a single codeword, which reduces the total number of codewords.

[Figure: a dictionary with entries #001 to #004, each holding one VLIW line (for example "NOP Addi r1,r1,1 NOP NOP"), addressed by the codeword index.]

Figure 2-6 Dictionary entries with sequential access ability

[Figure: sequence ABCD (length 4) and sequence CDF (length 3) refer to entries of the same dictionary.]

Figure 2-7 Dictionary entries can be queried by different sequences

A sequence can start at any dictionary entry and end at any later entry. Dictionary entries are therefore used efficiently, and duplicated entries are reduced as far as possible. Small sequences can also be combined into a larger one, which is more beneficial.

2.4 Summary

Dictionary-based compression is an instruction-format-dependent method that exploits the repetition and regularity of code efficiently. To increase how often dictionary entries are queried, we improve on dictionary-based code compression in two ways. First, we adopt a partitioned dictionary to increase the repetition of dictionary entries: entries are reused more often, but an extra index is needed. Second, we give dictionary entries sequential access ability to take advantage of frequently used VLIW sequences: several VLIW lines can be represented by one codeword and dictionary entries are used efficiently, but a length slot is needed. Most research takes only partial advantage of these properties, using a greedy algorithm to find the most frequent sequences and placing them into the dictionary one after another. We aim to exploit them more fully.

Chapter 3 Design

We mainly present a heuristic algorithm for building dictionaries. The goal is to build dictionaries that yield a good compression ratio. Using a benefit equation, we allow only the most beneficial sequences into the dictionary. By combining sequences, a dictionary entry can be covered by different sequences, which reduces duplicated entries as far as possible. Which sequences in the program are compressed is decided while the dictionaries are built.

Given Environment

We use separate OP and OPD dictionaries, whose entries are queried more often than those of a single dictionary of the same size. Two identical VLIW lines are rarely found in a program, but separating a VLIW line into OP and OPD parts makes good use of dictionary entries. Compared with one dictionary, two dictionaries cost more space and logic, but the gain in compression ratio is worth it.

We also add a length slot to the codeword, so that multiple VLIW lines can be replaced by one codeword. This reduces the number of codewords and lets dictionary entries be reused by different sequences.

For efficient decoding, one OP dictionary entry holds the full OP part of one VLIW line, and likewise for OPD. A decompressor is added to decode codewords, and some performance constraints must be kept. We require that the decompressor can decode at least one codeword per cycle. One dictionary port outputs one dictionary entry per cycle. If one VLIW line were split into two parts and both parts were placed in the same dictionary, two dictionary output ports would be needed. We make no assumption about the dictionary hardware: the dictionary may be placed in the processor, in memory, or in special hardware, so we do not know how many output ports it can supply. To be conservative, we assume that one OP dictionary entry holds the entire OP part of one VLIW line. A sequence of length two therefore contains the full OP or OPD parts of two VLIW lines.

Basic idea

The key idea of the building algorithm is to select the most frequent sequences for insertion into the dictionary while reducing redundancy as far as possible.

Our goal is to build an efficient dictionary, one that uses each dictionary entry as fully as possible. We use a partitioned dictionary, which has worked well in many compression methods. Because VLIW line sequences repeat frequently, we let sequences start at any dictionary entry and end at any following entry.

Determining an optimal dictionary for a given text is known to be NP-complete in the size of the text. However, many heuristics have been developed that find near-optimal solutions, and most are quite similar. Our modified algorithm proceeds as follows.

We have OP and OPD dictionaries, and the compression ratio is affected by both. How the program is compressed is decided while the dictionaries are built:

1. We build one of the two dictionaries first and mark the parts of the program that can be found in it.

2. Based on the marked program, we build the other dictionary.

3. After both dictionaries are built, we compress the program according to the marks. Some unmarked code may still happen to be compressible.

3.1 Build the dictionaries

As shown in Figure 3-1, we separate all VLIW line sequences into opcode sequences and operand sequences, and build one of the OP or OPD dictionaries first. The pressure on the OPD dictionary is always greater than on the OP dictionary. On the ADSP-21535 DSP, about 6.7% to 10.5% of the OP bit patterns dominate all OPs in a program, while about 15.2% to 21.7% of the OPD bit patterns dominate all OPDs. After one dictionary is built, the sequences in the program that it can cover are marked, and the other dictionary is built from the marked program. Which dictionary should be built first is discussed in Chapter 4.

Suppose we build the OP dictionary first. We need an equation to judge whether a sequence is beneficial. Once benefit is defined, sequences are chosen one by one in order of benefit.

[Figure labels: opcode sequence; operand sequence.]

Figure 3-1 Separate VLIW line sequence.

[Figure: sequences are drawn as sticks, each with a different color combination and value (for example 32, 26, 23.5); sticks are chosen to build the dictionary.]

Figure 3-2: After defining benefit, sequences are chosen in benefit order one by one.

3.1.1 How to judge a sequence’s benefit

We use a benefit function to judge whether a sequence should be inserted into the dictionary. The benefit of a sequence is the reduction in memory requirement, per dictionary entry used to represent the sequence, that results from inserting it into the dictionary.

First, we calculate how much memory the sequence can save. For an OP sequence, we consider only the OP part of the program. The memory saved by the sequence is the size of the OP code covered by the sequence, minus the size of the index and offset that represent this code after compression, minus the size of the sequence itself. When the second dictionary is built, only the marked part of the program is counted. The OPD benefit is

\[
\text{Benefit}_{OPD} = \sum_{\text{occurrences}} \left( \text{OPD inquired sequence size} - \text{OPD codeword size} \right) - \text{Dictionary cost}
\]

Next, to compare sequences of different lengths, sequence length must be taken into account. We therefore divide by the sequence length to obtain the memory saved per unit of sequence length. In our dictionary, however, an entry can be covered by different sequences. If part of a sequence already exists in the dictionary and can be reused, the denominator is only the extra dictionary size the sequence needs. The benefit ratio is

\[
\text{Benefit ratio} = \frac{\sum_{\text{occurrences}} \left( \text{inquired sequence size} - \text{codeword size} \right) - \text{Dictionary cost}}{\text{Extra dictionary size needed}}
\]

So if an OP sequence S already exists in the dictionary, the benefit of a sequence S' = (size of OP code covered by S' - size of the index and offset that represent this code after compression - size of the sequence) / extra dictionary size needed.

[Figure annotations: program; program after compressing; the effect of the previous dictionary.]

Figure 3-3: Benefit equation.

Reducible memory requirement: If sequence S is in the dictionary, the part of the program that can be found through its dictionary entries is reducible. If S is ABC, the reducible program size is the total size of all occurrences of A, B, and C in the OP part of the program. This value is the maximum memory reduction the sequence can achieve.

Codeword size: The total size of the codewords needed to represent the reducible program. If three dictionary entries A, B, and C exist in the dictionary but are not placed sequentially, three codewords are needed to represent the sequence ABC in the program. If they are placed sequentially, one codeword is enough. The number of occurrences of a sequence and of its sub-sequences therefore affects the total number of codewords.

Sequence length: The number of VLIW lines the sequence represents. A long sequence has more chances to replace code in the program, but it takes more dictionary space. To account for this, the benefit is divided by the sequence length.

A benefit example follows. Let f(x) be the number of occurrences in the program of the length-x sub-sequences of the sequence. If the opcode sequence is ABCD, f(3) is the number of occurrences of ABC and BCD in the program. Assume f(4) = 2, f(3) = 2, f(2) = 0, f(1) = 0, and length = 4. Then (reducible program size - codeword size) = opcode size of one line * (f(4)*4 + f(3)*3 + f(2)*2 + f(1)*1) - size of one codeword * (f(4) + f(3) + f(2) + f(1)). With these values this is 14 * (opcode size of one line) - 4 * (size of one codeword). Dividing this value by the sequence length gives the benefit of sequence ABCD.
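The arithmetic of this example can be checked with a few lines of code. This is a sketch under stated assumptions: the function name sequence_benefit and its parameters are hypothetical, the sizes of 32 bits per opcode line and 16 bits per codeword are made-up values, and the optional dictionary cost term follows the benefit equation of Section 3.1.1 rather than the simplified example above.

```python
def sequence_benefit(freq, line_bits, codeword_bits, extra_entries, dictionary_cost_bits=0):
    """Benefit of a sequence, where freq maps sub-sequence length x to f(x).

    extra_entries is the number of dictionary entries the sequence adds;
    it equals the sequence length when nothing can be reused.
    """
    covered_lines = sum(length * count for length, count in freq.items())
    codewords = sum(freq.values())
    saved_bits = line_bits * covered_lines - codeword_bits * codewords - dictionary_cost_bits
    return saved_bits / extra_entries

# f(4) = 2, f(3) = 2, f(2) = 0, f(1) = 0 for the sequence ABCD.
freq_abcd = {4: 2, 3: 2, 2: 0, 1: 0}

# Simplified example from the text: no dictionary cost, divide by length 4.
benefit_plain = sequence_benefit(freq_abcd, line_bits=32, codeword_bits=16, extra_entries=4)

# Same sequence, also charging the cost of its four 32-bit dictionary entries.
benefit_with_cost = sequence_benefit(
    freq_abcd, line_bits=32, codeword_bits=16, extra_entries=4, dictionary_cost_bits=4 * 32
)
```

With these assumed sizes the covered code is 14 lines and 4 codewords are needed, so the simplified benefit is (448 - 64) / 4 = 96 bits per entry, and 64 bits per entry once the dictionary cost is subtracted.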

3.1.2 Build Dictionary flow

We now have a benefit equation for judging sequences, and define two sets. Candidate set: sequences in the candidate set have a chance to enter the chosen set. Chosen set: sequences in the chosen set will be loaded into the dictionary.

[Figure labels: 1. create candidate set (count benefits and remove impossible sequences); 2.1 choose the most beneficial one.]

Figure 3-4 The flow of choosing the most beneficial sequences

[Figure: flowchart with the following steps.]

1. Count the benefit of all possible sequences and create the candidate set.

2.1 Select the Most Beneficial Sequence (MBS) from the candidate set.

2.2 Insert the MBS into the chosen set. Check the combine relation between the MBS and the sequences in the chosen set, and combine recursively.

3. Is the dictionary size sufficient, or have all needed sequences been chosen? If yes, stop. If no, continue.

4. Remove the MBS and its sub-sequences from the candidate set, and recount the benefit of the affected candidate sequences.

Figure 3-5 Flow of Choose algorithm
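The selection loop in this flow can be sketched as below. The sketch is deliberately simplified and is not the full algorithm: it takes precomputed benefits as input, it skips the combine step and the benefit recount, and it stops as soon as the best candidate no longer fits. The names choose and is_subsequence, the dictionary capacity, and the sample benefit values are assumptions for illustration.

```python
def is_subsequence(small, big):
    """True if `small` occurs as a contiguous run inside `big` (both are tuples)."""
    n = len(small)
    return any(big[i:i + n] == small for i in range(len(big) - n + 1))

def choose(candidate_benefit, max_entries):
    """Greedily move the most beneficial sequence (MBS) into the chosen set."""
    candidates = dict(candidate_benefit)
    chosen, used_entries = [], 0
    while candidates:
        mbs = max(candidates, key=candidates.get)       # step 2.1
        if used_entries + len(mbs) > max_entries:       # step 3: dictionary is full
            break
        chosen.append(mbs)                              # step 2.2 (combining omitted)
        used_entries += len(mbs)
        # Step 4: drop the MBS and its sub-sequences (benefit recount omitted).
        candidates = {s: b for s, b in candidates.items() if not is_subsequence(s, mbs)}
    return chosen

candidate_benefit = {
    ("A", "B", "C"): 32,
    ("B", "C"): 26,
    ("D",): 23.5,
    ("E", "F"): 10,
}
chosen = choose(candidate_benefit, max_entries=4)
```

In this run ABC is chosen first, which removes its sub-sequence BC from the candidates. D is chosen next, and EF is rejected because only four dictionary entries are available.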

3.1.2.1 Create candidate sequence set

[Figure labels: candidate sequence set; only possible sequences are picked.]

Figure 3-6: Create candidate sequences

If this dictionary is the first to be built, we try all possible combinations. If one dictionary has already been built, the candidate set of the other considers only the part of the program marked by the previous dictionary. The purpose of the candidate set is to reduce computing time. We remove sequences that need not be considered, such as those whose benefit is smaller than 0. We also limit the maximum candidate sequence length, which makes simulation easier; we test lengths from 1 to 8.
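Creating the candidate set can be sketched as follows. The code is an illustration under assumptions, not the thesis implementation: each letter stands for the OP (or OPD) part of one VLIW line, overlapping occurrences are counted naively, the benefit is a simplified form that charges the full sequence length as dictionary cost, and the bit sizes are made up. The name candidate_set is hypothetical.

```python
from collections import Counter

def candidate_set(program, max_len, line_bits, codeword_bits):
    """Enumerate contiguous sequences up to max_len and keep those with benefit >= 0."""
    counts = Counter()
    for start in range(len(program)):
        for length in range(1, max_len + 1):
            if start + length <= len(program):
                counts[tuple(program[start:start + length])] += 1

    candidates = {}
    for sequence, count in counts.items():
        length = len(sequence)
        saved = count * length * line_bits - count * codeword_bits - length * line_bits
        benefit = saved / length
        if benefit >= 0:  # sequences with negative benefit are removed
            candidates[sequence] = benefit
    return candidates

program = list("ABABC")
candidates = candidate_set(program, max_len=3, line_bits=32, codeword_bits=8)
```

Only A, B, and AB survive here, because every sequence that occurs once costs more dictionary space than it saves. Limiting max_len keeps the number of enumerated sequences proportional to the program length.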

3.1.2.2.1 Choose the most beneficial one into chosen set

[Figure labels: candidate sequence set; 2.1 choose the most beneficial one; chosen set; example value 15/4.]

Figure 3-7: Choose a beneficial one from candidate set

3.1.2.2.2 Combine relation

When a sequence is inserted into the chosen set, repeated bit patterns become a problem. To avoid them, we combine sequences, which reduces repetition as far as possible. For example, in the dictionary the sequences ABC and BCD can be stored as the single sequence ABCD (size = 4) instead of separately as ABC and BCD (size = 6).

[Figure labels: the new sequence; a sequence in the chosen set.]

Figure 3-8: What is combine relation

When a new chosen sequence is created, it is checked for a combine relation with the other sequences in the chosen set, repeatedly, until no further combination is possible.

[Figure labels: chosen set A + chosen set B.]

Figure 3-9 Recursive combine
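The combine relation and its repeated application can be sketched as below. This is one possible reading, offered as an assumption: two sequences are taken to be combinable when one contains the other or when the end of one overlaps the start of the other by at least one entry. Sequences are modeled as strings with one letter per dictionary entry, and the names combine and insert_and_combine are hypothetical.

```python
def combine(a, b):
    """Return the merged sequence if a and b overlap or contain each other, else None."""
    if b in a:
        return a
    if a in b:
        return b
    # Try the longest overlap first so the merged sequence is as short as possible.
    for k in range(min(len(a), len(b)) - 1, 0, -1):
        if a[-k:] == b[:k]:
            return a + b[k:]
        if b[-k:] == a[:k]:
            return b + a[k:]
    return None

def insert_and_combine(chosen, new_sequence):
    """Insert a sequence into the chosen set, combining until nothing more merges."""
    chosen = list(chosen)
    current = new_sequence
    merged = True
    while merged:
        merged = False
        for other in chosen:
            joined = combine(current, other)
            if joined is not None:
                chosen.remove(other)
                current = joined
                merged = True
                break
    chosen.append(current)
    return chosen

example_one = insert_and_combine(["ABC", "EFG"], "BCD")
example_two = insert_and_combine(["ABC", "DEF"], "CD")
```

In the first example BCD merges with ABC into ABCD, saving two entries. In the second, CD first merges with ABC and the result then merges with DEF, which shows why the check has to be repeated after every merge.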

3.1.2.3 Stop chosen set increasing

After combining, we check candidate and chosen sets’ size. If chosen set’s size
