非同步雙道超大指令字組處理器之預解碼迴圈緩衝器設計

(1)

國

立

交

通

大

學

資訊科學與工程研究所

碩

士

論

文

非同步雙道超大指令字組處理器之

預解碼迴圈緩衝器設計

Predecode Loop Buffer Design for Asynchronous two-way VLIW

Processor

研究生：陳建志

(2)

非同步雙道超大指令字組處理器之預解碼迴圈緩衝器設計

Predecode Loop Buffer Design for Asynchronous two-way VLIW Processor

研究生：陳建志 Student：Jian-Jhih Chen

指導教授：陳昌居 Advisor：Chang-Jiu Chen

國立交通大學

資訊科學與工程研究所

碩士論文

A Thesis

Submitted to Institute of Computer Science and Engineering College of Computer Science

National Chiao Tung University in partial Fulfillment of the Requirements

for the Degree of Master

in

Computer Science

September 2010

Hsinchu, Taiwan, Republic of China

(3)

非同步雙道超大指令字組處理器之預解碼迴圈

緩衝器設計

研究生：陳建志

指導教授：陳昌居教授

國立交通大學資訊科學與工程研究所

摘要

現今處理器採用的電路多為同步電路，也就是由時脈驅動的電路。時脈會造成時脈偏差的問題，非同步電路是一種沒有時脈的電路，所以非同步電路設計可以避免時脈產生的時脈偏差問題。迴圈緩衝器設計在現今的數位訊號處理器或者嵌入式處理器中，是一個非常普遍且實用的元件。迴圈緩衝器不僅能減少指令記憶體的存取次數，更能達到增進效能。然而，傳統的迴圈緩衝器中的迴圈偵測機制，卻往往造成一個經常執行的指令被執行頻率不高的指令替換因此本篇論文設計了一個新穎的迴圈偵測機制，不僅能避免執行頻率高的迴圈被替換，更能減少指令解碼單元部分運算。我們將預解碼迴圈緩衝器實現在一款無時脈的非同步雙道超大指令字組處理器。我們使用 Design Compiler 合成，預解碼迴圈緩衝器合成所產生的結果為面積 240000μm2 、延遲 7.41 ns.

(4)

ii

Predecode Loop Buffer Design for Asynchronous

two-way VLIW Processor

Student：Jian-Jhih Chen

Advisor：Dr. Chang-Jiu Chen

Institute of Computer Science and Engineering College of Computer Science

National Chiao Tung University

Abstract

Most modern processors are synchronous circuit that is triggered by clock. Clock would produce the problem that is clock skew. Asynchronous circuit is clockless and the problem of clock skew it can be avoided by handshake protocol. Loop buffer design is a common and useful unit for modern digital signal processors or embedded processors. It is not only could decrease the access times of instruction memory, but could improve performance.

However, traditional loop buffers always let the frequently executed instruction be replaced by another infrequently executed instruction. Thus, this thesis designs a novel loop detection that could avoid a frequently executed instruction be replaced and reduce the operations of instruction fetch and decode.

Predecode Loop Buffer will be implemented on asynchronous two-way VLIW processor. Predecode Loop Buffer is synthesized by Design Compiler. Its area is about

(5)

Acknowledgement

完成這篇論文，首先要感謝我的指導教授 – 陳昌居教授，感謝老師提供良好的研究環境與設備，以及這兩年在研究方面給了很多想法以及建議的鄭緯民博士、張元騰學長以及蔡宏岳學長，最後謝謝李國成，張瑞宏，鄒柏成，三位同學努力不懈的與我一起討論以及研究。

(6)

iv

LIST OF PICTURES

Fig 2-1 Bundled Data Channel ... 8

Fig 2-2 Four-Phase Handshake Protocol ... 8

Fig 2-3 Diagram of Dual-Rail Data Encoding ... 10

Fig 2-4 4-Phase Dual-Rail Handshake Protocol ... 10

Fig 2-5 Transfer Diagram ... 11

Fig 2-6 Muller C-element and its RTL Implementation ... 12

Fig 2-7 Muller C-element and its RTL Implementation with Reset ... 13

Fig 2-8 3-bits Alternative Complete Detection ... 13

Fig 2-9 Four-Phase Bundled Data Pipeline ... 14

Fig 2-10 Four-Phase Bundled Data Pipeline with Function Blocks ... 14

Fig 2-11 Four-Phase Dual-Rail Data Pipeline ... 14

Fig 2-12 Short Backward Branch Instruction ... 17

Fig 2-13 Count Based Scheme to Monitor Sbb Executions ... 17

Fig 2-14 Basic Blocks in Instruction Stream ... 19

Fig 2-15 Processor Pipeline with Decode Filter Cache... 22

Fig 3-1 The Pipeline Architecture of Asynchronous Two Way Processor ... 24

Fig 3-2 C_Latch ... 26

(9)

Fig 3-4 Branch Handling Architecture ... 27

Fig 3-5 The Combination of Valid Token and Empty Token ... 28

Fig 3-6 The Example of Program ... 30

Fig 3-7 The Block Diagram of Instruction Decode Stage ... 31

Fig 3-8 The Architecture of Predecode Loop Buffer ... 32

Fig 3-9 The DeMux and Merge ... 33

Fig 3-10 The Branch Information Table ... 34

Fig 3-11 The Mutual Exclusion ... 35

Fig 3-12 BIT Block Diagram ... 36

Fig 3-13 The Possible Transition of Mode ... 37

Fig 3-14 The Transition Phase and Program Phase ... 39

Fig 3-15 The Block Diagram of Mode Controller ... 40

Fig 3-16 A Frequent Loop in a Program ... 41

Fig 3-17 The Case of Instruction... 43

Fig 3-18 The Hardware Solves the Instruction with Same Address ... 43

Fig 3-19 Miss Recovery Mechanism ... 47

(10)

viii

LIST OF TABLES

Table 2-1 Dual-Rail Encoding Method... 9

Table 2-2 Truth Table of Muller C-element ... 11

Table 2-3 Power Dissipation in StrongARM ... 21

Table 3-1 The Transition Events of Mode ... 38

Table 3-2 The Relation between P-Bit and Change Bit ... 45

Table 3-3 The Relation between Change Bit and Index ... 45

Table 4-1 The Area of Basic Element in our Predecode Loop Buffer ... 48

Table 4-2 The Area Compares with Asynchronous Two Way VLIW Processor ... 48

(11)

Chapter 1 Introduction

In recent years, asynchronous circuit design becomes more and more popular.

Synchronous circuit has a serious problem, so it will become more difficult to design circuit.

The problem is clock skew that may become more and more serious, because the clock

frequency will be too high in future. Asynchronous circuit design is clockless circuit. In other

words, asynchronous circuit does not have clock and it can avoid the clock skew. Because

asynchronous circuit does not have clock, the modules of asynchronous circuit can work only

when the modules are needed to operate. Thus it has the potential for low power consumption.

However, the accessed times of memory would affect the performance of system. If the

accessed times of memory is too much, the performance will be obviously degraded. To

reduce the accessed times of memory, many architecture approaches try to take program

behavior into consideration. A lot of embedded applications are characterized by spending a

large fraction of execution time on program loops [1] [9]. In order to execute the loops, the

same operations will be repeated again and again, including instruction fetch and instruction

decode. In fact, the accessed times of instruction memory are usually more than the accessed

times of data memory. If we can reduce the accessed times of instruction memory, the

performance will may be improved.

Loop buffer is a common component to reduce the instruction memory accessed times,

(12)

2

executed times of loop is usually decided in the run-time, the software-based only deal with

that the executed times of loop is already decided in compiler time. Thus, the hardware-based

loop buffer is better than the software-based loop buffer.

.In this thesis, we propose a loop buffer to keep the more frequent loop and promise the

frequent loop that would not be replaced by other infrequent loops. The control signals of the

instructions are stored in the loop buffer and they can reduce same actions in fetch and decode

stages. We applied it on an asynchronous two-way VLIW processor and implement it with

Verilog HDL (hardware description language).

1.1 Motivation

The power consumption of instruction fetch from memory and decode instruction to

control signal is a critical issue in VLIW processor. Most programs have many loops that

composed of the non-branch instruction and the branch instruction. These loops have the

characteristic of temporal and spatial locality, so we could try to use the characteristic to

reduce unnecessary repeated operations of fetch and decode of loops. Finally, we designed our

loop buffer and implemented it on asynchronous dual-rail two-way VLIW processor.

1.2 The Organization of the Thesis

(13)

processor. The design uses an approach to select basic blocks for placement in the loop buffer.

The rest of the thesis is organized as follows. In chapter 2, the background of asynchronous

circuits and related works of loop buffer will be discussed. In chapter 3, we present the design

and implement of our asynchronous loop buffer. In chapter 4, the simulation result will be

(14)

4

Chapter 2 Related Works

This chapter will give introduction of asynchronous circuits, and discuss basic concept of

basic blocks about loop buffer. We will show the advantages, handshake protocol and data

encoding methods in chapter 2.1. Then we will introduce the Muller pipeline, C-element and

Complete Detection in chapter 2.2. In chapter 2.3, we will descript the two ways of loop

detection to implement the loop buffer: Count-Based Loop Detection [1][2] and Basic Blocks

Loop Detection [3]. We will explain why the problem will be produced by these loop

detections. Finally, we will introduce a method to reduce decode operations by caching

decoded instructions, Decoded Filter Cache [4].

2.1 Asynchronous circuits

Asynchronous circuits design is a circuit design methodology. It is practically different

from synchronous circuit. The major difference in the two ways of circuit design is clock.

Asynchronous circuit design is clockless circuit. In other words, the asynchronous circuit

works only when necessary; thus it has the potential for low power consumption. However, it

also has many advantages. It is quite difficult to design the asynchronous circuit by the two

reasons. First, asynchronous handshake protocol is more complex than synchronous. Second,

it has few CAD tool for asynchronous circuits design than synchronous, and it makes

(15)

2.1.1 Advantages and drawbacks

Comparing with the synchronous circuit design, the asynchronous circuit design has

no global clock and uses the handshake protocol between the individual module to

perform synchronizations and communications. Based on these conceptions,

asynchronous circuits design has many advantages over synchronous counterparts. The

followings are advantages of the asynchronous circuit design:

(1)No clock skew and clock distribution problem: Clock is an important role on

communication between synchronous modules. However, clock would produce a

serious trouble about clock skew. In addition, because systems will become larger and

larger, transferring clock signals will become more and more difficult than now.

Fortunately, because it is clockless, asynchronous circuits design does not need to

consider clock skew and clock distribution problem. Asynchronous circuit design

could deal with these problems by handshake protocols.

(2)Better modularity: As different sub-circuit of synchronous circuits operate at own

clock frequency, it is a difficult job to integrate these sub-circuits into a system.

Because asynchronous circuit design uses handshake protocols to replace the global

clock, designers only need to take care of the handshake protocols on how to

communicate and synchronize between the different modules. Designers do not need

(16)

6

(3)Low power consumption: Because of no clock signals, the asynchronous circuit

design does not provide extra power to generate clock tree. The individual modules

are demand-driven in asynchronous circuits design, but the individual modules in

synchronous circuit are clock driven. This means the modules in asynchronous

systems are active only when they needed, and they do not consume any extra power

for standby. It also means that asynchronous circuits design is no clock tree.

(4)Average-case performance: In synchronous circuits system, the slowest component

would decide the maximum speed of whole system. Even though it is infrequent to

operate, it is worst-case performance. The elasticity of asynchronous circuits design

has led to the outcome that an asynchronous circuit can perform average-case.

Because each asynchronous module and completes its own itself computations and

works, the data can be send immediately to the receiver. For this factor, asynchronous

circuits design can obviously accomplish average-case performance.

(5)Less electro-magnetic noises: Asynchronous circuit design has no clock distribution

networks, so it has less electro-magnetic noises.

Although asynchronous circuit design that compares with synchronous circuits

design has many advantages, it also has some disadvantages. One is few CAD tools to

support for designers so that designers are hard to implement the circuit. Another one is

(17)

of additional control signals. Asynchronous circuit design has many advantages and few

disadvantages. But comparing with advantages and disadvantages, asynchronous circuits

design still is good for designers to implement circuits.

2.1.2 4-Phase Handshake Protocol

Each module of asynchronous circuit communicates with other modules via

handshake protocols. In general, there are two parts in asynchronous circuits design. The

two parts are data signaling and control signaling. We can define these parts below:  Control Signaling: 2-phase, 4-phase etc.

 Data Signaling: 1-of-n encoding, dual-rail, bundled-data etc.

The data signaling has its own communicated method to transfer data between

sender and receiver. The bundled channel which data use Boolean levels to encode

information and separates request and acknowledge wires are bundled with the data

signal. The bundled data is also called single-rail data. The bundled data channel is

shown in Fig 2-1.

Four-phase bundled data handshake protocol is shown in Fig 2-2. It uses the ACK

and REQ signal to synchronize between receiver and sender. Four-phase protocol must

return to zero after a transmission success, so it also is known as the return-to-zero

(18)

8

is asserted to 1 by Sender. When Receiver accepts the REQ signal that is 1, Receiver

knows the data is valid and receive the data. Then, Receiver will assert ACK to 1 when

the data are already received. When Sender receives the ACK equal to 1 from Receiver,

it knows the data have already transmitted to Receiver. Sender will pulls down REQ

signal to 0 and stops transferring data. Finally, the Receiver also pulls down ACK to 0.

The transfer is finished.

Sender

Receiver

REQ

DATA

ACK

Fig 2-1 Bundled Data Channel

DATA

ACK

REQ

Fig 2-2 Four-Phase Handshake Protocol

2.1.3 Dual-Rail Data Encoding Methods

(19)

point that the system does not have REQ signal, and uses 2-bit to encode 1-bit data. The

dual-rail encoding method is different from bundled data. The encoding method is shown

in Table 2-1. d.t d.f 0 0 Empty “E” 0 1 Valid “0” 1 0 Valid “1” 1 1 Not used Table 2-1 Dual-Rail Encoding Method

Dual-rail means that each data bit is encoded into two wires which called “d.t” and “d.f”. The pattern of {d.t, d.t} represents as following: {0, 0} means an empty token; {0, 1} means a valid data “0”; {1, 0} means a valid data “1”; {1, 1} means an invalid data and are not used. If the system uses dual-rail protocol to transfer n-bits data, there will

have 2xn data lines. In addition, the system does not have REQ signal, the receiver needs

an extra circuit to detect the arrival of the data signal. This special circuit is called

complete detection in dual-rail system. Fig 2-3 is a diagram of dual-rail data encoding

(20)

10

Sender

Receiver

DATA

ACK

2xn

Fig 2-3 Diagram of Dual-Rail Data Encoding

Ack

Data

Empty Valid Empty Valid

Fig 2-4 4-Phase Dual-Rail Handshake Protocol

Fig 2-4 shows the process of data transferring using 4-phase dual-rail protocol.

Initially, data is Empty, and it means that the data are all 0 and Ack signal is 0.When data

become valid and Receiver use complete detection to detect the data is ready, Receiver

capture data and pulls up Ack signal to 1. Then sender stops sending data and data

changes to Empty. Finally, Receiver pulls down Ack and the transfer is finished.

Dual-rail protocol uses Empty to separate each Valid data. After the transfer is completed,

the data wires will return to Empty. Thus, the sequence of data transfer is

(21)

modules based on dual-rail data encoding that is delay-insensitive, it works correctly

regardless of the delays in gates and wires.

E

0

1

Fig 2-5 Transfer Diagram

Input 1 Input 2 Output

0 0 0

0 1 Not change 1 0 Not change

1 1 1

Table 2-2 Truth Table of Muller C-element

2.2 Asynchronous pipeline

There are several asynchronous pipeline implementation styles have been proposed.

Muller pipeline is one of the most important models in asynchronous pipeline. Muller pipeline

(22)

12

Muller C-element. Usually, we use the Muller C-element to control asynchronous pipeline

and implement complete detection. In this section, we introduce basic components of

asynchronous dual-rail pipeline.

2.2.1 Muller C-element

Muller C-element plays an important role on asynchronous circuit. It is a

state-holding element just like an asynchronous set-reset latch. The behaviors of Muller

C-element are shown in Table 2-2. When both inputs are 0, the output will be set to 0.

When both inputs are 1, the output will be set to 1. When the inputs are different from

each other, the output will be kept the previous output. In Fig 2-6, we show the gate level

implementation of Muller C-element. Muller C-element with reset is shown in Fig 2-7.

A

B

C

Z

(23)

A

B

Z

reset

A

B

C

Z

reset

Fig 2-7 Muller C-element and its RTL Implementation with Reset

2.2.2 Complete Detection

The valid data in dual-rail data encoding implementation is detected by complete

detection. If complete detection has too many inputs, a lots of C-elements are needed in

complete detection implementation. It also means very big overhead in design. In order

to reduce the overhead, we choose the alternative complete detection. The 3-bits

alternative complete detection is shown in Fig 2-8. d.t[0] d.f[0] d.t[1] d.f[1] d.t[2] d.f[2] Ackout C

(24)

14

2.2.3 4-Phase Dual-Rail Pipeline

As mention above, Muller pipeline is a very popular implementation style in

asynchronous circuits design. We can use bundled data or dual-rail data based on Muller

pipeline to create design that we want. Bundled data Muller pipeline is shown in Fig 2-9.

A Muller pipeline would generate local clock pulses that just like clock of synchronous

circuits. Every stage will generate an ACK signal to previous stage and a REQ signal to

next stage. A four-phase bundle data pipeline with function blocks is shown in Fig 2-10.

If a pipeline with data processing, the combinational circuits can be added between the

stages. In order to maintain correct behaviors of every stage, matching delay must be

inserted in the request signal paths.

Another encoding way to implement Muller pipeline is dual-rail encoding method.

Because dual-rail circuit does not have the REQ signal, we must detect the data that is

already valid or not. We use the detection to decide the ACK whether pulls down or up.

The four-phase dual-rail data pipeline must alternatively send the empty token and valid

(25)

LEFT RIGHT REQ ACK REQ REQ REQ ACK ACK

C[i-1] C[i] C[i+1]

C C C

… …

ACK

Fig 2-9 Four-Phase Bundled Data Pipeline

C

Latch Latch Latch

En En En

REQ REQ REQ

ACK ACK ACK ACK REQ Function Block Function Block Function Block Delay Delay Delay C C

Fig 2-10 Four-Phase Bundled Data Pipeline with Function Block

C ACK ACK ACK ACK C C C C C C C C C C C C C C a.t a.f b.t b.t

(26)

16

2.3 Loop Detection

As mention previously, we could use the characteristic of loops in embedded systems.

These loops are small and frequently executed. It is widely known that loops have the

temporal and spatial locality. In other words, the same instructions of a loop would be

repeatedly accessed in the certain period of program execution. Often the certain period of

program execution is finished, the loop may be accessed infrequent or never be accessed

again. Thus, we must determine if the current loop should be placed into the loop buffer. As

soon as the loop becomes hot, we should place it into the loop buffer. We can use the

instructions of loop buffer to reduce the operations in instruction fetch and decoded stage,

because the instructions of frequent loops are already put in the buffer. Thus, repeated

instruction fetch and decode operations can be avoided. Otherwise, if the instructions of loop

buffer are not accessed often when the loop is detected, the instructions of the loop in the loop

buffer should be replaced. In this section, we will introduce two basic methods to decide

whether the loop should be placed into the buffer, and we also explain why the two methods

would bring some problems.

2.3.1 Counter-Based Loop Detection

The first method is counter-based loop detection [1]. The method takes advantage of the loops that may have a backward branch instruction. This characteristic is based on a special class of branch instructions, called the short backward branch instruction (sbb).

(27)

Short backward branch instruction is shown in Fig 2-12.

Because it is based on short backward branch instruction, the upper displacements

are all ones (indicating a negative branch displacement). The lower portion of

displacement is w-bits wide. By definition, a sbb has a maximum backward branch

distance given by

2

w instructions. The size of loop buffer is also given by

2

w

instructions. When a sbb instruction is detected and found to be taken, the hardware

assumes that a program is executing a loop and initiates all the suitable control actions to

utilize the loop buffer. The sbb is called the triggering sbb.

opcode 111…111 xx...xx

upper displacement lower displacement w bits Branch displacement

Fig 2-12 Short Backward Branch Instruction

Count_Register +1 Comparator 0 Increment Counter Triggering_sbb count w w w w / / _/ /

(28)

18

Counter-based scheme to monitor sbb executions is shown in Fig 2-13. When a sbb

is encountered and taken, its lower displacement is load into a w-bit increment counter

called Count_Register. The sbb becomes the triggering sbb, and the loop buffer

controller enters FILL state. In this state, the instructions that being fetched from

instruction cache fill into the loop buffer by controller. The hardware increases this

negative displacement by one, each time an instruction is executed sequentially. As the

negative value in the Count_Register increases and becomes zero, the controller knows

that the instruction currently being executed is the triggering sbb. If the triggering sbb is

not taken, controller returns to the IDLE state. Otherwise, it enters the ACTIVE state. In

the ACTIVE state, the instructions that originally request to the instruction cache would

be directed to the loop buffer by the controller.

Although the method only increases a little overheads and it is very simple, it also

have some critical problems. First, the loop that has a triggering sbb cannot have any

nested loop. Second, the loop just taken one time, and the instructions of the loop would

be putted into loop buffer. Thus, a frequently executed loop might be replacement by a

loop that just taken. Third, the loop buffer only stores the instructions of a loop, the other

unused space of loop buffer is wasted.

2.3.2 Basic Block Loop Detection

(29)

counter-based loop detection, this way is more flexible and simple. Basic block loop

detection regards a program is composed of many basic blocks. The basic block is a

straight-line code sequence composed of non-branch instructions and one branch

instruction, which determines the direction of the following instruction stream. The basic

block in program is shown in Fig 2-14.

Non-branch instrucitons Branch instruction Basic block 1: Basic block 2: Basic block 3: Non-branch instrucitons Branch instruction Non-branch instrucitons Branch instruction PBAR SC

Fig 2-14 Basic Blocks in Instruction Stream

(30)

20

would jump to a non-branch instruction when those branch instructions taken.

Basic block loop detection could be described as follows:

(1) When a branch instruction is fetched, the hardware would compare the address

of current branch instruction with the value of Previous Branch Address Register

(PBAR). If the two values are equal to each other, it means that the loop was executed

because the same branch instruction is fetched again. Thus, if two values match and the

current branch instruction is predicted as a taken branch, same instruction within the loop

will be fetched again. At the same time, these instructions are transfer to the loop buffer.

Otherwise, if the two values mismatch, it means that a new basic block will be

executed. In this status, the PBAR would store the address of current branch instruction

and reset the value in Size Counter (SC).

(2) When a non-branch instruction is fetched, the value in Size Counter would be

increased by 1. Until a branch instruction is encountered, Size Counter always is added

one because of fetching non-branch instruction. Thus, the number of instructions of the

previously executed loop could be got from Size Counter.

However, this way also has some troubles. Comparing with the counter-based loop

detection, basic block loop detection does not check the displacement of branch

instruction. It can reduce some overheads in hardware, but it is also the major problem in

(31)

by an infrequent loop. In other word, the frequent loop must be fetched and transferred

into loop buffer again, because it is replaced by an infrequent loop.

2.4 Decoded Filter Cache

In synchronous embedded processors, instruction fetch and decode unit usually take a

large percentage of total power dissipation of embedded processor [5] [6]. For example,

StrongARM spends more than 40% in instruction fetch and decode [5] [6]. The power

dissipation by different components of StrongARM is shown in Table 2-3. However, we could

take advantages of loop buffer to reduce the power consumption of instruction fetch [4]. The

decode unit still repeatedly executes the same instruction of the frequent loop.

Instruction cache

Instruction decode

Data cache

Clock

Execution

other

27%

18%

16%

10%

8%

21%

Table 2-3 Power Dissipation in StrongARM

The technique uses decode filter cache to save power consumption in embedded

(32)

22

When a hit in DFC, the DFC eliminates one fetch from the I-cache and the subsequent decode,

which results in power saving. DFC results in 50% more power savings than an instruction

filter cache [4] and the average reduction in processor power is 34% [5]. Processor pipeline

with decode filter cache is shown in Fig 2-15. In Fig 2-15, there are five stages. When an

instruction is decoded, it will be stored into DFC. Once the instruction is executed again, it

can be accessed from DFC and will avoid I-cache to be accessed. DFC not only can decrease

the times of accessing I-cache but some actions of decoded stage. But there are some

problems with the DFC, such as unnecessary actions. DFC will store decoded instructions

whether the instructions is in a loop or not. If the instructions are not in a loop, they will not

be executed again. Thus, storing these instructions would waste the space of DFC.

DFC

I-cache

Fetch

Decode

Execute

Mem

write back

(33)

Chapter 3 The Design of Predecode Loop Buffer

In this chapter, we will present the asynchronous 2-way VLIW processor in the

beginning. Including pipeline architecture and basic components and branch instruction, we

show details in the chapter 3.1. Then we introduce the design of predecode loop buffer for the

target asynchronous two-way VLIW processor. Predecode loop buffer uses an easy and novel

method to decide whether loop would be replaced or stored, and we will decrease overheads

as far as possible. A miss of predecode loop buffer deals hardly with pipeline, but

Asynchronous Muller Pipeline only has 50% utilization that could help to solve some

problems. We will present our miss recovery mechanism in chapter 3.2.5.

3.1 Asynchronous 2-Way VLIW Processor

We use the 4-phase dual-rail data encoding method to implement asynchronous two-way

VLIW processor. It has its own instruction set architecture (ISA) and has six stages in pipeline.

In addition to R-type, I-type instruction that like MIPS, it also supports SIMD and MAC

instructions. We use the feature of Muller Pipeline to handle branch instruction.

3.1.1 Pipeline Architecture

As mention previously, we use 4-phase handshake protocol and dual-rail data encoding method to control pipeline. The pipeline architecture of asynchronous two-way

(34)

24

Instruction Prefetch (IPF), Instruction Dispatch (IDP), Instruction Decode (ID), Execution 1 (EX1), Execution 2 (EX2) and Write Back (WB).

MAC Decode LDSD Decode Instruction Prefetch Write Back MAC_EX2 MAC_EX1 Dispatch Register Bank LDSD_EX1 LDSD_EX2 Data Memory Sy n< ─ > A sy n In te rf ac e Sy n< ─ > A sy n In te rf ac e Instruction Memory

Fig 3-1 The Pipeline Architecture of Asynchronous Two Way Processor

The followings are actions of the six stages of the target asynchronous two-way

VLIW processor:

1. Instruction Prefetch: According to PC (Program Counter), instruction would be

fetched from instruction memory via an sync/async interface. The interface can

transfer the encoding method of synchronous circuit and asynchronous circuit to

(35)

2. Instruction Dispatch: This stage would dispatch these instructions to appropriate

decode unit. Because our instruction is compressed, the last bit of the instruction

can be needed to decide if it can be executed in parallel with next instruction.

3. Instruction Decode: The major work of this stage is to decide instruction and

fetching operand. The instruction will be needed to generate the control signals

controlling its execution in following stages.

4. Execution 1 and Execution 2: We separate the execution stage into two stages,

Execution 1 and Execution 2. The Mac instruction is only executed in the left part

of EX2 stage. The Load and Store instruction are only executed in the right part of

EX2 stage. In addition to Mac, Load and Store instruction, other instructions can be

executed in both the left part and the right part.

5. Write Back: In this stage, the result can be written back to the correct destination

register. If the instruction does not need to write result back, this stage will be

bypassed.

There are pipeline latches between each two stages. Because of asynchronous

pipeline latch is clockless, we use C-element to implement the pipeline latch that is

composed of C_latch. C_latch is shown in Fig 3-2. Another special component is the

asynchronous register. Each 1-bit register is made of four AND gates, two OR gates, and

(36)

26

register. The 1-bit register is shown in Fig 3-3.

Ack_in Ack_out In.t In.f Out.t Out.f C C Fig 3-2 C_Latch

din.t

din.f

dout.t

dout.f

read

~read

ack

Fig 3-3 The 1-bit Register

3.1.2 Branch Instruction

In this section, we present our efficient method to handle branch instruction. The

(37)

architecture is show in Fig 3-4. In Fig 3-4, I&V register is a component to control where

instruction are written into Buffer. Buffer stores the instructions that are already fetched.

The value of Dir register lets Prefetch unit know which part of instructions should be

fetched. The grey part of register is stand for timing of writing register.

Fig 3-4 Branch Handling Architecture

When branch instruction is taken, the MAC EX1 would send a stall control signal to

Stall unit and send target address to PC register and Dir register. Stall control signal lets

Stall unit send a bubble to next stage and clear the content of I&V register. Finally, PC Dispatch Prefetch MAC decode MAC EX1 LDST decode LDST EX1

…

pc pc+1 PC Buffer I & V |Dir | V | I | PC | PC’ PC stall ctl target address pc_mux Dir Dir Calculate PC’ Stall

(38)

28

element. The Merge element is called pc_mux in Fig 3-5. However, there is a critical

issue that must be guaranteed, the timing issue. Because of branch instruction, PC

register is possibly written by different stages in our pipeline architecture. In our branch

handling architecture, there are three combinations when a branch instruction taken. The

combination of Valid token and Empty token are shown in Fig 3-5.

Empty

Valid

Empty

Valid

Empty

Valid

Empty

Valid

Instruction Prefetch Instruction Dispatch Instruction Decode Ex1 (a) (b) (c)

Fig 3-5 The Combination of Valid Token and Empty Token

In Fig 3-5 (a) and (c), branch instruction would not produce any problem in our

branch handling architecture. In Fig 3-5 (b), because the next instruction of branch

instruction is too slow, so that it is still in Instruction Prefetch stage. At the same time,

branch instruction in EX1 stage is already completed and the result is generated. Thus,

(39)

would be written into PC register when it enters the Instruction Dispatch stage.

To solving this trouble, we assume some constraints for branch instruction. We

assume a time stamp, Td. The time stamp is that the next instruction of branch instruction

is completed in Dispatch stage and new pc address is sent from Dispatch stage. The time

stamp of branch instruction sending new target address from exe1 stage is Tb. It must be

guaranteed that Tb > Td. In other words, we must guarantee the case (b) and case (c) in

Fig 3-5 would not be happened.

3.2 Predecode Loop Buffer

This section will present the architecture of predecode loop buffer. As above-mentioned

in chapter 2, traditional loop detection would make some possible problem. In addition to

loop that must not be nested loop, the major trouble is that the frequently executed

instructions in loop buffer may be replaced by infrequently executed instructions. If the

infrequently executed instructions only execute sometimes, this replacement will become an

unnecessary action. For example, Loop 2 and Loop 3 are inside Loop 1. When the Loop 1

executes one time, the Loop 2 only executes two or three time and the Loop 3 executes

hundreds of times. In other words, Loop 2 only obtains few benefits from loop buffer.

Comparing with Loop 2, Loop 3 would get many advantages of loop buffer, because Loop 3

(40)

30

again. This replacement is an unnecessary operation. If Loop 1 also executes many times, the

unnecessary operation will become numerous. The example of program is shown in Fig 3-6.

Non-Branch Instruction Non-Branch Instruction Loop2 Loop3 Branch Instruction Branch Instruction Branch Instruction Loop1

Fig 3-6 The Example of Program

Although we take advantage of loop buffer, but there still exist repeated actions in

Instruction Decode stage. In Instruction Decode stage, it has to decode instruction and fetch

operands from register file. Fetching operand operations cannot be ignored, because the

operands of the same instruction would be different in every time of execution. Comparing

with fetching operands action of decoding instruction is always the same as previous

execution despite the instruction executing hundreds of times. Thus, Control signals are stored

(41)

keep the control signals to reduce some operations of Instruction Decode stage. Because the

control signals of instruction are more than the opcode of instruction, Predecode Loop Buffer

is bigger than traditional loop buffer. That is a tradeoff between area and performance. The

tradeoff is worth implementing, because the operations of Instruction Decode stage are one of

the major factors of performance of system. The details of Predecode Loop Buffer will be

presented at the following sections in this chapter. The Block Diagram of Instruction Decode

stage is shown in Fig 3-7.

MAC Decode LDST Decode Register Bank opcode opcode

Instruction of MAC Decode Instruction of LDST Decode

Control signal of MAC Ex1 Control signal of MAC Ex1 Register control signal Register control signal operands of MAC Ex1 operands of LDST Ex1 rd,rs,rt rd,rs,rt

(42)

32 F-reg Mode Controller

Instruction

Prefetch

Execute 1

Instruction

Decode

Instruction

Dispatch

Pipeline Latch Pipeline Latch Pipeline Latch F F Branch Information Table Branch ? Mode bit Information Predecode Loop Buffer PC Control Signal S Stall Miss Read Decoded signal Index Change Taken? Dir Dir Dir’ Dir S-reg F 0 S F PC

Fig 3-8 The Architecture of Predecode Loop Buffer

3.2.1 Overview

The architecture of Predecode Loop Buffer includes a Branch Information Table, a

Mode Controller, a Loop Buffer and three registers that include two mode registers and

an index register. The architecture of Predecode Loop Buffer is shown in Fig 3-8. Branch

Information Table would store some information of branch instruction, such as the taken

times of a branch, and information about this branch whether it would become hot or not.

The hot branch stands for that the branch is frequently executed. Mode Controller would

(43)

write information to Branch Information Table and mode registers. Every instruction

would be decided if it is accessed from Predecode Loop Buffer or stored the control

signals into Predecode Loop Buffer. Two mode registers are Store register (S-reg) and

Fast Access register (F-reg). When the value of S-reg is one, the control signals of

instruction have to be stored into Predecode after the instruction is completely decoded.

When the content of F-reg is one, the instruction can be accessed from Predecode Loop

Buffer and Instruction Prefetch unit can be bypassed. Thus, we need some elements to

bypass the Instruction Prefetch unit. These elements are DeMux and Merge. The DeMux

and Merge are shown in Fig 3-9.

Function block Function block In Sel.t Sel.f C C DeMUX Merge Out

Fig 3-9 The DeMux and Merge

When Sel.t equals to one and Sel.f equals to zero, the In signal will be passed to

(44)

34

only use a OR gate, because there are only one function block can receive the valid token

from DeMux and only on function block can pass valid token to Merge. Thus, another

function block would not work and it would be bypassed.

3.2.2 Branch Information Table

Branch Information Table (BIT) has four fields that including Tag, Execute Counter,

Frequent Flag and Pre-Frequent Flag. When a branch instruction is in the Instruction

Decode stage, Instruction Decode unit will send control signals to BIT. The BIT will be

searched by the PC value of the branch instruction and the control signals. If the

information of the branch instruction exists in BIT, the information of the branch

instruction would be accessed and be transferred to Mode Controller in EX1 stage.

Otherwise, the valid zero would be sent to Mode Controller. The Branch Information

Table is shown in Fig 3-10.

The PC of branch

Tag Execute

Counter Frequent_Flag Pre-Frequent _Flag

... ... ... ...

(45)

The reading action of BIT is in Instruction Decode stage and the writing action of

BIT is in EX1 stage. Although these actions are in different stages, they are possible to

produce a problem. When the Instruction Decode stage is completed, the valid token

should be in Ex 1 stage and Instruction Decode stage should be returned to zero. It stands

for the empty token. However, if the delay time of Ex 1 stage from this stage starts to

finish shorter than the delay time of returning Instruction Decode stage to zero, BIT will

be accessed and written at the same time. To avoid this trouble, we must provide that

writing BIT is after Instruction Decode stage returns to zero.

d1.t d1.f d2.t d2.f out1.t out1.f out2.t out2.f

Fig 3-11 The Mutual Exclusion

To achieving this goal, both read and write signals are connected to a Mutual

(46)

36

the faster input that arrives in ME and d2 is the slower input that arrives in ME. The out1

would become valid because d1 is faster. Although the d2 is also valid, out2 still is empty.

Out1 will become empty after d1 is returned to zero. Then d2 just can let out2 become

valid. This section finally presents the BIT block diagram and shows how BIT can be

bypassed. BIT block diagram is shown in Fig 3-12. If PC is sent to BIT, the data of

accessing BIT will become R_Information. Otherwise, the R_Information will be

inserted with valid zero. For writing of BIT, it is like the reading of BIT. When BIT must

be written new data, wack is inserted with one by BIT. On the other hand, wack will be

inserted with one as fast as possible.

Branch

Information

Table

PC Branch? D eM ux Merge R_Information D eM ux Merge wack W_Information write

(47)

3.2.3 Mode Controller

The value of mode registers is decided by Mode Controller. Depending on the value

in mode registers, the instructions are decided if they are written into Predecode Loop

Buffer or are accessed from Predecode Loop Buffer. Thus, Mode Controller is an

important component in the architecture of Predecode Loop Buffer. There are three

modes in our Predecode Loop Buffer, including Direct Mode (DM), Fast Access Mode

(FAM) and Storage Mode (SM). DM means that the instruction will be fetched from

Instruction Memory. FAM means that the instruction will be fetched from Predecode

Loop Buffer. SM means that the control signals of the instruction must be written into

Predecode Loop Buffer after the instruction is completely decoded. In addition to the

mode registers, Mode Controller needs an extra register to calculate the count of hot

branch and the count of non-hot branch. The register is called “replace register”.

Depending to the value in this register, Mode Controller can decide the instruction of

Predecode Loop Buffer if it is frequently executed in recently period of execution.

Storage Fast

Access Direct

(48)

38

Fetch Mode Events

Direct Mode

(1) Predecode Loop Buffer misses

(2) (BIT hit)&&(Frequent Flag and Pre-Frequent Flag are 0s)&&(Execute

Counter < threshold)

(3)(BIT hit)&&(Frequent Flag or Pre-Frequent Flag are 1s)&&(branch is not

taken) Fast Access

Mode

(BIT hit)&&(Frequent Flag or Pre-Frequent Flag are 1s)&&(branch is taken)

Storage

Mode

(BIT hit)&&(Frequent Flag and Pre-Frequent Flag are 0s) &&(branch is

taken)&&(Execute Counter ≧ threshold)

Table 3-1 The Transition Events of Mode

The possible transitions of Mode are shown in Fig 3-13. The events that cause mode

transitions are shown in Table 3-1. When the BIT is accessed by a branch instruction, the

replace register is also accessed. If the hot branch is taken, Mode Controller would let the

replace register be decreased by one. Otherwise, if the non-hot branch is taken, Mode

Controller would let replace register be increased by one. If the value overflows after

increasing by one, the data is Predecode Loop Buffer are not frequently executed in

recent period of execution. Thus, Mode Controller would clean the Frequent Flag of

Branch Information Table and Pre-Frequent Flag would not be cleaned. Although

(49)

Depending upon the value of Pre-Frequent Flag, the fetch mode will be changed to FAM.

A program usually is composed of many phases. When the program enters the program

phase, it will execute instructions in the same block. That block usually contains some or

many of loops. In other words, when the program enters another program phase,

Frequent Flag should be flush because branch information in BIT may be useless. The

Transition Phase and Program Phase are shown in Fig 3-14.

Mode Controller lets BIT be always in two kinds of phase. One is writing phase and

another is monitor phase. In writing phase, the Frequent Flag is cleaned. At this time,

BIT enters another program phase. In monitor phase, it means that at least one Frequent

Flag of BIT is one. At this time, BIT will stay at the same program phase.

Writing Monitoring

Transition Phase Program

Program Phase 1 Program Phase 2 Program Phase 3

Program Phase N

... ...

Fig 3-14 The Transition Phase and Program Phase

(50)

40

the two figures.

DeMux Information Branch Mode Controller Taken Merge w_information w_ack controller_complete

Fig 3-15 The Block Diagram of Mode Controller

3.2.4 Procedure of Predecode Loop Buffer

As above sections, every component of Predecode Loop Buffer is presented. In this

section, we will introduce the procedure of Predecode Loop Buffer. The procedure

includes four steps:

Step 1: Because the Execution Count of branch instruction smaller than threshold, Mode

Controller will not change the value in mode register.

Step 2: When the Execution Count of branch instruction equals to the threshold, Mode

Controller will set S-reg and Frequent Flag to one. The control signals of next

(51)

Step 3: The Frequent Flag of branch instruction is one, so the next instruction is existed

in Predecode Loop Buffer. Thus, Mode Controller sets the F-reg to one, and the

next instruction can be accessed from Predecode Loop Buffer.

Step 4: When a miss is produced, it stands for the next instruction that must be fetched

from Instruction Memory. In next section, we will present details about miss

recovery mechanism of Predecode Loop Buffer.

ADD R3,R2,R1 ADD R6,R4,R5

BEQ R9,R3,target NOP

... ... SUB R3,R2,R1 SUB R6,R4,R5 Instruction Prefetch Instruction Decode EX1 Instruction Dispatch BEQ SUB ADD

Fig 3-16 A Frequent Loop in a Program

Fig 3-16 shows a frequent loop in a program. When the BEQ instruction is decoded

first time, the information of BEQ cannot be accessed from BIT. Because this is the first

time to execute, the information of the BEQ are all zero. As soon as the decode stage is

(52)

42

mechanism is implemented in our processor, much design overhead can be reduced,

Mode Controller only needs to wait the result of BEQ whether it is taken or not. If the

BEQ is taken, Mode Controller will add Execution Counter by one and decide which

mode should be written into mode registers. Otherwise, Mode Controller will pass the

action of increasing Execution Counter. We assume the BEQ is taken. Because the BEQ

is executed first time, the mode will not be changed. At the same time, the SUB

instruction will be flushed by branch handling mechanism. Every time the BEQ comes

before Execution Counter equals to the threshold, these actions will be repeated. When

the value of Execution Counter equals to the threshold, Mode Controller sets S-reg to

one in EX1 stage. Thus, the control signals of next instruction should be written into

Predecode Loop Buffer. In addition to the control signals of ADD instruction, some of

the control signals in Instruction Prefetch and Instruction Dispatch also need to be stored.

At this moment, storing these signals may have a problem. An instruction in our VLIW

processor may be dispatched two times, because the instructions are stored in

(53)

Instruction 1 Instruction 2 VLIW Instruction

Instruction 1 NOP Instruction 1 Instruction 2

NOP Instruction 2

Dispatch

(a)

(b)

Fig 3-17 The Cases of the Instruction

Instruction 1 Instruction 2 Index Pipeline Latch p-bit _p-bit change Dir Taken

Fig 3-18 The Hardware Solves the Instruction with Same Address

If the instruction 1 and instruction 2 can be executed in parallel, the last bit of

instruction 1 is one. If the last bit of instruction 1 is zero, the two instructions cannot be

(54)

44

would not produce any trouble because the two instructions can be executed in parallel.

In Fig 3-17 (a), storing control signal would produce a problem because the two

instructions cannot be executed in parallel. That is because they have same PC values.

The instruction must be dispatched two times. Thus, when this instruction is fetched

from Predecode Loop Buffer, we cannot decide which one is need.

To solve this problem, we add an XOR gate, an OR gate and an Index register. Fig

3-18 shows that the hardware solve that the instruction with same address. At first, we

can decide if the instruction is a complete VLIW instruction or half through Dispatch

stage according to p-bit. The p-bit of two instructions are connected to an OR gate. The

output of OR gate is called change bit. Index register stores the value of direction of the

VLIW instruction in EX1 stage. Then change bit and the output of Index register are

connected to a XOR gate. Finally, the index of next VLIW instructions through Dispatch

stage is the output of XOR gate.

The output would be computed in Decode stage and written into Index register in

EX1 stage. If the value of Index register is 1, the direction of instruction is left.

Otherwise, the direction of instruction is right. Table 3-2 shows the relationship between

(55)

p-bit of Instruction 1 p-bit of Instruction2 Change

0 0 0

0 1 1

1 0 1

1 1 Not Used

Table 3-2 The Relationship between P-Bit and Change Bit

Change Index of this instruction Index of next instruction

0 0 1

0 1 0

1 0 0

1 1 1

Table 3-3 The Relationship between Change Bit and Index

After we solve the trouble of writing data to Predecode Loop Buffer, every

instruction can be written and accessed at correct place. The next times of the BEQ that

is executed, Mode Controller receives the data from BIT. Because the Frequent Flag is

one, Mode Controller will set the F-reg to one in EX1 stage. The F-reg has the timing

(56)

46

possibly written by different stages. When the ADD instruction is executed again after

F-reg is set to one, the current instruction can be fetched from Predecode Loop Buffer in

Instruction Dispatch stage. The Instruction Prefetch and Dispatch stages can also be

bypassed. The bypass mechanism can also be implemented with DeMux and Merge pair.

When S-reg is set to 1, Predecode Loop Buffer does not have information that we

need. Otherwise, if F-reg is set to one, Predecode Loop Buffer already has data that we

need. Finally, there are three cases that would change the value of F-reg to zero. First, a

miss of Predecode Loop Buffer causes the information of instruction of loop that we

need is lost. Second, the branch instruction is hot branch, and it is untaken at this time.

Third, there are other branch instructions to change Mode.

3.2.5 Miss Recovery Mechanism

The major work of miss recovery mechanism is recovering the action of every stage

and fetching the miss instruction again from Instruction memory. To achieve this work,

we must flush that the wrong information in every stage. Fortunately, 4-phase dual-rail

pipeline only have 50% utilization and Predecode Loop Buffer is read in Instruction

Dispatch. In other words, we just need to deal with the miss in Instruction Dispatch,

because Instruction Prefetch and Instruction Decode stages are empty.

(57)

occurs, some signals and the value in F-reg should be re-transferred. At first, the value in

F-reg is changed to zero. The action can provide that the next instruction is with the

correct mode. Then the PC address of the miss instruction will be written into the PC

register. The action can provide that the miss instruction is fetched from instruction

memory with correct address.

Finally, we must flush this miss instruction. We take an easy and quick method to

solve this issue. In Fig 3-4, there is a stall unit. Thus, the miss signal will be sent to the

stall unit. Then the stall unit will insert a NOP to the next stage. This action is almost like

the branch instruction is taken and a bubble must be inserted to the next stage. Although

Predecode Loop Buffer can correctly work, there are some places that can be improved.

We will present these future works in chapter 5.

Fig 3-19 Miss Recovery Mechanism

Instruction

Dispatch

Pipeline Latch Predecode Loop Buffer Stall Miss Decoded signal F.A D.A PC

(58)

48

Chapter 4 SIMULATION RESULT

In this chapter, we will show the area and timing simulation of Predecode Loop

Buffer. The design was synthesized by Design Compiler under TSMC 0.13μ m2process.

First, we present the area of every element including C-element, 1-bit register, Branch

Information Table, Predecode Loop Buffer. We show the timing simulation of the design.

Fig 4-1 shows the timing diagram of Predecode Loop Buffer.

Fig 4-1 The Timing Diagram of Predecode Loop Buffer

4.1 Area Simulation

(59)

4-2 presents the area of Predecode Loop Buffer, Branch Information and Mode

Controller comparing with asynchronous two way processor.

1-bit register C_Latch C-element

Cell Area (μ m2) ~54 ~41 ~21

Table 4-1 the area of basic element in our Predecode Loop Buffer

BIT Predecode Loop Buffer Mode Controller Processor

Cell Area (μ m2) ~8600 ~240000 ~1200 ~900000

Ratio 0.95% 26.67% 0.1% 100%

Table 4-2 the area compares with asynchronous two way VLIW processor

The cell area of Predecode Loop Buffer is big, because it stores the control signals.

The control signals are bigger than instructions, so the Predecode Loop Buffer is bigger

than traditional loop buffer. The BIT and Mode Controller are small components in our

design.

4.2 Timing Simulation

(60)

50

BIT Predecode Loop Buffer Mode Controller

Delay (ns) 4.39 7.41 4.76

(61)

Chapter 5 CONCLUSIONS AND FUTURE WORKS

In this thesis, we propose a novel loop detection. It is implemented on an asynchronous

two way VLIW processor. Because the decoded instruction is a major operation of system, the

control signals are stored after the instruction is decoded to reduce some actions of instruction

decode. We use the bigger loop buffer to replace traditional loop buffer, this is a tradeoff.

Predecode Loop Buffer not only can decrease the accessed times of Instruction memory but

can decrease the operations of Instruction Dispatch and Instruction Decode stages.

Because the instructions are stored in compressed form, the instructions may be divided

to two parts and transferred to EX1 stage. When the control signals of the instructions are

stored in Predecode Loop Buffer, it will produce a problem. To solve this problem, we add

some components to differentiate the instruction that may be divided. The design was

synthesized by Design Compiler under TSMC 0.13μ m2process. The area of Predecode Loop

Buffer is 240000μ m2 and its delay time is 7.41 ns.

Although Predecode Loop Buffer can decrease the instruction memory access times, it is

bigger than traditional loop buffer. In addition to the control signals is stored in Predecode, the

instruction is compressed that also make less instruction to be stored in Predecode Loop

Buffer. It is an issue that the control signals of the instruction how can be efficiently stored in

(62)

52

REFERENCES

[1] Lea Hwang Lee, Bill Moyer, John Arends, “Instruction Fetch Energy, Reduction Using Loop Caches For Embedded Applications with Small Tight Loops”, in Proc. Int. Symp.

Low Power Element. Design, 1999, pp. 267-269.

[2] ChiTa Wu, Ang-Chih Hsieh, and TingTing Hwang, “Instruction Buffering for Nested

Loops in Low-Power Design”, in Circuits and Systems, 2002. ISCAS 2002. IEEE

International Symposium. vol. 4, 2002, pp. 81-84

[3] Na Ra Yang, Gilsang Yoon, Jeonghwan Lee, Jong Myon Kim, Intae Hwang, Cheol Hong Kim, “Loop Detection for Energy-aware High Performance Embedded Processors”, in

Asia-Pacific Services Computing Conference, 2008, pp. 1578-1583

[4] Weiyu Tang, Rajesh Gupta, Alexandru Nicolau, “Design of a predictive Filter Cache for Energy Savings in High Performance Processor Architectures”, in Int’l Conf. on Computer

Design, 2001

[5] Weiyu Tang, Rajesh Gupta, Alexandru Nicolau, “Power Savings in Embedded Processors through Decode Filter Cache”, in Design, Automation and Test in Europe Conference and

Exhibition, 2002, pp. 443-448

[6] Weiyu Tang, Arun Kejariwal, Alexander V. Veidennbaum and Alexandru Nicolau, “A Predictive Decode Filter Cache for Reducing Power Consumption in Embedded Processors”, in Transactions on Design Automation of Electronic Systems. vol 12, 2007

(63)

[7] T. Anderson and S. Agarwala, “Effective Hardware-Based Two-way Loop Cache for High Performance Low Power Processors”, in Proc. Int. Conf. Comput. Design, 2000, pp. 403-407.

[8] Ya-Lan Tsao, Wei-Hao Chen, WenSheng Cheng, Maw-Ching Lin and ShyhJye Jou, “Hardware Nested Looping of Parameterized and Embedded DSP Core”, in SOC

Conference, 2003. Proceedings. IEEE International, 2003, pp. 49-52.

[9] Marcos R. de Alba, David R. Kaeli, “Runtime Predictability of Loops”, in Workload

Characterization, 2001. IEEE International Workshop, 2001, pp. 91-98.

[10] Timothy Sherwood and Brad Calder, “Loop Termination Prediction”, in ISHPC, 2000, pp.73-87.

[11] M. Monchieroa, G. Palermoa, M. Samia, C. Silvanoa, , V. Zaccariab, R. Zafalonb, “Low-power branch prediction techniques for VLIW architectures: a compiler-hints based approach”, in INTEGRATION, 2005, pp.515-524.

非同步雙道超大指令字組處理器之預解碼迴圈緩衝器設計

國

立

交

通

大

學

資訊科學與工程研究所

碩

士

論

文

非同步雙道超大指令字組處理器之

預解碼迴圈緩衝器設計

Predecode Loop Buffer Design for Asynchronous two-way VLIW

Processor

研 究 生：陳建志

非同步雙道超大指令字組處理器之預解碼迴圈緩衝器設計

Predecode Loop Buffer Design for Asynchronous two-way VLIW Processor

研 究 生：陳建志 Student：Jian-Jhih Chen

指導教授：陳昌居 Advisor：Chang-Jiu Chen

國 立 交 通 大 學

資 訊 科 學 與 工 程 研 究 所

碩 士 論 文

非同步雙道超大指令字組處理器之預解碼迴圈

緩衝器設計

研究生：陳建志

指導教授：陳昌居 教授

國立交通大學資訊科學與工程研究所

摘 要

Predecode Loop Buffer Design for Asynchronous

two-way VLIW Processor

Student：Jian-Jhih Chen

Advisor：Dr. Chang-Jiu Chen

Abstract

Acknowledgement

CONTENTS

LIST OF PICTURES

LIST OF TABLES

Chapter 1 Introduction

1.1 Motivation

1.2 The Organization of the Thesis

Chapter 2 Related Works

2.1 Asynchronous circuits

2.1.1 Advantages and drawbacks

2.1.2 4-Phase Handshake Protocol

Sender

Receiver

REQ

DATA

ACK

DATA

ACK

REQ

2.1.3 Dual-Rail Data Encoding Methods

Sender

Receiver

DATA

ACK

2xn

Ack

Data

E

0

1

2.2 Asynchronous pipeline

2.2.1 Muller C-element

A

A

B

B

C

Z

Z

A

B

Z

reset

A

B

研究生：陳建志

研究生：陳建志 Student：Jian-Jhih Chen

國立交通大學

資訊科學與工程研究所

碩士論文

指導教授：陳昌居教授

摘要