National Chiao Tung University

Institute of Computer Science and Engineering

Master's Thesis

Low-Latency Design for Asynchronous DMA Controller

Student: Chun-Chih Tai

Advisor: Prof. Chang-Jiu Chen

Low-Latency Design for Asynchronous DMA Controller

Student: Chun-Chih Tai

Advisor: Chang-Jiu Chen

National Chiao Tung University

Institute of Computer Science and Engineering

Master's Thesis

A Thesis Submitted to the Institute of Computer Science and Engineering, College of Computer Science, National Chiao Tung University, in Partial Fulfillment of the Requirements for the Degree of Master in Computer Science

August 2009

Hsinchu, Taiwan, Republic of China

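Low-Latency Design for Asynchronous DMA Controller

Student: Chun-Chih Tai

Advisor: Prof. Chang-Jiu Chen

National Chiao Tung University

Institute of Computer Science and Engineering

Abstract (Chinese version)

Embedded multimedia systems and digital signal processors are now very widely used and are applied in many areas. As the image quality of multimedia keeps improving, the amount of data required grows accordingly. Image processing on a digital signal processor usually needs a very large amount of data, and because the fast internal memory of the digital signal processor is limited, this data must be stored in external memory. Having the processor operate directly on data in external memory usually hurts performance, so the data must be moved from external memory to internal memory. Moving data efficiently from external memory to internal memory is therefore very important. We propose a method for reducing the latency inside the direct memory access controller. The design executes the switching time between different channels in parallel with the time spent computing transfer addresses, which reduces wasted time and lets the direct memory access controller move data efficiently. This improves the performance of the direct memory access controller and, when it is used with a digital signal processor, is also expected to improve the performance of the processor.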

Low-Latency Design for Asynchronous DMA Controller

Student: Chun-Chih Tai

Advisor: Prof. Chang-Jiu Chen

Department of Computer Science and Information Engineering

National Chiao Tung University

Abstract

Embedded multimedia systems and digital signal processors are now widely used in a broad range of applications. As the quality of multimedia images becomes higher and higher, the images carry an enormous amount of data. In general, image processing on a digital signal processor requires a large amount of image data to be stored in external memory, because the size of the fast internal memory is usually very limited. Efficient data transfer therefore becomes very important.

We present a method that reduces the latency of the direct memory access controller. In this design, digital signal processor processing is overlapped with direct memory access controller processing. As a result, the design reduces the internal latency of the direct memory access controller by 90%.

Acknowledgement

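For the successful completion of this thesis, I would first like to thank Professor Chang-Jiu Chen for his careful guidance over the past two years, and the oral defense committee for their valuable suggestions, which helped me make the thesis more complete. I would also like to thank the senior students in our lab, 緯民, 元騰, and 宏岳, for their initial suggestions and for their enthusiastic guidance and help while I was writing the thesis; my labmates 嘉承, 榮祥, and 俊偉 for their warm help and encouragement; and the junior students for cheering me on. Thank you all for accompanying me through these two years of graduate school. Finally, I thank my girlfriend for staying by my side and encouraging me to keep going, and my family for their very important support, which allowed me to concentrate on completing my graduate studies.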

2.4 Problem
2.5 Comparison
Chapter 3 Design
3.1 Data flow
3.2 Register Formats
3.3 Architecture
3.3.1 Transfer Engine
3.3.2 Address Interface
3.3.3 Counter
Chapter 4 Simulation
4.1 Area Simulation
4.2 Timing Simulation
Chapter 5 Conclusion
References

List of Figures

Figure 1 (a) Bundle data channel
Figure 1 (b) 4-phase protocol
Figure 2 (a) Dual-rail
Figure 2 (b) Dual-rail protocol
Figure 3 Transfer diagram
Figure 4 (a) The symbol of Muller C-element and its gate level
Figure 4 (b) C-element with reset
Figure 5 (a) Four-phase bundle data Muller pipeline
Figure 5 (b) Four-phase bundle data Muller pipeline with matching delay
Figure 6 DMA controller structure
Figure 7 (a) Architecture of one interface
Figure 7 (b) Architecture of two interfaces
Figure 8 (a) Timeline of one interface
Figure 8 (b) Timeline of two interfaces
Figure 9 Data flow 1
Figure 10 Data flow 2
Figure 11 (a) Source register
Figure 11 (b) Destination register
Figure 11 (c) Count register
Figure 11 (d) Control register
Figure 11 (e) Content of the control register
Figure 12 Architecture
Figure 13 Transfer Engine
Figure 14 COMPARE_1
Figure 15 The gate level of Detection
Figure 16 Engine Register
Figure 17 Address Interface
Figure 18 (a) DECISION
Figure 18 (b) Components of DECISION
Figure 19 SELECT_1 Generator
Figure 20 SELECT_2 Generator
Figure 21 MUX
Figure 22 (a) COUNTER
Figure 23 (a) The last data of the original architecture
Figure 23 (b) The first data of the original architecture
Figure 24 (a) The last data of the improved architecture

List of Tables

Table 1 Encoding method
Table 2 The function of Muller C-element
Table 3 Comparison of data transfers
Table 4 The improved architecture
Table 5 The original architecture
Table 6 Comparison of Latency

Chapter 1 Introduction

In recent years, embedded multimedia systems and Digital Signal Processors (DSPs) have become more and more popular. Because the quality requirements for images keep rising, the DSP needs to handle a large amount of image data. However, most image data are stored in external memory, because the size of the fast internal memory is usually very limited. Thus, the efficiency of Direct Memory Access (DMA) data transfers is becoming one of the most important parts of these embedded multimedia systems [4].

Another important topic is the Direct Memory Access Controller (DMAC). The processor often uses a DMAC to transfer data, and there is much related research about it. It is widely known that DMA transfers have many advantages, such as eliminating CPU overhead and increasing processor performance. We therefore use a DMAC to transfer data for the processor.

It is widely known that low power and solutions are two of the most important issues for embedded multimedia systems. Asynchronous circuits can easily address these issues [3]. In order to design the ideal embedded multimedia systems with asynchronous circuits, an

different from synchronous circuit design, asynchronous circuit design is an emerging approach. Asynchronous circuits use handshaking protocols rather than a global clock to synchronize communication. In addition, their power consumption is inherently lower than that of synchronous circuits, because asynchronous circuits have almost zero power dissipation when there is no work to do. Furthermore, asynchronous circuit design has other advantages, including no clock skew problem, low EMI, and more robustness for environment

Chapter 2 Related Works

2.1 Asynchronous Circuits Design

Asynchronous circuit design is an emerging approach to circuit design. There are many differences between traditional circuit designs and asynchronous circuits. Asynchronous circuits have many advantages and use a number of unusual circuit structures.

2.1.1 Advantages

Compared with synchronous circuits, asynchronous circuits do not have a global clock. They use handshaking to complete communication. Because there is no global clock, asynchronous circuits have many advantages:

• No clock skew: Clock skew is the difference in arrival times of the clock signal at different parts of the circuit. Since asynchronous circuits, by definition, have no global clock, there is no need to worry about clock skew. In contrast, the designer of a synchronous circuit must often slow down its operation in order to accommodate the skew.

• Low power requirement: Asynchronous circuits have the potential for very low

hence saving power. This power would have been consumed by modules driven by the global clock in a synchronous circuit even when there is no data to process. This is an increasingly important issue for VLSI, especially since more and more systems are becoming portable. In fact, the clock in a synchronous circuit consumes up to 36% to 40% of the total power [5,6]. For these portable systems, the advantage of lower power consumption and simpler power distribution is self-evident: longer battery life.

• Average case performance: Because the asynchronous pipeline is elastic, each stage can run at its average-case speed rather than its worst-case speed. When an asynchronous circuit completes its work, it can transfer data immediately. It does not need to pay attention to other circuits and therefore does not need to wait for the slowest one. In contrast, the stages of a synchronous circuit must take the execution time of the critical path as the clock cycle time, so synchronous circuits run at worst-case performance. The stages of an asynchronous circuit are independent: each stage has its own completion time and does not influence other stages.

• Modularity: Different synchronous designs may have different clock requirements,

In contrast, asynchronous designs use handshaking to communicate with each other. The designers only need to know the inputs, the outputs, and the handshaking method. Asynchronous circuits can easily be applied to each kind of circuit, and the designers do not need to take care of the speed differences between circuits. In this modular approach, each module is self-timed, and so is the whole system built from the modules.

2.1.2 Handshake protocols

Handshaking in asynchronous circuits has two common protocols: the bundled-data protocol (Figure 1 (a)) and the dual-rail protocol (Figure 2 (a)). The four-phase bundled-data protocol uses REQ and ACK signals to control all of the transfer steps (Figure 1 (b)). Initially, REQ and ACK are both 0. When DATA in the Sender is ready, the Sender pulls REQ up to 1 (1); the Receiver then captures DATA and pulls ACK up to 1 (2). When the Sender receives the ACK signal, it pulls REQ down to 0 and stops sending DATA. Finally, the Receiver pulls

Figure 1 (a) Bundle data channel (SENDER, RECEIVER)

Figure 1 (b) 4-phase protocol
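The four-phase sequence described above can be sketched as a small software model. This is only an illustration, not the circuit used in this thesis: the function name `four_phase_transfer` and the trace format are assumptions made for the example, and the final step (ACK returning to 0) follows the standard four-phase return-to-zero convention.

```python
def four_phase_transfer(values):
    """Model a sender and a receiver on a four-phase bundled-data channel.

    Returns the values captured by the receiver and a trace of
    (REQ, ACK) pairs, one pair per handshake phase.
    """
    req, ack = 0, 0                # both control wires start at 0
    received, trace = [], []
    for value in values:
        data = value               # sender places DATA on the bundle
        req = 1                    # phase 1: sender raises REQ
        trace.append((req, ack))
        received.append(data)      # receiver captures DATA ...
        ack = 1                    # phase 2: ... and raises ACK
        trace.append((req, ack))
        req = 0                    # phase 3: sender lowers REQ, stops sending
        trace.append((req, ack))
        ack = 0                    # phase 4: receiver lowers ACK, channel idle
        trace.append((req, ack))
    return received, trace


if __name__ == "__main__":
    got, phases = four_phase_transfer([7, 9])
    print(got)
    print(phases[:4])
```

Each transferred value costs four signal transitions, which is why the protocol is called four-phase.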

The other way of handshaking is the dual-rail protocol. The special feature of dual-rail is that the system has no REQ signal and uses 2 bits to encode 1 bit of data. The encoding method is shown in Table 1. It uses 00 to show that there is no data (EMPTY), 01 to encode a data value of 0 (VALID), and 10 to encode a data value of 1 (VALID). If the system uses the dual-rail protocol to transfer n-bit data, there

             d.t   d.f
Empty "E"    0     0
Valid "0"    0     1
Valid "1"    1     0
Not used     1     1

Table 1 Encoding method

Because the system does not have a REQ signal, the RECEIVER needs extra circuits to detect the arrival of the DATA signals. This special design in a dual-rail system is called completion detection.
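The encoding in Table 1 and the idea of detecting valid data without a REQ signal can be illustrated with a short sketch. The helper names (`encode_word`, `is_valid`, `decode_word`) and the LSB-first bit order are assumptions made for this example; they are not taken from the thesis design.

```python
EMPTY = (0, 0)  # (d.t, d.f) pair meaning "no data"


def encode_bit(bit):
    """Return the (d.t, d.f) pair for one data bit, as in Table 1."""
    return (1, 0) if bit else (0, 1)


def encode_word(value, width):
    """Encode an integer as dual-rail pairs, LSB first (2 * width wires)."""
    return [encode_bit((value >> i) & 1) for i in range(width)]


def is_valid(word):
    """Detect valid data: every pair has exactly one asserted wire."""
    return all(t ^ f for t, f in word)


def is_empty(word):
    """Detect the EMPTY spacer: every pair is (0, 0)."""
    return all(pair == EMPTY for pair in word)


def decode_word(word):
    """Recover the integer from a fully valid dual-rail word."""
    if not is_valid(word):
        raise ValueError("word is not fully valid")
    return sum(t << i for i, (t, f) in enumerate(word))


if __name__ == "__main__":
    word = encode_word(5, 3)
    print(word, decode_word(word))
```

A word that mixes valid pairs with EMPTY pairs is neither valid nor empty, which is the intermediate state a receiver sees while the wires are still switching.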

Figure 2 (b) shows the process of a data transfer using the dual-rail protocol. Initially, DATA is EMPTY (all 0), and the ACK signal is 0. When DATA is valid and the RECEIVER detects that DATA is ready, the RECEIVER captures DATA and pulls the ACK signal up to 1. Then the SENDER stops sending DATA, so DATA becomes EMPTY. Finally, the RECEIVER pulls the ACK signal down to 0 and the transfer is

Figure 2 (a) Dual-rail (2n Data wires, ACK)

Figure 2 (b) Dual-rail protocol

Valid DATA values always appear separately. The dual-rail protocol uses EMPTY to separate consecutive DATA values. After the SENDER sends one DATA value, the data wires return to EMPTY. So the sequence of DATA is EMPTY – VALID – EMPTY –

Figure 3 Transfer diagram

2.1.3 Muller C-element

The Muller C-element is a fundamental component in asynchronous circuit design. It is a state-holding element, much like an asynchronous set-reset latch. The function of the Muller C-element is shown in Table 2. When both inputs are logic 1, the output is logic 1. When both inputs are logic 0, the output is logic 0. Otherwise, the output does not change. Figure 4 (a) shows the symbol of the Muller C-element and its gate-level implementation, which uses three AND gates and one OR gate. The output C is fed back and becomes one of the inputs. Figure 4 (b) shows the C-element with a reset signal.

Input 1   Input 2   Output
0         0         0
0         1         No change
1         0         No change
1         1         1

Table 2 The function of Muller C-element
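The behaviour in Table 2 can be checked with a small software model. The sketch below assumes the usual majority-with-feedback form, C = A·B + A·C + B·C, which matches the three-AND, one-OR structure described above; the class name `CElement` is chosen only for this example.

```python
class CElement:
    """Software model of a two-input Muller C-element."""

    def __init__(self, initial=0):
        self.c = initial  # stored output, fed back as an input

    def update(self, a, b):
        # Three AND terms combined by one OR; the previous output
        # takes part in two of the AND terms.
        self.c = (a & b) | (a & self.c) | (b & self.c)
        return self.c


if __name__ == "__main__":
    gate = CElement()
    for a, b in [(0, 1), (1, 1), (1, 0), (0, 0)]:
        print(a, b, gate.update(a, b))
```

The output only rises when both inputs are 1 and only falls when both are 0, so the element waits for both of its inputs before changing state.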

Figure 4 (a) The symbol of Muller C-element and its gate level

Figure 4 (b) C-element with reset (inputs A, B and reset; output C)

2.1.4 Asynchronous Pipeline

There are several asynchronous pipeline implementation styles have been

four-phase bundled data protocol [1]. It is implemented with C-elements and inverters. Another important model is the micropipeline, which uses the two-phase bundled data protocol and was introduced by Ivan Sutherland in his Turing Award Lecture [1]. Other asynchronous pipeline implementations use different circuit design methods to replace the C-elements and latches. Because of the model we choose, we introduce the Muller pipeline here.

Figure 5 (a) shows a four-phase bundled data pipeline that is based on the Muller pipeline. A Muller pipeline is used to generate local clock pulses. The clock pulse generated in one stage overlaps with the pulses generated in the neighboring stages in a carefully controlled manner. Figure 5 (a) shows a pipeline without data processing. For a pipeline with data processing, combinational circuits can be added between the stages. In order to maintain correct behavior, matching delays have to be inserted in the request signal paths. Figure 5 (b) shows a pipeline with data

Figure 5 (a) Four-phase bundle data Muller pipeline.

Figure 5 (b) Four-phase bundle data Muller pipeline with matching delay.

2.2 Data Transfers

This section describes three ways of transferring data: program-controlled I/O (polling I/O), interrupt I/O, and DMA I/O. The program-controlled I/O lets I/O devices

Through periodic checking, the CPU obtains the status of the data and then starts to service it. With interrupt I/O, when an I/O device needs to transfer data, it uses an interrupt signal to request that the CPU start the transfer according to its priority. DMA I/O uses a device controller to transfer data directly between I/O devices and memory without the CPU. Table 3 compares program-controlled I/O, interrupt I/O, and DMA I/O.

                CPU Time
Polling I/O     Long
Interrupt I/O   Long
DMA I/O         Short

Table 3 Comparison of data transfers

2.3 AMULET

This section introduces the AMULET series of processors: the AMULET1, the AMULET2e, and the AMULET3i.

• AMULET1 [7]: The AMULET1 was presented in 1994. It is a test piece demonstrating the feasibility of a full custom asynchronous processor, designed with Sutherland's two-phase micropipeline method. AMULET1 is just a

between on chip and off chip. AMULET1 is an ARM6-compatible processor.

• AMULET2e [8]: The AMULET2e was presented in 1996. It is an embedded microprocessor with memory, cache, peripherals, and an AMULET2 core. AMULET2e is designed with a four-phase protocol. AMULET2e is an ARM7-compatible processor.

• AMULET3i [9]: The AMULET3i was presented in 2000. It contains an AMULET3 core and other asynchronous macrocells. The AMULET3i macrocell consists of a number of components connected together by a macrocell bus called the MARBLE bus. These components include RAM, a DMA controller, and others. AMULET3i is an ARM9-compatible processor.

2.3.1 AMULET3i DMA controller

The AMULET3i DMA controller is used primarily to transfer data between on chip and off chip [2]. The DMA controller (Figure 6) contains the Register bank control, the ARBITER, the Synchronous Peripheral Interface (SPI), the Transfer Engine, and the Initiator Interface. The Register bank control performs the operations of the register bank block. The ARBITER decides whether the data to be stored into the register bank block comes from the Transfer Engine or from the MARBLE bus. The Synchronous Peripheral Interface (SPI) controls the mapping of these incoming synchronous

channels. The SPI contains not only the channel mapping hardware but also request state machines for each channel. The Transfer Engine receives the requests that come from the SPI together with the channel register values and sends them to the Initiator Interface. The Transfer Engine has a main companion process: the Initiator Interface. The Initiator Interface actually performs the DMA transfer operation and also signals end-of-transfer to the register bank so that the register bank registers can finish updating.

A DMA transfer begins with a DMA request arriving on the SPI, and proceeds as follows. First, the SPI receives the DMA request and maps it onto a DMA controller channel. Then, the Transfer Engine receives the channel request and requests a copy of the register contents for the requesting channel from the Register bank control. After that, the Initiator Interface receives the channel register values from the Transfer Engine, and the source address and destination address are sent to the initiator I/F to carry out the transfers. When the transfer has actually been completed, the end-of-transfer signal to register bank control and the channel register values of the register bank control should

Figure 6 DMA controller structure

2.4 Problem

The AMULET3i DMA controller performs only a certain number of transfers for each DMA transfer. Because image processing on today's Digital Signal Processors often requires a lot of image data and the size of the fast on-chip memory is very limited, much of the image data is stored in external memory. As the frequency and processing capability of DSPs increase, the data rate has to fully utilize the available processing bandwidth. According to the experimental results in [10], when the same data is transferred, transfers with a small data size show a clear contrast in efficiency with transfers of a general (large) data size. Due to all of the above, the AMULET3i DMA controller could produce longer

2.5 Comparison

Figure 7 (a) shows the channel register data being sent from the transfer engine to the initiator interface. The next channel register data can be sent to the initiator interface only after the current channel register data has been completed in the initiator interface. So a latency is produced between the old channel register data and the new channel register data on the path from the transfer engine to the initiator interface. For general-size data, this latency can be ignored because a transfer takes a long time. For small-size data, however, the latency matters because a transfer takes only a short time [11]. Figure 7 (b) is presented here to reduce the latency between the old channel register data and the new channel register data.

Figure 7 (b) Architecture of two interfaces

Figure 8 (a) shows the original DMA controller, in which the transfer engine requests copies of the register contents three times and sends them to the initiator interface. Every data item incurs a setup time in the initiator interface. Figure 8 (b) shows our new DMA controller, which contains two initiator interfaces. Only the first data item incurs a visible setup time in the initiator interface. The setup times of the other data items are overlapped with the DMA transfer time. Therefore the latency between the

Figure 8 (a) Timeline of one interface
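The effect of overlapping setup with transfer in the two timelines can be illustrated with a simple, idealized timing model. The function names, the assumption of equal setup and transfer times for every channel, and the numbers in the usage example are all hypothetical; they are chosen only to show the arithmetic and are not measurements from this thesis.

```python
def total_time_one_interface(setup, transfer, channels):
    """One interface: every channel pays its setup time before its transfer."""
    return channels * (setup + transfer)


def total_time_two_interfaces(setup, transfer, channels):
    """Two interfaces: the next setup runs during the current transfer.

    Only the first setup is fully exposed. If a setup is longer than a
    transfer, the part that cannot be hidden is still paid.
    """
    if channels == 0:
        return 0
    exposed = max(0, setup - transfer)
    return setup + channels * transfer + (channels - 1) * exposed


if __name__ == "__main__":
    # Hypothetical time units: setup = 2, transfer = 10, three channels.
    print(total_time_one_interface(2, 10, 3))
    print(total_time_two_interfaces(2, 10, 3))
```

In this model the saving grows with the number of channel switches, and it is relatively larger when each transfer is short, which matches the motivation given for small-size data.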


Chapter 3 Design

In the previous chapter, we compared the advantages and disadvantages of asynchronous circuits and described the ways of transferring data between external memory and internal memory. We also described the benefits of using the AMULET3i Direct Memory Access Controller to transfer data for the processor. However, some points can still be improved when many small-size data transfers are performed repeatedly. This chapter first introduces the architecture and its overall operation. Based on the proposed architecture, it then discusses the data transfer flow and the data formats needed to meet the requirements. Finally, it describes the design architecture in detail.

3.1 Data flow

The data flow has two parts: one is used in the overall design and the other is used in the Address_Interface of the design. The first part is shown in Figure 9. This flow shows how data is transferred in the overall design. The design receives the channel values from the register bank block as input and decides the Counter that whether

Figure 9 Data flow 1

First, the request signal is sent to the DMA controller. If the request has arrived, the Transfer Engine receives the channel data. The Address_Interface receives the data from the Transfer Engine. The Counter receives the data from the Address_Interface and computes addresses. When the Count state of the Counter is “Yes”, the data transfer of the channel is finished. Otherwise, the Count state of the Counter is “No”, and the channel continues transferring.

The other data flow shows how the data can be stored in the Address_Interface from the

Figure 10 Data flow 2

The data is sent by the Transfer Engine. First, the Address_Interface checks whether an Interface is empty. If the Interfaces are full (the 11 symbol), the data cannot be stored into an Interface. Otherwise, the data can be stored. Depending on the state of the Interfaces, the data is stored into one of the two Interfaces. The 00 and 10 symbols indicate that the data is stored into Interface1, and the 01 symbol indicates that the data is stored into Interface2.

3.2 Register Formats

This section describes the register formats within a channel. These registers are transmitted from the register bank block, and the section explains how they are used in the overall design.

The value of each channel contains four registers: the SOURCE Address register, the DESTINATION Address register, the COUNT register, and the CONTROL register. The SOURCE register stores the 32-bit source address for transfers, and the DESTINATION register stores the 32-bit destination address for transfers. The COUNT register stores a 32-bit count of the remaining transfers to perform. The transfer is terminated when the COUNT register is decremented to zero. The CONTROL register specifies the updates to be performed on the other three registers and what this channel is connected to.

Figure 11 (a) Source register


Figure 11 (c) Count register

Figure 11 (d) Control register

Figure 11 (e) Content of the control register

The CONTROL register has 3 fields: Channel_No, DRQ, and Enable. If the Enable bit is set, this channel is considered for transfers when a new DMA request arrives. When the transfers of the channel finish, the Enable bit is reset to indicate that this channel can receive a new DMA request. The DRQ bit indicates the initial state of the request. If the DRQ bit is set, the data has actually arrived. Resetting the DRQ bit specifies that a request from the corresponding channel data should trigger a transfer. The Channel_No bits indicate which channel numbers receive the source and destination

3.3 Architecture

The architecture consists of three major blocks: the Transfer Engine, the Address_Interface, and the Counter (Figure 12).

Figure 12 Architecture

The three parts have different functions: the Transfer Engine receives the channel values, the Address_Interface sends the channel values out, and the Counter computes on the values from the Address_Interface. This architecture has two ports that receive the channel values from the register bank block, and one port that sends the source address and destination address out to the bus. In general, the input speed is faster than the output speed; the Counter can compute the next address and send it out when the current address transfer finishes. In addition, Counter can compute next address and

Initially, the design checks whether the request signal has arrived. If it has, the design starts to receive the channel values. This work is done in the Transfer Engine.

First, the Transfer Engine receives the channel values and sends the value of a channel out to the Address_Interface. The Address_Interface contains Interface_1 and Interface_2, and the value is stored in whichever Interface is empty. The Address_Interface sends the value out to the Counter while the Transfer Engine simultaneously sends the next value to the Address_Interface. When the current channel finishes, the Address_Interface immediately sends the value of the next channel out to the Counter. Our DMA controller design operates according to this method.

3.3.1 Transfer Engine

The Transfer Engine (Figure 13) has one input port, Channel_Data_in, and one output port, Engine_Data_out. Channel_Data_in carries the channel data from the register bank block of the AMULET3i DMA controller architecture. First, the Transfer Engine receives the channel values and stores them in the Engine Register. This is controlled by the control signal of COMPARE_1. When the data of the Engine

Figure 13 Transfer Engine

COMPARE_1 (Figure 14) receives the count stored in the Engine Register and the ACK signal. Through its control signal, COMPARE_1 decides whether the Channel_Data is stored into the Engine Register or the data of the Engine Register is reset. First, COMPARE_1 operates on the Engine Register count and the ACK signal to produce the COMPARE_1 control signal. When the COMPARE_1 signal is 0, the Channel_Data can be stored into the Engine Register. When the COMPARE_1 signal is 1, the Channel_Data cannot be stored into the Engine Register, and the data of the Engine Register is reset. Figure 15 shows the gate-level

Figure 14 COMPARE_1

Figure 15 The gate level of Detection

The Engine Register (Figure 16) receives Channel_Data_in and the COMPARE_1 control signal. When the control signal from COMPARE_1 is 0, four MUXes select the source_address, destination_address, count, and Ctrl respectively and store these data into four registers. On the other hand, when the control signal from COMPARE_1 is 1,

Figure 16 Engine Register

3.3.2 Address Interface

The Address Interface (Figure 17) has two input ports, Counter_Data and Engine_Data, and one output port, Interface_Data. The Counter_Data comes from the COUNTER each time after Interface_Data has been computed. The Engine_Data from the Transfer Engine is stored in registers. The Interface_Data is sent out to the COUNTER, and the COUNTER computes the source address, destination address, and count. The SELECT_1 and SELECT_2 signal are

used to store the data. The SELECT_2 signal is used to control the MUX. The MUX decides which of the data of the two Interfaces is sent out.

Figure 17 Address Interface

Based on SELECT_1 and SELECT_2, the DECISION block (Figure 18 (a)) decides into which of the two Interfaces the Counter_Data should be stored. It also decides into which of the two Interfaces the Engine_Data should be stored. The Counter_Data and Engine_Data cannot be stored into the same Interface simultaneously. Figure 18 (b) shows the components of the DECISION block: two DEMUXes and two MUXes. DEMUX_1 decides whether Counter_Data is sent to MUX_1 or MUX_2. DEMUX_2 decides whether Engine_Data is sent to MUX_1 or MUX_2. MUX_1 decides which of the two data is sent to Interface_1. MUX_2 decides which of the two data is sent to Interface_2.

Figure 18 (a) DECISION


The SELECT_1 Generator (Figure 19) receives the counts stored in the two Interfaces. Through the SELECT_1[1:0] signal, it decides into which of the two Interfaces the Engine_Data should be stored, or that the Engine_Data cannot be stored at all. First, Interface1_count and Interface2_count are processed by the Detection block to produce the SELECT_1[1:0] signal. If SELECT_1[1:0] is 00 or 10, the Engine_Data is stored into Interface_1. If SELECT_1[1:0] is 01, the Engine_Data is stored into Interface_2. If SELECT_1[1:0] is 11, the Engine_Data cannot be stored into either Interface_1 or Interface_2.
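The SELECT_1 decision above can be sketched in software. The mapping from SELECT_1[1:0] to the target Interface follows the text; how the Detection block derives the two bits is not spelled out here, so `select_1` below assumes, purely for illustration, that bit 0 is set when Interface_1 holds a nonzero count and bit 1 is set when Interface_2 holds a nonzero count. Both function names are hypothetical.

```python
def select_1(interface1_count, interface2_count):
    """Hypothetical Detection step producing SELECT_1[1:0].

    Assumption: a nonzero count means the Interface is occupied.
    Bit 0 reflects Interface_1, bit 1 reflects Interface_2.
    """
    bit0 = 1 if interface1_count != 0 else 0
    bit1 = 1 if interface2_count != 0 else 0
    return (bit1 << 1) | bit0


def target_interface(select):
    """Map SELECT_1[1:0] to the Interface that accepts Engine_Data."""
    mapping = {
        0b00: "Interface_1",  # both free: use Interface_1
        0b10: "Interface_1",
        0b01: "Interface_2",
        0b11: None,           # Engine_Data cannot be stored
    }
    return mapping[select]


if __name__ == "__main__":
    for counts in [(0, 0), (3, 0), (0, 4), (2, 5)]:
        print(counts, target_interface(select_1(*counts)))
```

Under this assumed bit assignment, the 00/10/01/11 mapping in the text is exactly "store into the first free Interface, and stall when both are occupied".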

Figure 19 SELECT_1 Generator

The SELECT_2 Generator (Figure 20) receives the SELECT_1[1:0] signal and the Comp_Dect signal that comes from the COUNTER. The SELECT_2 signal

SELECT_2 signal. The Interface_Data is updated from the COUNTER. The SELECT_2 signal also selects which of the data of the two Interfaces is sent out. Initially, the SELECT_2 signal and the Comp_Dect signal are both 0. When the Comp_Dect signal becomes 1, the SELECT_2 signal changes from 0 to 1, and later the Comp_Dect signal returns from 1 to 0. When the Comp_Dect signal becomes 1 again, the SELECT_2 signal changes from 1 to 0, and later the Comp_Dect signal returns from 1 to 0. So the SELECT_2 signal toggles 0 → 1 → 0 → 1 each time the Comp_Dect signal becomes 1.
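The toggling behaviour of SELECT_2 can be modelled as a flip on every rising edge of Comp_Dect. This is a behavioural sketch only: the class name `Select2Generator` and the sampled-signal interface are assumptions for the example, and the SELECT_1 input of the real block is left out because its role is not described in the surviving text.

```python
class Select2Generator:
    """Behavioural model: SELECT_2 flips on each 0 -> 1 change of Comp_Dect."""

    def __init__(self):
        self.select_2 = 0      # initial value, as stated in the text
        self._previous = 0     # last observed Comp_Dect value

    def step(self, comp_dect):
        # A rising edge marks the end of the current channel's transfer.
        if comp_dect == 1 and self._previous == 0:
            self.select_2 ^= 1
        self._previous = comp_dect
        return self.select_2


if __name__ == "__main__":
    generator = Select2Generator()
    print([generator.step(x) for x in [0, 1, 1, 0, 1, 0, 1]])
```

Holding Comp_Dect at 1 does not toggle SELECT_2 again; only a new 0 to 1 transition does.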

The MUX (Figure 21) receives Interface1_Data and Interface2_Data. It decides which one is sent out to the COUNTER, depending on the control signal SELECT_2.

Figure 21 MUX (inputs Interface1_Data and Interface2_Data, control SELECT_2, output Interface_Data)

3.3.3 Counter

The COUNTER (Figure 22 (a)) receives Interface_Data and the ACK signal. The COUNTER contains the counter that computes on Interface_Data. The counter (Figure 22

source address, destination address and count. The counter produces Counter_Data, which contains the source address, destination address, and count. The source address indicates the source of a data transfer, the destination address indicates the destination of a data transfer, and the count indicates the number of data transfers remaining. When a data transfer finishes and the counter receives an ACK signal of 1, the source address and the destination address are incremented by 4 and the count is decremented by 1. Figure 22 (c) shows how the Comp_Dect signal is produced from the Count and Ctrl fields of the Interface_Data. Initially, the Comp_Dect signal is 0. The count is equal to 0 when the transfer is finished. The DRQ bit is equal to 1 when the data has arrived. When the count is 0 and the DRQ bit is 1, the Comp_Dect signal becomes 1, which indicates that the current transfer is finished. Then, when the next Interface_Data arrives and starts the new data

Figure 22 (a) COUNTER
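The update rule of the counter and the Comp_Dect condition can be written out as a short behavioural sketch. The record type `InterfaceData`, the function names, the 32-bit wrap-around of the addresses, and the choice to leave the record unchanged once the count has reached 0 are assumptions made for this example; only the "+4, +4, -1 on ACK" rule and the "count is 0 and DRQ is 1" condition come from the text.

```python
from dataclasses import dataclass

MASK_32 = 0xFFFFFFFF  # the address registers are 32 bits wide


@dataclass(frozen=True)
class InterfaceData:
    source: int        # source address
    destination: int   # destination address
    count: int         # remaining transfers
    drq: int           # DRQ bit from the control field


def counter_step(data, ack):
    """Return the updated record after one acknowledged transfer."""
    if ack != 1 or data.count == 0:
        return data  # assumption: nothing changes without ACK or at count 0
    return InterfaceData(
        source=(data.source + 4) & MASK_32,
        destination=(data.destination + 4) & MASK_32,
        count=data.count - 1,
        drq=data.drq,
    )


def comp_dect(data):
    """Comp_Dect is 1 when the count is 0 and the DRQ bit is 1."""
    return 1 if data.count == 0 and data.drq == 1 else 0


if __name__ == "__main__":
    record = InterfaceData(0x1000, 0x2000, 2, 1)
    while not comp_dect(record):
        record = counter_step(record, ack=1)
        print(record)
```

The step of 4 corresponds to one 32-bit word per transfer when addresses are counted in bytes.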


Chapter 4 Simulation

In chapter 3, we described the proposed design of each function block in detail. We use Verilog HDL to build all of the components and construct the whole architecture from the sub-modules described in chapter 3.

We implemented the improved section of the asynchronous DMA controller at gate level. The design was synthesized with the TSMC 0.13 um library and simulated with ModelSim 6.0. The experimental group is the newly proposed architecture, and the control group is the original Amulet3i DMA controller. We compare our new DMAC design with the original Amulet3i design.

4.1 Area Simulation

The area of each sub-module is shown in Table 4. The area of the Address Interface is the largest, because the Address Interface contains complicated control signals and elements. The area of the Transfer Engine is 3119.8 um^2, the area of the Address Interface is 8261.2 um^2, and the area of the Counter is 6996.7 um^2. The total area is 18501.7 um^2.

Module              Area (um^2)
Transfer Engine     3119.8
Address Interface   8261.2
Counter             6996.7
Total               18501.7

Table 4 The improved architecture

Module              Area (um^2)
Transfer Engine     3119.8
Address Interface   2909.4
Counter             6908.4
Total               13054.7

Table 5 The original architecture

4.2 Timing Simulation

This section presents two simulation results: one for the original architecture and one for our new architecture. Figure 23 (a) shows the timing of the last data of channel 1 with the original architecture, and Figure 23 (b) shows the timing of the first data of channel 2 with the original architecture. Figure 24 (a) shows the timing of the last data of channel 1 with the new architecture, and Figure 24 (b) shows the timing of the first data of channel 2 with the new architecture.

Figure 23 (a) The last data of the original architecture

Figure 24 (a) The last data of the improved architecture

Table 6 compares the latency of the original and new architectures. Here, latency means the time between the last data of the current channel and the first data of the next channel. The new latency (104 ps) is lower than the original latency (1026 ps), so the new architecture reduces the latency between different channels. Table 7 compares the ratio of latency to transfer time for different counts. The original time is the data transfer time of the original initiator interface, and the new time is the data transfer time of the new initiator interface. Because the new circuit is more complicated than the original one, the new time in Table 7 is about 5% larger than the original time (for example, 10570 ps versus 10075 ps when the count is 5). When the count is 5, the latency ratio improves by 9 percentage points (from 10% to 1%). When the count is 10, the improvement is 4.5 percentage points, and when the count is 15, it is 3 percentage points. Therefore, the smaller the count, the more obvious the latency improvement.

             Latency (ps)
Original     1026
New          104

Table 6 Latency comparison of the original and new architectures

                        Count = 5      Count = 10     Count = 15
Original Time (ps)      10075          20150          30225
Original Latency (ps)   1026 (10%)     1026 (5%)      1026 (3.3%)
New Time (ps)           10570          21140          31710
New Latency (ps)        104 (1%)       104 (0.5%)     104 (0.33%)

Table 7 Ratio of latency to transfer time for different counts
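
The percentages in the latency tables can be reproduced with a few lines of arithmetic. This is a sketch under the assumption that each percentage is the channel-switch latency divided by the data transfer time for that count. The names `latency_ratio` and `improvement_points` are illustrative and do not come from the thesis.

```python
# Measured values from the latency tables (all in picoseconds).
ORIGINAL_TIME_PS = {5: 10075, 10: 20150, 15: 30225}
NEW_TIME_PS = {5: 10570, 10: 21140, 15: 31710}
ORIGINAL_LATENCY_PS = 1026
NEW_LATENCY_PS = 104


def latency_ratio(latency_ps: float, time_ps: float) -> float:
    """Latency as a percentage of the data transfer time."""
    return latency_ps / time_ps * 100.0


def improvement_points(count: int) -> float:
    """Drop in latency ratio (percentage points) for a given count."""
    before = latency_ratio(ORIGINAL_LATENCY_PS, ORIGINAL_TIME_PS[count])
    after = latency_ratio(NEW_LATENCY_PS, NEW_TIME_PS[count])
    return before - after


# Relative reduction of the channel-switch latency itself.
reduction = (ORIGINAL_LATENCY_PS - NEW_LATENCY_PS) / ORIGINAL_LATENCY_PS * 100.0
print(f"Latency reduction: {reduction:.1f}%")

for count in sorted(ORIGINAL_TIME_PS):
    print(f"count={count}: improvement {improvement_points(count):.2f} points")
```

Because the latency is a fixed cost per channel switch while the transfer time grows with the count, the ratio shrinks as the count grows. This matches the observation that small transfers benefit most.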

Chapter 5 Conclusion

Because the performance of DMA data transfer is a bottleneck for multimedia applications, it is important to improve the performance of the DMA controller. In this thesis, we propose an improved version of the Amulet3i DMA controller that reduces the latency of DMA data transfer. The proposed architecture reduces the latency between the transfer engine and the address interface when switching between different channels. We use asynchronous circuit design to avoid the problems of a global clock.

The proposed architecture is intended to be applied in an asynchronous digital signal processor. The new latency is about 90% lower than the original latency. Chapter 4 shows that the new architecture performs well when many small transfers are carried out. Because DSP processing should be overlapped with DMA processing, using the proposed architecture together with the data size format gives not only lower latency but also better performance. Especially when a sequence of small-size data is transferred, the new improvement

