
CHAPTER 1 INTRODUCTION

1.4 Thesis Organization

In Chapter 2, the characteristics of modern on-chip buses and DRAM are introduced. Chapter 3 presents our multimedia platform simulator and the memory controller scheduling policies implemented in it. Then, in Chapter 4, we show the simulation results and their analysis. Chapter 5 implements a memory controller according to the simulation results of Chapter 4. Chapter 6 gives the conclusion and future work.

Chapter 2 Overview of Modern On-chip Bus and

DRAM

This chapter is divided into two parts that introduce the modern on-chip bus and DRAM. In Section 2.1, a modern on-chip bus specification is presented. After that, in Section 2.2, we introduce DRAM basics and how to estimate DRAM power consumption at the system level.

2.1 Advanced Microcontroller Bus Architecture (AMBA)

The AMBA protocol [12] is an open-standard, on-chip bus specification developed by ARM Limited. It is currently the most widely adopted on-chip bus standard.

The latest version of AMBA is 3.0, also called the Advanced eXtensible Interface (AXI) [13]. AXI was first introduced at the Embedded Processor Forum (EPF) in 2003, and its version 1.0 specification was announced in March 2004. The most distinctive feature of AXI is out-of-order transaction completion, which makes it well suited to high-performance systems and relaxes the constraints on the memory controller.

2.1.1 AXI Architecture

Fig. 2-1 shows a generic AXI architecture. Five independent channels are in charge of communication between the master and slave: the write address channel, read address channel, write data channel, write response channel, and read data channel. Each channel contains a set of forward signals and one feedback signal. The feedback READY signal cooperates with the forward VALID signal to perform channel handshaking for data and control information transfer. Channel handshaking is described in Section 2.1.2.

Fig. 2-1 Generic AXI architecture

When the master initiates a read transaction, it sends address and control information to the slave via the read address channel. When the slave receives the address and control information, it starts to work. After the slave finishes its task, data are sent back to the master via the read data channel. The read transaction is not complete until the last burst data element is accepted by the master.

As for a write transaction, the master first sends address and control information to the slave via the write address channel. Then, the master provides the required data to the slave via the write data channel. Finally, after the slave finishes its task, a response is sent back through the write response channel. The master checks the response to see whether the write transaction succeeded. Fig. 2-2 presents the processes of read and write transactions respectively.

Fig. 2-2 (a) Channel architecture of reads (b) Channel architecture of writes

2.1.2 Channel Handshaking

All five channels use VALID/READY handshaking to transfer data and control information. This mechanism enables both the master and slave to control the transfer rate of data and control information. The source raises the VALID signal to indicate that data or control information is available. The destination raises the READY signal to indicate that data or control information can be accepted. A transfer occurs only when the VALID and READY signals are both HIGH.
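The handshake rule above reduces to a per-cycle condition. The following Python sketch is our own illustration (not part of the AXI specification); it samples VALID and READY each cycle and reports when a transfer occurs:

```python
def transfer_cycles(valid, ready):
    """Return the cycles in which a transfer occurs on one AXI channel.

    valid, ready: per-cycle boolean lists driven by the source and the
    destination. A transfer happens only when both are HIGH in the
    same cycle, matching the handshake rule described above.
    """
    return [t for t, (v, r) in enumerate(zip(valid, ready)) if v and r]

# Case (a) of Fig. 2-3: VALID asserted before READY; the source must
# hold VALID (and the data) stable until READY arrives in cycle 2.
assert transfer_cycles([True, True, True], [False, False, True]) == [2]
# Case (b): READY asserted before VALID; the transfer waits for VALID.
assert transfer_cycles([False, False, True], [True, True, True]) == [2]
```

Either side may assert its signal first; only the conjunction matters, which is what lets both ends throttle the transfer rate.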

Fig. 2-3 (a) VALID before READY (b) READY before VALID (c) VALID with READY

Fig. 2-3 shows all possible cases of VALID/READY handshaking. Note that the source provides valid data and control information and drives the VALID signal HIGH simultaneously. The arrows in Fig. 2-3 indicate when the transfer occurs.

2.1.3 Transaction Ordering

Unlike AMBA 2.0, in which only one granted transaction can use the common system bus interconnect until it is finished, AXI decouples the channels and uses an “ID tag” to enable out-of-order transaction completion.

Out-of-order transactions improve system performance in two ways:

● The bus interconnect can enable transactions with fast-responding slaves to complete ahead of earlier transactions with slower slaves.

● Complex slaves can return read data out of order. For example, data for a later transaction might be available in an internal buffer before the data for an earlier transaction is ready.

Although AXI supports out-of-order transactions, this does not mean that transactions can be reordered arbitrarily. The basic rule is that transactions with the same ID tag must be completed in order. That is, if a master requires transactions to be completed in the same order as they are issued, it must issue those transactions with the same ID tag. If, however, a master does not require in-order transaction completion, it can supply transactions with different ID tags.

The rule stated above applies to a single-master system. In a multi-master system, the bus interconnect has to append additional information to the ID tag to ensure that ID tags are unique across all masters.
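The ordering rule can be captured as a small checker. The sketch below is illustrative only (the function and transaction names are ours); it validates a completion order against the issue order and ID tags:

```python
def legal_completion_order(issue_order, completion_order):
    """Check AXI's ordering rule: transactions sharing an ID tag must
    complete in issue order; different ID tags may complete in any order.

    issue_order: list of (name, id_tag) pairs in the order issued.
    completion_order: list of transaction names in completion order.
    """
    issue_pos = {name: i for i, (name, _) in enumerate(issue_order)}
    tag_of = dict(issue_order)
    last_pos = {}  # id_tag -> issue position of the latest completion seen
    for name in completion_order:
        tag = tag_of[name]
        if issue_pos[name] < last_pos.get(tag, -1):
            return False  # an earlier same-ID transaction was overtaken
        last_pos[tag] = issue_pos[name]
    return True

issued = [("A", 0), ("B", 0), ("C", 1)]
assert legal_completion_order(issued, ["C", "A", "B"])      # C may pass A and B
assert not legal_completion_order(issued, ["B", "A", "C"])  # B must follow A
```

With distinct tags the interconnect is free to reorder; with a shared tag it must preserve issue order, exactly as the rule states.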

2.1.4 Additional Features

● Burst types

AXI supports three different burst types which are suitable for:

○ Normal memory accesses

○ Wrapping cache line bursts

○ Streaming data to peripheral FIFO locations

● System cache support

The cache-support signal of AXI enables a master to provide to a system-level cache the bufferable, cacheable, and allocate attributes of a transaction.

● Protection unit support

To enable both privileged and secure accesses, AXI provides three levels of protection unit support.

● Atomic operations

AXI defines mechanisms for both exclusive and locked accesses.

● Error support

AXI provides error support for both address decode errors and slave-generated errors.

● Unaligned address

AXI supports unaligned burst start addresses to enhance the performance of the initial accesses within a burst.

2.2 Modern DRAM

Modern DRAM incurs a high initialization cost for each new burst access due to its operating characteristics. Thus, minimizing this cost is an important issue for the memory controller.

2.2.1 DRAM Basics

Fig. 2-4 Simplified DRAM architecture

Fig. 2-4 shows a simplified DRAM architecture. In general, there are four banks of memory arrays with corresponding row and column decoders in one memory chip.

Each bank of the memory array consists of rows, and each row consists of columns. The data width of one column equals that of the DRAM data bus. The DRAM density is the product of the number of banks per chip, the number of rows per bank, the number of columns per row, and the data width of one column.
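As a worked example of this product (the device configuration below is hypothetical, chosen only to make the arithmetic concrete):

```python
def dram_density_bits(banks, rows_per_bank, cols_per_row, col_width):
    """Density = banks x rows per bank x columns per row x column width.
    The column width equals the DRAM data bus width, as noted above."""
    return banks * rows_per_bank * cols_per_row * col_width

# Hypothetical part: 4 banks, 8192 rows per bank, 512 columns per row,
# x16 data bus -> 4 * 8192 * 512 * 16 bits.
assert dram_density_bits(4, 8192, 512, 16) == 256 * 2**20  # a 256 Mb device
```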

When there is a read or write access, the accessed row must first be loaded into the sense amplifiers of the corresponding bank. Then, columns are read from or written to the sense amplifiers. If the next access is to the same bank and row, columns can be accessed directly without reloading the row. However, if the next access is to a different row in the same bank, the DRAM has to write the current row back to the memory array from the sense amplifiers and load the needed one.

The mode register stores DRAM settings such as burst length, burst type, and CAS latency. It should be configured during power-up initialization.

The memory array stores data in small capacitors which lose charge over time. To retain data integrity, the DRAM needs to recharge these capacitors. This is done by loading data into the sense amplifiers and writing it back, row by row. The refresh counter generates the necessary row addresses.

2.2.2 DRAM Operations

We now introduce DRAM operations in the terms used in the JEDEC standards [14][15]. Since there are only slight differences among DRAM types, we take DDR SDRAM as a representative.

A. Activation

When a bank is in the idle state, a row must be “opened” before any READ or WRITE command can be issued to that bank. Opening a row means loading it from the memory array into the sense amplifiers. This operation is accomplished by the ACTIVE command.

After the ACTIVE command, a delay of tRCD is required before a READ or WRITE command can be issued to that row. A subsequent ACTIVE command to a different row in the same bank cannot be issued until the active row has been “closed”, which takes at least tRC. However, a subsequent ACTIVE command to another bank only requires a tRRD latency.
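These activation constraints can be summarized in a small checker. The sketch below is our own illustration; the cycle counts are placeholders, not data-sheet values:

```python
class ActivateTimingChecker:
    """Track the tRCD / tRC / tRRD constraints described above.

    All times are in cycles; the default values are illustrative only.
    """
    def __init__(self, tRCD=3, tRC=11, tRRD=2):
        self.tRCD, self.tRC, self.tRRD = tRCD, tRC, tRRD
        self.last_act = {}        # bank -> cycle of its last ACTIVE
        self.last_act_any = None  # cycle of the last ACTIVE to any bank

    def can_activate(self, bank, now):
        if bank in self.last_act and now - self.last_act[bank] < self.tRC:
            return False  # same bank: the open row must be closed first (tRC)
        if self.last_act_any is not None and now - self.last_act_any < self.tRRD:
            return False  # another bank: only tRRD is required
        return True

    def activate(self, bank, now):
        assert self.can_activate(bank, now)
        self.last_act[bank] = now
        self.last_act_any = now

    def can_read_write(self, bank, now):
        # READ/WRITE must wait tRCD after the bank's ACTIVE command.
        return bank in self.last_act and now - self.last_act[bank] >= self.tRCD

chk = ActivateTimingChecker()
chk.activate(0, 0)
assert not chk.can_read_write(0, 2)   # tRCD not yet satisfied
assert chk.can_read_write(0, 3)
assert not chk.can_activate(1, 1)     # different bank, blocked by tRRD
assert chk.can_activate(1, 2)
assert not chk.can_activate(0, 10)    # same bank, blocked by tRC
assert chk.can_activate(0, 11)
```

The asymmetry between tRC (same bank) and tRRD (different banks) is what makes bank interleaving profitable for a scheduler.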

B. Read

A read burst is initiated by a READ command with the bank and starting column address. During a read burst, the first valid data-out element from the starting column address becomes available CAS latency after the READ command is issued.

C. Write

A write burst is initiated by a WRITE command with the bank and starting column address. During a write burst, the first valid data-in element is registered tDQSS after the WRITE command.

After the last valid data-in element is registered, tWTR is required before a READ command to any bank, and tWR before a PRECHARGE command to the same bank.

D. Precharge

This operation writes the active row in the sense amplifiers back to the memory array. The bank becomes available for a subsequent row activation tRP after the PRECHARGE command is issued.

E. Refresh

The refresh operation retains data integrity in the memory array. The AUTO REFRESH command is used to initiate this operation every tREFC interval, and tRFC must elapse between two successive AUTO REFRESH commands. Note that the AUTO REFRESH command can only be issued when all banks are idle.

F. Power-down

The DDR SDRAM standard defines three power-down modes: precharge power-down, active power-down, and self refresh.

Precharge power-down is entered when CKE is registered LOW and all banks are idle. Active power-down is entered when CKE is registered LOW and a row is active in any bank. Self refresh is entered when CKE is registered LOW with all banks idle and an AUTO REFRESH command issued.

Precharge power-down and active power-down do not refresh the memory array automatically, so the power-down duration is limited by tREFC. Self refresh has no such limitation.

Since precharge power-down and active power-down disable fewer functional units, they save less power but cost only a few cycles to return to the original state. Self refresh disables almost all functional units, so it saves more power at the expense of several hundred cycles to return.

Fig. 2-5 Simplified DRAM state diagram

Scope              | Parameter                               | Symbol
-------------------|-----------------------------------------|-------
Affect single bank | ACTIVE to READ or WRITE delay           | tRCD
                   | ACTIVE to PRECHARGE command             | tRAS
                   | ACTIVE to ACTIVE command                | tRC
                   | WRITE to first DQS latching transition  | tDQSS
                   | Write recovery time                     | tWR
                   | PRECHARGE command period                | tRP
Affect all banks   | ACTIVE bank a to ACTIVE bank b command  | tRRD
                   | Last write data to READ command delay   | tWTR
                   | Longest tolerable refresh interval      | tREFC
                   | AUTO REFRESH command period             | tRFC

Table 2-1 Key DDR SDRAM timings

Fig. 2-5 shows a simplified DRAM state diagram to clarify the relationships between the operations. Table 2-1 lists the key DDR SDRAM timings.

2.2.3 DRAM Power Calculation

Jeff Janzen published Calculating Memory System Power for DDR SDRAM in Micron designline, Q2 2001 [16]. The article analyzes how DDR SDRAM consumes power and develops a method to calculate memory system power. This method enables memory sub-system power estimation during high-level system evaluation, before low-level hardware implementation.

According to the DDR SDRAM operations, memory system power consists of precharge power-down power, precharge standby power, active power-down power, active standby power, activate power, write power, read power, I/O power, and refresh power. Table 2-2 lists the IDD specifications, which can be looked up in the data sheet. Table 2-3 lists the parameters defined for the equations in this section. All of these parameters are used in the power consumption calculation.

Parameter / Condition                                                              | Symbol
-----------------------------------------------------------------------------------|-------
OPERATING CURRENT: One bank; ACTIVE-PRECHARGE; tRC = tRC MIN; tCK = tCK MIN        | IDD0
PRECHARGE POWER-DOWN STANDBY CURRENT: All banks idle; Power-down mode;             | IDD2P
tCK = tCK MIN; CKE = LOW                                                           |
IDLE STANDBY CURRENT: CS = HIGH; All banks idle; tCK = tCK MIN; CKE = HIGH         | IDD2F
ACTIVE POWER-DOWN STANDBY CURRENT: One bank; Power-down mode;                      | IDD3P
tCK = tCK MIN; CKE = LOW                                                           |
ACTIVE STANDBY CURRENT: CS = HIGH; One bank; tCK = tCK MIN; CKE = HIGH             | IDD3N
OPERATING CURRENT: Burst = 2; READs; Continuous burst; One bank active;            | IDD4R
tCK = tCK MIN; IOUT = 0 mA                                                         |
OPERATING CURRENT: Burst = 2; WRITEs; Continuous burst; One bank active;           | IDD4W
tCK = tCK MIN                                                                      |
AUTO REFRESH CURRENT: tRC = 15.625 µs                                              | IDD5

Table 2-2 IDD specifications used in power consumption calculation

Parameter    | Description
-------------|------------------------------------------------------
VDDsys       | VDD at which the system drives the DDR SDRAM
FREQsys      | Frequency at which the system operates the DDR SDRAM
p(perDQ)     | Output power of a single DQ
BNK_PRE%     | Percentage of time all banks are precharged
CKE_LO_PRE%  | Percentage of precharge time during which CKE is LOW
CKE_LO_ACT%  | Percentage of active time during which CKE is LOW
tACT         | Average time between ACTIVE commands
RD%          | Percentage of time that output reads data
WR%          | Percentage of time that input writes data
num_of_DQ    | Number of DDR SDRAM DQ pins
num_of_DQS   | Number of DDR SDRAM DQS pins

Table 2-3 Parameters defined for the equations

Fig. 2-6 shows the current usage of a DDR SDRAM device as CKE transitions. The current profile illustrates how to calculate precharge power-down and precharge standby power. Active power-down and active standby power can be calculated similarly.

Fig. 2-6 Precharge power-down and standby current [16]

Precharge power-down power

p(PRE_PDN) = IDD2P * VDD * BNK_PRE% * CKE_LO_PRE%

Precharge standby power

p(PRE_STBY) = IDD2F * VDD * BNK_PRE% * (1 – CKE_LO_PRE%)

Active power-down power

p(ACT_PDN) = IDD3P * VDD * (1 – BNK_PRE%) * CKE_LO_ACT%

Active standby power

p(ACT_STBY) = IDD3N * VDD * (1 – BNK_PRE%) * (1 – CKE_LO_ACT%)

Fig. 2-7 Activate current [16]

In Fig. 2-7, it is clear that each pair of ACTIVE and PRECHARGE commands consumes the same energy. Thus, activate power can be calculated by dividing the total energy of all ACTIVE-PRECHARGE pairs by time.

Activate power

p(ACT) = (IDD0 – IDD3N) * tRC(spec) * VDD / tACT

Fig. 2-8 Write current [16]

Fig. 2-8 shows that IDD4W is required for write data input.

Write power

p(WR) = (IDD4W – IDD3N) * VDD * WR%

Fig. 2-9 Read current with I/O power [16]

In Fig. 2-9, since the DRAM device drives external logic to output read data during a read access, extra I/O power is needed.

Read power

p(RD) = (IDD4R – IDD3N) * VDD * RD%

I/O power

p(DQ) = p(perDQ) * (num_of_DQ + num_of_DQS) * RD%

The last power component is refresh power and its equation is shown below.

Refresh power

p(REF) = (IDD5 – IDD2P) * VDD

So far, all equations use IDD values measured under the operating conditions listed in the data sheet. However, the actual system may apply a VDD and operating frequency other than those in the data sheet. Thus, the preceding equations have to be scaled by supply voltage and operating frequency.

P(PRE_PDN) = p(PRE_PDN) * (use VDD)^2 / (spec VDD)^2

P(ACT_PDN) = p(ACT_PDN) * (use VDD)^2 / (spec VDD)^2

P(PRE_STBY) = p(PRE_STBY) * (use freq)^2 / (spec freq)^2 * (use VDD)^2 / (spec VDD)^2

P(ACT_STBY) = p(ACT_STBY) * (use freq)^2 / (spec freq)^2 * (use VDD)^2 / (spec VDD)^2

P(ACT) = p(ACT) * (use VDD)^2 / (spec VDD)^2

P(WR) = p(WR) * (use freq)^2 / (spec freq)^2 * (use VDD)^2 / (spec VDD)^2

P(RD) = p(RD) * (use freq)^2 / (spec freq)^2 * (use VDD)^2 / (spec VDD)^2

P(DQ) = p(DQ) * (use freq)^2 / (spec freq)^2

P(REF) = p(REF) * (use VDD)^2 / (spec VDD)^2

Then, the scaled power components are summed to obtain the total power consumption.

P(TOTAL) = P(PRE_PDN) + P(PRE_STBY) + P(ACT_PDN) + P(ACT_STBY) + P(ACT) + P(WR) + P(RD) + P(DQ) + P(REF)
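The scaling and the final sum can be folded into one routine. The sketch below follows the equations above verbatim (voltage and, as written in this text, frequency both scale quadratically, while P(DQ) scales with frequency only); the component values in the example are arbitrary placeholders:

```python
def total_power(p, vdd_use, vdd_spec, freq_use, freq_spec):
    """Scale each unscaled component p[...] (from the earlier equations)
    to the system's VDD and frequency, then sum them up."""
    v = (vdd_use / vdd_spec) ** 2
    f = (freq_use / freq_spec) ** 2
    vdd_only = ("PRE_PDN", "ACT_PDN", "ACT", "REF")
    vdd_freq = ("PRE_STBY", "ACT_STBY", "WR", "RD")
    return (sum(p[k] for k in vdd_only) * v
            + sum(p[k] for k in vdd_freq) * v * f
            + p["DQ"] * f)  # I/O power scales with frequency only

# Arbitrary 1 mW per component; spec VDD matched, frequency halved:
# 4 components * 1, 4 components * 0.25, plus DQ * 0.25 = 5.25 mW.
p = {k: 1.0 for k in ("PRE_PDN", "PRE_STBY", "ACT_PDN", "ACT_STBY",
                      "ACT", "WR", "RD", "DQ", "REF")}
assert total_power(p, 2.5, 2.5, 100e6, 200e6) == 5.25
```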

Chapter 3

Multimedia Platform Modeling

In this chapter, the development of the multimedia platform simulator is introduced. Section 3.1 briefly explains why we need a simulator. Section 3.2 presents a generic multimedia platform for modeling. Sections 3.3, 3.4, and 3.5 describe each portion of the simulator.

3.1 Introduction

When starting to build our simulation environment, a key problem is how to balance coding time, flexibility, simulation speed, and accuracy of the simulator.

HDL does not seem to be a good choice. First, it is developed from a hardware point of view, which means more regularity and less flexibility. Coding at the hardware level has to follow many constraints, so more coding time is required and parameterization is limited. Second, a hardware implementation considers all signals, whereas we only care about some of them. Thus, eliminating the useless parts to further speed up the simulator is preferable.

Is there a simple solution that provides short coding time, good flexibility, fast simulation speed, and, most importantly, fine accuracy? SystemC [17] meets these requirements and is chosen to construct our simulation environment.

SystemC provides hardware-oriented constructs within the context of C++, as a class library implemented in standard C++. SystemC also provides an interoperable modeling platform which enables the development and exchange of very fast system-level C++ models. Thus, we can use C++ to implement a signal-simplified simulator while keeping cycle accuracy.

3.2 Multimedia Platform

A generic multimedia SoC platform is shown in Fig. 3-1. Eight masters and one slave are connected by the AXI bus interconnect. The eight masters are the CPU, DSP, accelerator, network, video in, video out, audio in, and audio out; the only slave is the memory controller. The CPU, DSP, and accelerator are the main data processing units. Network, video in, video out, audio in, and audio out are bridges to peripherals which handle internal and external data exchange. The memory controller serves the eight masters' accesses to data in the off-chip DRAM.

Fig. 3-1 Generic multimedia SoC platform

Fig. 3-2 shows the multimedia platform simulator block diagram. The scenario driver initiates one session of accesses of a master by enabling the corresponding master enable signal. One session of accesses means that the master generates transactions for data accesses according to its access pattern for one iteration. Eight different access patterns are used to model the behaviors of the masters in the generic multimedia SoC platform shown in Fig. 3-1.

Fig. 3-2 Multimedia platform simulator block diagram

All data accesses conform to the AXI protocol. However, to ease the development of the simulator, we merge the two AXI address channels, read and write, into one. This simplification does not affect AXI protocol compliance. The AXI network is responsible for channel arbitration with two common arbitration schemes, fixed priority and round-robin.
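The two arbitration schemes reduce to simple grant functions. The sketch below is our own illustration of the idea, not the simulator's actual code:

```python
def fixed_priority_grant(requests):
    """Fixed priority: the lowest-numbered requesting channel wins."""
    return next((i for i, r in enumerate(requests) if r), None)

def round_robin_grant(last_granted, requests):
    """Round-robin: grant the first requesting channel after the
    previously granted one, scanning circularly; None if no requests."""
    n = len(requests)
    for step in range(1, n + 1):
        candidate = (last_granted + step) % n
        if requests[candidate]:
            return candidate
    return None

assert fixed_priority_grant([False, True, True]) == 1
assert round_robin_grant(0, [True, False, True, False]) == 2
assert round_robin_grant(2, [True, False, True, False]) == 0  # wraps around
```

Fixed priority favors latency-critical masters at the risk of starving the rest; round-robin trades peak priority for fairness.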

The memory controller connects to a simplified memory model. The memory model removes unnecessary operations such as refresh and power-down and simplifies the input/output interface to facilitate use.

3.3 Master Modeling

Modeling a master can be thought of as generating transactions that follow its behavior. According to the AXI protocol, each transaction must possess at least four attributes: ID, access type, destination address, and data to write. Here, the methods we use to generate transactions in our multimedia platform simulator are introduced.

3.3.1 ID Generation

Fig. 3-3 ID tag format

Since the multimedia platform is a multi-master platform, master information should be appended to ID tags to ensure their uniqueness.

We use 8-bit ID tags in the simulator; the format is shown in Fig. 3-3. The most significant 3 bits are the master ID, and the remaining 5 bits are the transaction ID.

Although transactions of the same master complete in order in our simulator, making the transaction ID redundant, we still give each transaction a transaction ID to check the functional correctness of the simulator.
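The 8-bit format of Fig. 3-3 can be expressed as a pack/unpack pair. The field widths come from the text; the helper names are ours:

```python
MASTER_BITS, TXN_BITS = 3, 5  # 8-bit ID tag, per Fig. 3-3

def pack_id(master_id, txn_id):
    """Place the master ID in the top 3 bits, the transaction ID below."""
    assert 0 <= master_id < (1 << MASTER_BITS)
    assert 0 <= txn_id < (1 << TXN_BITS)
    return (master_id << TXN_BITS) | txn_id

def unpack_id(tag):
    return tag >> TXN_BITS, tag & ((1 << TXN_BITS) - 1)

assert pack_id(5, 9) == 0b10101001       # master 0b101, transaction 0b01001
assert unpack_id(0b10101001) == (5, 9)
```

Because the master ID occupies the top bits, tags from different masters can never collide, which is exactly the uniqueness requirement of Section 2.1.3.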

3.3.2 Type and Address Generation

Given the DRAM operating characteristics, transaction type and address clearly affect DRAM access performance the most. Thus, transaction type and address generation is the most important part of master modeling.

An intuitive way to generate transactions is to build a behavioral model for each master. Although this method is the most precise, implementing each master is time-consuming. For efficiency and flexibility, we use a configurable transaction generator instead.

The configurable transaction generator supports three access types and three address types. The three access types are read, write, and no operation. The three address types are 1-D, 2-D, and constrained random.

Fig. 3-4 shows how addresses are generated for the three address types. Fig. 3-4(a) shows the 1-D address type, which increases the address from a base address by a fixed offset. The offset is determined by the size of the data transferred in one access. Most masters in the multimedia platform shown in Fig. 3-1 use the 1-D address type.
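The 1-D pattern reduces to base-plus-offset stepping; a short sketch (the 64-byte offset below is only an example of a per-access transfer size):

```python
def addresses_1d(base, offset, count):
    """1-D address type: step from the base address by a fixed offset,
    the offset being the size of data transferred in one access."""
    return [base + i * offset for i in range(count)]

# Example: four 64-byte accesses starting at 0x1000.
assert addresses_1d(0x1000, 64, 4) == [0x1000, 0x1040, 0x1080, 0x10C0]
```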

Fig. 3-4(b) shows the 2-D address type. Unlike the 1-D address type, each row has a start address and an end address. Thus, the address cannot be increased directly past the end of a row. When the end address of a row is reached, the address jumps to the start address of the next row and continues increasing. The boundary between the

