
CHAPTER 2 OVERVIEW OF MODERN ON-CHIP BUS AND DRAM

2.2 Modern DRAM

2.2.3 DRAM Power Calculation

Jeff Janzen published "Calculating Memory System Power for DDR SDRAM" in Micron designline, Q2 2001 [16]. The article analyzes how DDR SDRAM consumes power and develops a method to calculate memory system power. This method supports estimation of memory sub-system power consumption during high-level system evaluation, before low-level hardware implementation.

According to DDR SDRAM operations, memory system power consists of precharge power-down power, precharge standby power, active power-down power, active standby power, activate power, write power, read power, I/O power, and refresh power. Table 2-2 lists the IDD specifications, which can be looked up in the data sheet.

Table 2-3 lists the parameters defined for the equations in this article; all of them are used in the power consumption calculation.

Parameter / Condition | Symbol
OPERATING CURRENT: One bank; ACTIVE-PRECHARGE cycles; tRC = tRC MIN; tCK = tCK MIN | IDD0
PRECHARGE POWER-DOWN STANDBY CURRENT: All banks idle; Power-down mode; tCK = tCK MIN; CKE = LOW | IDD2P
IDLE STANDBY CURRENT: CS = HIGH; All banks idle; tCK = tCK MIN; CKE = HIGH | IDD2F
ACTIVE POWER-DOWN STANDBY CURRENT: One bank; Power-down mode; tCK = tCK MIN; CKE = LOW | IDD3P
ACTIVE STANDBY CURRENT: CS = HIGH; One bank; tCK = tCK MIN; CKE = HIGH | IDD3N
OPERATING CURRENT: Burst = 2; READs; Continuous burst; One bank active; tCK = tCK MIN; IOUT = 0 mA | IDD4R
OPERATING CURRENT: Burst = 2; WRITEs; Continuous burst; One bank active; tCK = tCK MIN | IDD4W
AUTO REFRESH CURRENT: tRC = 15.625 µs | IDD5

Table 2-2 IDD specifications used in power consumption calculation

Parameter | Description
VDDsys | VDD at which the system drives the DDR SDRAM.
FREQsys | Frequency at which the system operates the DDR SDRAM.
p(perDQ) | Output power of a single DQ.
BNK_PRE% | Percentage of time all banks are precharged.
CKE_LO_PRE% | Percentage of precharge time during which CKE is LOW.
CKE_LO_ACT% | Percentage of active time during which CKE is LOW.
tACT | Average time between ACTIVE commands.
RD% | Percentage of time the device outputs read data.
WR% | Percentage of time the device inputs write data.
num_of_DQ | Number of DDR SDRAM DQ pins.
num_of_DQS | Number of DDR SDRAM DQS pins.

Table 2-3 Parameters defined for equations

Fig. 2-6 shows the current usage on a DDR SDRAM device as CKE transitions. The current profile illustrates how to calculate precharge power-down and precharge standby power; active power-down and active standby power are calculated similarly.

Fig. 2-6 Precharge power-down and standby current [16]

Precharge power-down power

p(PRE_PDN) = IDD2P * VDD * BNK_PRE% * CKE_LO_PRE%

Precharge standby power

p(PRE_STBY) = IDD2F * VDD * BNK_PRE% * (1 – CKE_LO_PRE%)

Active power-down power

p(ACT_PDN) = IDD3P * VDD * (1 – BNK_PRE%) * CKE_LO_ACT%

Active standby power

p(ACT_STBY) = IDD3N * VDD * (1 – BNK_PRE%) * (1 – CKE_LO_ACT%)

Fig. 2-7 Activate current [16]

In Fig. 2-7, each ACTIVE-PRECHARGE command pair consumes the same energy. Thus, activate power can be calculated by dividing the total energy of all ACTIVE-PRECHARGE pairs by the elapsed time.

Activate power

p(ACT) = (IDD0 – IDD3N) * tRC(spec) * VDD / tACT

Fig. 2-8 Write current [16]

Fig. 2-8 shows that IDD4W is required for write data input.

Write power

p(WR) = (IDD4W – IDD3N) * VDD * WR%

Fig. 2-9 Read current with I/O power [16]

In Fig. 2-9, since the DRAM device drives external logic to output read data during a read access, extra I/O power is needed.

Read power

p(RD) = (IDD4R – IDD3N) * VDD * RD%

I/O power

p(DQ) = p(perDQ) * (num_of_DQ + num_of_DQS) * RD%

The last power component is refresh power and its equation is shown below.

Refresh power

p(REF) = (IDD5 – IDD2P) * VDD

So far, all equations use IDD values measured under the operating conditions listed in the data sheet. However, the actual system may apply a VDD and an operating frequency other than those in the data sheet. Thus, the foregoing equations have to be scaled by supply voltage and operating frequency.

P(PRE_PDN) = p(PRE_PDN) * (use VDD)² / (spec VDD)²

P(ACT_PDN) = p(ACT_PDN) * (use VDD)² / (spec VDD)²

P(PRE_STBY) = p(PRE_STBY) * (use freq)² / (spec freq)² * (use VDD)² / (spec VDD)²

P(ACT_STBY) = p(ACT_STBY) * (use freq)² / (spec freq)² * (use VDD)² / (spec VDD)²

P(ACT) = p(ACT) * (use VDD)² / (spec VDD)²

P(WR) = p(WR) * (use freq)² / (spec freq)² * (use VDD)² / (spec VDD)²

P(RD) = p(RD) * (use freq)² / (spec freq)² * (use VDD)² / (spec VDD)²

P(DQ) = p(DQ) * (use freq)² / (spec freq)²

P(REF) = p(REF) * (use VDD)² / (spec VDD)²

Then, sum up each scaled power component to get total power consumption.

P(TOTAL) = P(PRE_PDN) + P(PRE_STBY) + P(ACT_PDN) + P(ACT_STBY) + P(ACT) + P(WR) + P(RD) + P(DQ) + P(REF)
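To make the calculation flow concrete, the following C++ sketch wires the nine power components and the scaling factors together. It is our own illustration rather than code from [16]: the struct and field names simply mirror Tables 2-2 and 2-3, currents are in amps, times in seconds, and frequency scaling is squared as in the equations above.

```cpp
// Hedged sketch of the DDR SDRAM power method described above.
struct IddSpec {       // data-sheet values (Table 2-2 plus test conditions)
    double idd0, idd2p, idd2f, idd3p, idd3n, idd4r, idd4w, idd5; // amps
    double vdd_spec;   // VDD used in the data-sheet measurements (V)
    double freq_spec;  // clock frequency of the measurements (Hz)
    double trc_spec;   // tRC from the data sheet (s)
};

struct Usage {         // system-dependent parameters (Table 2-3)
    double vdd_use, freq_use;                // applied VDD (V), clock (Hz)
    double bnk_pre, cke_lo_pre, cke_lo_act;  // fractions in [0, 1]
    double t_act;                            // avg time between ACTIVEs (s)
    double rd, wr;                           // read/write fractions of time
    int    num_dq, num_dqs;                  // pin counts
    double p_per_dq;                         // output power per pin (W)
};

double ddr_total_power(const IddSpec& s, const Usage& u) {
    const double v = s.vdd_spec;
    // Unscaled components, one per equation above.
    double pre_pdn  = s.idd2p * v * u.bnk_pre * u.cke_lo_pre;
    double pre_stby = s.idd2f * v * u.bnk_pre * (1 - u.cke_lo_pre);
    double act_pdn  = s.idd3p * v * (1 - u.bnk_pre) * u.cke_lo_act;
    double act_stby = s.idd3n * v * (1 - u.bnk_pre) * (1 - u.cke_lo_act);
    double act      = (s.idd0 - s.idd3n) * s.trc_spec * v / u.t_act;
    double wr       = (s.idd4w - s.idd3n) * v * u.wr;
    double rd       = (s.idd4r - s.idd3n) * v * u.rd;
    double dq       = u.p_per_dq * (u.num_dq + u.num_dqs) * u.rd;
    double ref      = (s.idd5 - s.idd2p) * v;
    // Voltage and frequency derating, squared as in the scaled equations.
    double kv = (u.vdd_use / s.vdd_spec) * (u.vdd_use / s.vdd_spec);
    double kf = (u.freq_use / s.freq_spec) * (u.freq_use / s.freq_spec);
    return (pre_pdn + act_pdn + act + ref) * kv        // VDD-scaled only
         + (pre_stby + act_stby + wr + rd) * kv * kf   // VDD- and freq-scaled
         + dq * kf;                                    // freq-scaled only
}
```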

Chapter 3

Multimedia Platform Modeling

In this chapter, the development of the multimedia platform simulator is introduced. Section 3.1 briefly explains why we need a simulator. Section 3.2 presents a generic multimedia platform for modeling. Sections 3.3, 3.4, and 3.5 describe the individual portions of the simulator.

3.1 Introduction

When starting to build our simulation environment, a key problem is how to balance coding time, flexibility, simulation speed, and accuracy of the simulator.

HDL is not a good choice here. First, it is written from a hardware point of view, which means more regularity and less flexibility: coding at the hardware level must obey many constraints, so more coding time is required and parameterization is limited. Second, a hardware implementation considers all signals, whereas we only care about some of them; eliminating the unused parts to further speed up the simulator is preferable.

Is there a solution that provides short coding time, good flexibility, fast simulation speed, and, most importantly, adequate accuracy? SystemC [17] is chosen to construct our simulation environment.

SystemC provides hardware-oriented constructs within the context of C++, as a class library implemented in standard C++. SystemC also provides an interoperable modeling platform which enables the development and exchange of very fast system-level C++ models. Thus, we can use C++ to implement a signal-simplified simulator while keeping cycle accuracy.
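For a flavor of what this looks like, the sketch below is a minimal clocked SystemC module of our own (not thesis code): a single process, sensitive to the rising clock edge, models cycle-accurate behavior with ordinary C++ inside.

```cpp
#include <systemc.h>

SC_MODULE(Counter) {
    sc_in<bool> clk;       // clock input
    sc_out<int> count;     // current count, updated once per cycle

    int value;

    void step() {          // executes on every rising clock edge
        value++;
        count.write(value);
    }

    SC_CTOR(Counter) : value(0) {
        SC_METHOD(step);
        sensitive << clk.pos();
        dont_initialize();
    }
};
```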

3.2 Multimedia Platform

A generic multimedia SoC platform is shown in Fig. 3-1. There are 8 masters and 1 slave connected by the AXI bus interconnect. The 8 masters are CPU, DSP, accelerator, network, video in, video out, audio in, and audio out; the only slave is the memory controller. The CPU, DSP, and accelerator are the main data processing units. Network, video in, video out, audio in, and audio out are bridges to peripherals, handling internal and external data exchange. The memory controller serves the 8 masters accessing data in the off-chip DRAM.

Fig. 3-1 Generic multimedia SoC platform

Fig. 3-2 shows the multimedia platform simulator block diagram. The scenario driver initiates one session of accesses for a master by enabling the corresponding master enable signal. One session of accesses means that the master generates transactions for data accesses according to its access pattern for one iteration. Eight different access patterns are used to model the behaviors of the masters in the generic multimedia SoC platform shown in Fig. 3-1.

Fig. 3-2 Multimedia platform simulator block diagram

All data accesses conform to AXI protocol. However, to ease the development of simulator, we simplify the two AXI address channels, read and write, into one. This simplification does not affect AXI protocol compliance. The AXI network is responsible for channel arbitration with two common arbitration schemes, fixed priority and round-robin.

The memory controller connects to a simplified memory model. The memory model removes unnecessary operations such as refresh and power-down, and simplifies the input/output interface to ease its use.

3.3 Master Modeling

Modeling a master can be thought of as generating transactions that follow its behavior. According to the AXI protocol, a transaction must carry at least four attributes: ID, access type, destination address, and, for writes, the data to write. Here, the methods we use to generate transactions in our multimedia platform simulator are introduced.

3.3.1 ID Generation

Fig. 3-3 ID tag format

Since the multimedia platform is a multi-master platform, master information should be appended to ID tags to ensure their uniqueness.

We use 8-bit ID tags in the simulator; the format is shown in Fig. 3-3. The most significant 3 bits are the master ID and the remaining 5 bits are the transaction ID.

Although transactions of the same master are completed in order in our simulator, making the transaction ID redundant, we still give each transaction a transaction ID to check the functional correctness of the simulator.
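A small C++ sketch of this tag layout (the helper names are ours):

```cpp
#include <cstdint>

// ID tag of Fig. 3-3: bits [7:5] hold the master ID, bits [4:0] the
// transaction ID.
inline uint8_t make_id(uint8_t master_id, uint8_t trans_id) {
    return (uint8_t)(((master_id & 0x07u) << 5) | (trans_id & 0x1Fu));
}
inline uint8_t master_of(uint8_t id) { return id >> 5; }    // upper 3 bits
inline uint8_t trans_of(uint8_t id)  { return id & 0x1Fu; } // lower 5 bits
```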

3.3.2 Type and Address Generation

According to DRAM operating characteristics, transaction type and address affect DRAM access performance the most. Thus, transaction type and address generation is the most important part of master modeling.

An intuitive way to generate transactions is to build a behavioral model for each master. Although this method is the most precise, implementing every master is time-consuming. For efficiency and flexibility, we use a configurable transaction generator instead.

The configurable transaction generator supports three access types and three address types. The three access types are read, write, and no operation. The three address types are 1-D, 2-D, and constraint random.

Fig. 3-4 shows how addresses are generated by the three address types. Fig. 3-4(a) is the 1-D address type which increases the address from base address by a fixed offset. The offset is determined by the size of data transferred in one access. Most masters in the multimedia platform shown in Fig. 3-1 use 1-D address type.

Fig. 3-4(b) is the 2-D address type. Unlike the 1-D address type, each row has a start address and an end address, so the address cannot keep increasing indefinitely: when the end address of a row is reached, the address jumps to the start address of the next row and continues increasing. The boundary between the start and end addresses is usually fixed and preset. The 2-D address type is used in modern block-based video encoding and decoding, such as MPEG-2, MPEG-4, and H.264.

Fig. 3-4(c) is the constraint random address type, which generates addresses randomly within the master mapping space. CPU accesses are always of this address type.


Fig. 3-4 (a) 1-D address type (b) 2-D address type (c) Constraint random address type
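The following C++ sketch condenses the three address types above; the member names and the 2-D bookkeeping (row length, row pitch) are our own assumptions about the preset boundaries.

```cpp
#include <cstdint>
#include <cstdlib>

struct AddrGen {
    uint32_t addr, base, offset;       // offset = bytes moved per access
    uint32_t row_len, row_pitch, col;  // 2-D parameters, assumed preset
    uint32_t map_lo, map_hi;           // master mapping space (map_hi > map_lo)

    uint32_t next_1d() {               // Fig. 3-4(a): fixed stride from base
        uint32_t a = addr;
        addr += offset;
        return a;
    }
    uint32_t next_2d() {               // Fig. 3-4(b): wrap at the end of a row
        uint32_t a = addr;
        if (++col == row_len) {        // end address of this row reached
            col = 0;
            base += row_pitch;         // jump to the start of the next row
            addr = base;
        } else {
            addr += offset;
        }
        return a;
    }
    uint32_t next_rand() {             // Fig. 3-4(c): constrained random
        return map_lo + (uint32_t)(std::rand() % (map_hi - map_lo));
    }
};
```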

Fig. 3-5 A simple transaction generation example

Fig. 3-5 gives a simple example to illustrate how the configurable transaction generator works. The transaction generator is configured by a file in the format shown at the bottom of Fig. 3-5. The first column is the transaction type, which is read, write, or no operation. The second is the number of data to be accessed. The third and fourth are the address type and base address, respectively.

In this example, the transaction generator has three states (the number of states equals the number of rows in the configuration file).

The first state generates 32 data writes with the 1-D address type from base address 0x2000; the number of write transactions depends on the preset burst length. Once a write transaction is generated, the next one cannot be generated until it has been sent out via the AXI network, so this state may take more than 32 cycles. Since this is a write state, every transaction is followed by its write data. Write data generation is described later.

The second state is a NOP state and it doesn’t generate any transaction. Thus, the configurable transaction generator idles for 8 cycles.

The last state is a read state; it works like the write state described above except that it does not generate write data.
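A sketch of how such a configuration might be held in the simulator; the exact file syntax is whatever Fig. 3-5 shows, and the read base address below is an assumed placeholder.

```cpp
#include <cstdint>
#include <vector>

enum class Op   { Read, Write, Nop };
enum class Amap { OneD, TwoD, Rand };

struct State {            // one row of the configuration file
    Op       type;        // column 1: read / write / no operation
    uint32_t count;       // column 2: number of data (or idle cycles for NOP)
    Amap     addr_type;   // column 3: 1-D / 2-D / constraint random
    uint32_t base;        // column 4: base address
};

// The three-state example from the text: write 32, idle 8, then read.
const std::vector<State> cfg = {
    { Op::Write, 32, Amap::OneD, 0x2000 },
    { Op::Nop,    8, Amap::OneD, 0x0000 },
    { Op::Read,  32, Amap::OneD, 0x3000 },  // read base address assumed
};
```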

3.3.3 Write Data Generation

Every time a write transaction is generated, the corresponding write data is also produced. Although the content of the write data has nothing to do with DRAM access performance, we still define a write data format, shown in Fig. 3-6, to ease the check of simulator functional correctness.

As presented in Fig. 3-6, the first field is the ID tag of the write transaction. The second is the literal string "data". The last is the order of the write data within a burst. For example, the write data of a write transaction with ID tag 0x10 and burst length 2 are "0x10data0" and "0x10data1".
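A one-line formatter for this layout (our own helper, matching the example strings above):

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// Builds "<id>data<order>", e.g. id 0x10, order 0 -> "0x10data0".
std::string write_data(uint8_t id, unsigned order) {
    char buf[32];
    std::snprintf(buf, sizeof(buf), "0x%02xdata%u", (unsigned)id, order);
    return std::string(buf);
}
```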

Fig. 3-6 Write data format

3.4 AXI Network

Since the AXI channels are uncorrelated, the AXI network arbitration can be implemented for each channel independently. As for the arbitration schemes, fixed priority and round-robin are used in our simulator.

Fig. 3-7 shows that, when the fixed priority scheme is applied, master 0 always has the highest priority and master 7 always has the lowest.

Fig. 3-8 shows the round-robin scheme. It is composed of eight states, and the state advances each time an arbitration is done. In the first state, master 0 has the highest priority and master 7 the lowest. In the second state, master 1 has the highest priority and master 0 the lowest, and so on. Thus, in the last state, master 7 has the highest priority and master 6 the lowest.

Fig. 3-7 Fixed priority arbitration scheme

Fig. 3-8 Round-robin arbitration scheme
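The two schemes reduce to a few lines of C++. The bitmask interface below is our own framing, with req holding one pending-request bit per master.

```cpp
// Fig. 3-7: master 0 always wins over master 7. Returns -1 if no request.
int fixed_priority(unsigned req) {
    for (int m = 0; m < 8; ++m)
        if (req & (1u << m)) return m;
    return -1;
}

// Fig. 3-8: in state k, master k has the highest priority; the state
// advances by one every time an arbitration is done.
int round_robin(unsigned req, int& state) {
    int granted = -1;
    for (int i = 0; i < 8; ++i) {
        int m = (state + i) % 8;
        if (req & (1u << m)) { granted = m; break; }
    }
    state = (state + 1) % 8;
    return granted;
}
```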

3.5 Memory Controller

Two types of memory controllers are implemented; the difference is whether or not bank-interleaving is supported.

3.5.1 Memory Controller without Bank-interleaving Support

Fig. 3-9 Block diagram of the memory controller without bank-interleaving support

Fig. 3-9 is the block diagram of the memory controller without bank-interleaving support. It uses six processes, each of which is described below.

Every time the bus_get_req receives a transaction ID, type, and address, it stores them in the bus_get_req buffer. Also, when the bus_get_wdata receives one set of transaction ID and write data, it stores them in the bus_get_wdata buffer.

The trans_schedule determines the order in which transactions in the bus_get_req buffer are executed by the bank_ctrl. We implement three different transaction scheduling policies, described in Section 3.5.3.

After the trans_schedule reorders the input transactions, the bank_ctrl translates them into DRAM commands. The translation depends not only on the transaction type and address, but also on the current bank state. Fig. 3-10 shows the three possible cases in the transaction-to-command translation.


Fig. 3-10 Bank state transition and related commands when (a) current bank state is idle (b) current bank state is active with row hit (c) current bank state is active with row miss

In Fig. 3-10(a), the current bank state is idle and an ACTIVE command is issued to open a row for the transaction access.

In Fig. 3-10(b), the current bank state is active and the transaction accesses the opened row, that is, row hit. Thus, a READ or WRITE command can be issued directly without opening a row.

In Fig. 3-10(c), the current bank state is active and the transaction accesses a row other than the opened row, that is, a row miss. Before a READ or WRITE command can be issued, PRECHARGE and ACTIVE commands must be applied to close the current row and open the wanted one.

The row hit case needs the fewest commands to finish a transaction, while the row miss case requires the most. Finishing a transaction with fewer commands means better DRAM access performance and lower DRAM power consumption. The purpose of the trans_schedule transaction scheduling is therefore to increase row hits and reduce row misses.
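The three cases collapse into a short routine; this is a hedged sketch of the bank_ctrl translation with our own names, not the simulator source.

```cpp
#include <cstdint>
#include <string>
#include <vector>

struct Bank { bool active = false; uint32_t open_row = 0; };

// Emits the DRAM command sequence for one transaction touching 'row'.
std::vector<std::string> translate(Bank& b, uint32_t row, bool is_read) {
    std::vector<std::string> cmds;
    if (!b.active) {                    // (a) bank idle: open the row
        cmds.push_back("ACTIVE");
    } else if (b.open_row != row) {     // (c) row miss: close, then open
        cmds.push_back("PRECHARGE");
        cmds.push_back("ACTIVE");
    }                                   // (b) row hit: no extra command
    b.active = true;
    b.open_row = row;
    cmds.push_back(is_read ? "READ" : "WRITE");
    return cmds;
}
```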

When a command is generated, the bank_ctrl puts necessary information on the command and address bus. Then it triggers the memory model by the bank_ena signal.

If the transaction is a write transaction, bank_ctrl gets the corresponding write data via ID matching from the bus_get_wdata buffer and puts it on the data bus after the memory model is triggered with a WRITE command. When the write transaction is finished, bank_ctrl triggers the bus_send_resp process by the send_resp_ena signal to send the master a response.

If the transaction is a read transaction, bank_ctrl gets read data from data bus after the memory model is triggered with a READ command. Then, it utilizes the bus_send_rdata process to send read data to the AXI network.

For simplicity, we assume that masters are always ready to receive the response after issuing a write transaction and read data after issuing a read transaction. Thus, no output buffer is needed.

3.5.2 Memory Controller with Bank-interleaving Support

Within the constraint of shared input and output buses, the four DRAM banks can operate in parallel, which is called bank-interleaving. With bank-interleaving, the waiting latencies of DRAM operations in different banks overlap, yielding lower effective latencies.


Fig. 3-11 Block diagram of the memory controller with bank-interleaving support

Fig. 3-11 is the block diagram of the memory controller with bank-interleaving support. There are four isolated trans_schedule and bank_ctrl pairs generating commands, one per bank. Since each bank_ctrl can only handle commands for its own bank, the cmd_schedule coordinates the overall command issue.

To determine which bank may issue its command, we first calculate the "score" of each bank. The score is computed as a function of the command lasted time (how long the command has been pending) and the command type. The bank with the highest score is then selected for command issue. If two or more banks have equal scores, bank 0 has the highest priority and bank 3 the lowest. Commands that are not issued are retained by their bank_ctrl.
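A sketch of this selection using the LTO scoring of Section 3.5.4 (names are ours); the strict comparison realizes the tie-break, since lower-numbered banks are examined first.

```cpp
struct PendingCmd { bool issuable; int lasted; };  // lasted = cycles waited

// Returns the bank whose command is issued this cycle, or -1 for none.
int pick_bank(const PendingCmd (&cmd)[4]) {
    int best = -1, best_score = 0;
    for (int b = 0; b < 4; ++b) {
        // Zero for NOP / timing-blocked commands; lasted + 1 otherwise (LTO).
        int score = cmd[b].issuable ? cmd[b].lasted + 1 : 0;
        if (score > best_score) {       // '>' keeps bank 0 ahead on ties
            best_score = score;
            best = b;
        }
    }
    return best;
}
```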

We implement three different scoring functions for the cmd_schedule. Each of them is shown in Section 3.5.4.

3.5.3 Transaction Scheduling Policy


Fig. 3-12 Examples of (a) first in first serve (FIFS) policy (b) two level round-robin (TLRR) policy (c) modified first in first serve (MFIFS) policy

A. First in first serve (FIFS)

Fig. 3-12(a) is an example. When the FIFS policy is applied, the output order of transactions is the same as the input order, regardless of which master a transaction comes from.

B. Two level round-robin (TLRR)

When this policy is applied, we separate the masters into two levels, a high priority level and a low priority level. Transactions from masters in the high priority level are output first. Within the same level, the output order follows a round-robin scheme. Of course, transactions from the same master are output in input order.

In our simulator, only master 0 (CPU in the multimedia platform) is set to high priority level. Therefore, in Fig. 3-12(b) the two transactions from master 0 are output first and then transactions from master 1 and master 2 rotate.

For comparison, we implement the TLRR policy, which works well in [9].

C. Modified first in first serve (MFIFS)

Based on the assumption that transaction types from the same master are likely the same and transaction addresses from the same master are likely sequential, we propose this policy.

According to the assumption, bundling transactions from the same master together provides the memory controller more chances to perform successive DRAM read or write operations. However, although successive processing of transactions from one master can increase overall DRAM performance, it also increases latencies of transactions from other masters. To balance the performance improvement and the latency increment, we set a threshold to limit the maximum number of transactions in each successive processing.

We set the threshold to 2 in the example shown in Fig. 3-12(c). First, as in the FIFS policy, the oldest transaction, from master 0, is output. Then the transaction buffer is searched in input order for another transaction from master 0. If one exists, it is the next output; if none exists, the successive processing terminates. In Fig. 3-12(c) there is one, and it is the last of this successive run since the threshold is 2.

After that, the FIFS policy is applied again to find the starting transaction of the next successive run.
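A hedged sketch of one MFIFS successive run over the transaction buffer (the container and names are ours):

```cpp
#include <deque>
#include <vector>

struct Txn { int master; /* type, address, ... */ };

// Pops up to 'threshold' transactions of the oldest entry's master,
// preserving input order; the caller repeats this to schedule the buffer.
std::vector<Txn> next_run(std::deque<Txn>& buf, int threshold) {
    std::vector<Txn> run;
    if (buf.empty()) return run;
    const int m = buf.front().master;   // FIFS picks the run's master
    for (auto it = buf.begin();
         it != buf.end() && (int)run.size() < threshold; ) {
        if (it->master == m) {
            run.push_back(*it);
            it = buf.erase(it);         // bundle same-master transactions
        } else {
            ++it;                       // others wait for later runs
        }
    }
    return run;
}
```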

3.5.4 Scoring Function

A. Lasted time only (LTO)

Before calculating the score, the validity of the command from each bank controller is examined. If the command is a NOP or cannot be issued yet due to DRAM timing constraints, the score is set to zero. Since zero is reserved for this case, the score must be at least one in all other cases. Thus, the scoring function is as below.

score = command_lasted_time + 1

B. Lasted time over type (LTOT)

This scoring function weights the command lasted time over the command type. Thus, commands with longer lasted times always get higher scores; for commands with equal lasted time, the command type decides.

Because each bank operates independently, if properly utilized, latencies between two successive access commands (READ to READ, READ to WRITE, WRITE to
