
CHAPTER 3 MULTIMEDIA PLATFORM MODELING

3.5 Memory Controller

3.5.4 Scoring Function

A. Lasted time only (LTO)

Before calculating the score, the validity of the command from each bank controller is examined. If the command is a NOP or cannot be issued at this time due to the timing constraints of the DRAM, its score is set to zero. Since a score of zero is reserved for such cases, the score must be at least one in all other cases. Thus, the scoring function is as follows.

score = command_lasted_time + 1

B. Lasted time over type (LTOT)

The scoring function above weights the command lasted time more heavily than the command type, so commands with longer lasted time always get higher scores. For commands with equal lasted time, the command type breaks the tie.

Because each bank operates independently, the latencies between two successive access commands (READ to READ, READ to WRITE, WRITE to READ, and WRITE to WRITE) can be hidden if the banks are properly utilized. Since the row-miss case has the longest latency, the PRECHARGE command adds 3 points. With the median latency, the ACTIVE command adds 2 points. The 1 point for READ and WRITE commands prevents a zero score.

C. Lasted time with type (LTWT)

The scoring function is similar to that of LTOT. The only difference is that the factor applied to the command lasted time is reduced to 2. Thus, the PRECHARGE and ACTIVE commands have more chances to be issued earlier.
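The three scoring functions can be sketched as follows. This is a behavioral sketch, not the thesis's implementation: the type weights (PRECHARGE 3, ACTIVE 2, READ/WRITE 1) and the LTWT lasted-time factor of 2 come from the text, while the LTOT lasted-time factor of 4 is an assumed value, chosen only so that the lasted time always outweighs the type weight, as the LTOT description requires.

```python
# Sketch of the three command-scheduler scoring functions (LTO, LTOT, LTWT).
# Type weights come from the text; the LTOT factor of 4 is an assumption.

TYPE_POINTS = {"PRECHARGE": 3, "ACTIVE": 2, "READ": 1, "WRITE": 1}

def score(cmd_type, lasted_time, issuable, policy="LTO"):
    """Return the scheduling score of a candidate DRAM command.

    cmd_type    -- "NOP", "PRECHARGE", "ACTIVE", "READ", or "WRITE"
    lasted_time -- cycles the command has waited in the bank controller
    issuable    -- False if DRAM timing constraints block the command now
    """
    if cmd_type == "NOP" or not issuable:
        return 0                                    # invalid commands score zero
    if policy == "LTO":                             # lasted time only
        return lasted_time + 1                      # +1 keeps valid scores above zero
    if policy == "LTOT":                            # lasted time over type
        return 4 * lasted_time + TYPE_POINTS[cmd_type]   # factor 4 is assumed
    if policy == "LTWT":                            # lasted time with type
        return 2 * lasted_time + TYPE_POINTS[cmd_type]   # factor 2 per the text
    raise ValueError(policy)
```

Under LTWT, for example, a fresh PRECHARGE (score 3) already ties a READ that has waited one cycle (score 3), so row-miss handling tends to start earlier than under LTOT, where the same READ would score 5.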

Chapter 4

Simulation Result and Analysis

This chapter consists of three parts. Section 4.1 introduces several scenarios that can be applied to the simulator. Section 4.2 describes the settings of the video phone scenario used in our simulation. In Section 4.3, the simulation results and analysis are presented.

4.1 Introduction

Scenario        Description              Master / Task
Movie Playback  Play a movie from        CPU / OS
                the storage              DSP & Accelerator / Audio & video decoding
                                         Audio Out / Output sound
                                         Video Out / Display video
DTV Service     Play DTV programs from   CPU / OS
                digital broadcast        DSP & Accelerator / Audio & video decoding
                                         Network / Radio broadcast bitstream
                                         Audio Out / Output sound
                                         Video Out / Display video
Video Phone     Video phone              CPU / Audio codec & OS
                                         DSP / Video decoding
                                         Accelerator / Video encoding
                                         Network / Communication bitstream
                                         Audio In / Voice capture
                                         Audio Out / Output voice
                                         Video In / Video capture
                                         Video Out / Display video

Table 4-1 Examples of scenarios which can be applied to the multimedia platform simulator

Since the multimedia platform simulator is configurable through access pattern files, it can easily be modified to perform many different scenarios. According to the simulation results, the DRAM performance and power consumption achieved with the memory controller are evaluated. The evaluation results can then be taken into consideration for the hardware implementation.

Table 4-1 lists examples of scenarios which can be applied to the multimedia platform simulator. We use the video phone scenario in our simulation because it utilizes all masters and is the most complicated case.

4.2 Simulation Environment Setting

The detailed settings of the video phone scenario are listed below. Table 4-2 shows the task allocation to each master and the access patterns generated according to the task allocation. Table 4-3 lists the bandwidth requirement and timing constraint of each master based on Table 4-2. The memory mapping is shown in Fig. 4-1.

As for other important settings of the simulator, we use a 32-bit AXI data bus and a 16-bit memory data bus. The timing and power parameters of the memory model are the same as those of Micron's MT46V8M16 DDR SDRAM [18]. Both the memory controller and the DRAM operate at a 200 MHz clock rate.

Master       Task                    Access Pattern
CPU          Audio codec; OS         Read bitstream and PCM data
                                     Write bitstream and PCM data
                                     Random reads and writes for OS
DSP          Video decoding;         Read reference macroblock
             miscellaneous routine   Write reconstructed macroblock (YUV)
                                     Write reconstructed macroblock (RGB)
                                     Random reads and writes
Accelerator  Video encoding          Read reference macroblock
                                     Write reconstructed macroblock (YUV)
                                     Write reconstructed macroblock (RGB)
                                     Random reads and writes
Network      Tx/Rx bitstream         Read bitstream; write bitstream
Audio In     Audio input             Write PCM data
Audio Out    Audio output            Read PCM data
Video In     Video input             Write captured video (RGB)
Video Out    Video output            Read display video (RGB)

Table 4-2 Tasks and the corresponding access patterns of the video phone scenario

Master   Bandwidth Requirement   Timing Constraint
CPU      16.14 MB/sec            24 ms

Table 4-3 Bandwidth requirement and timing constraint of the video phone scenario

Fig. 4-1 Memory mapping of the video phone scenario

4.3 Simulation Result and Analysis

4.3.1 DRAM Bandwidth Utilization First

Instead of trying all possible combinations of each parameter, we use a simpler method to develop the memory controller.

First, we determine the data burst length, since it affects the simulation result the most. Based on the chosen burst length, scoring functions for the command scheduler are evaluated. After that, we use the selected burst length and scoring function to find a proper transaction scheduling policy.

During the memory controller development process, both AXI network arbitration schemes are applied to observe the effect of the bus arbitration scheme.

A. Simulation 1 – Choose a proper data burst length

Buffer size                     8 entries
Data burst length               2, 4, 8
Bank-interleaving support       Yes / No
Scoring function                LTO
Transaction scheduling policy   FIFS

Table 4-4 Configuration of the simulator in simulation 1

In simulation 1, different data burst lengths are applied to the memory controller with/without bank-interleaving support. Table 4-4 lists the configuration of the simulator.

Data burst length           2          4          8
Without bank-interleaving   Violated   Violated   Violated
With bank-interleaving      Violated   Met        Met

Table 4-5 Timing constraint status when the fixed priority bus arbitration scheme is applied

Fig. 4-2 Average DRAM bandwidth utilization with fixed priority bus arbitration scheme

Table 4-5 tells whether the timing constraints are met with each configuration when the fixed priority bus arbitration scheme is applied. Except for burst length 2, the timing constraints are met with bank-interleaving support.

Fig. 4-2 shows the average DRAM bandwidth utilization with the fixed priority bus arbitration scheme. Without bank-interleaving support, burst length 2 gets the lowest bandwidth utilization (26.5%) while burst length 8 gets the highest (41.6%). With bank-interleaving support, burst length 2 still gets the lowest bandwidth utilization (27.1%) while burst length 8 remains the highest (59.3%). Bank-interleaving improves the average bandwidth utilization by 2.1%, 55.9%, and 42.8% for burst lengths 2, 4, and 8, respectively.

Data burst length           2          4          8
Without bank-interleaving   Violated   Violated   Met
With bank-interleaving      Violated   Met        Met

Table 4-6 Timing constraint status when the round-robin bus arbitration scheme is applied

Fig. 4-3 Average DRAM bandwidth utilization with round-robin bus arbitration scheme

Table 4-6 tells whether the timing constraints are met with each configuration when the round-robin bus arbitration scheme is applied. Except for burst length 2, the timing constraints are met with bank-interleaving support.

Fig. 4-3 shows the average DRAM bandwidth utilization with the round-robin bus arbitration scheme. Without bank-interleaving support, burst length 2 gets the lowest bandwidth utilization (16.5%) while burst length 8 gets the highest (46.8%). With bank-interleaving support, burst length 2 still gets the lowest bandwidth utilization (20.0%) while burst length 8 remains the highest (67.0%). Bank-interleaving improves the average bandwidth utilization by 21.5%, 60.3%, and 43.1% for burst lengths 2, 4, and 8, respectively.

Note that when the burst length is 2, the average bandwidth utilization with round-robin is 25% to 35% lower than that with fixed priority. When the burst length is 4, the average bandwidth utilizations with both bus arbitration schemes are almost the same. When the burst length is 8, the average bandwidth utilization with round-robin is 10% to 15% higher than that with fixed priority. Since transactions from the same master are not closely bundled together in time, small intervals exist between two successive transactions. Therefore, the fixed priority bus arbitration scheme often causes transactions from two masters to alternate. If the two masters are mapped to the same bank, row misses occur repeatedly. If the two masters generate transactions of different types in a period, read-write turnaround happens over and over in that period. These two effects diminish the bandwidth utilization improvement.

Regardless of which bus arbitration scheme is applied and whether bank-interleaving is supported, burst length 8 always achieves the highest bandwidth utilization. Hence, we set the data burst length to 8 in the following simulations.

B. Simulation 2 – Choose a proper scoring function

Buffer size                     8 entries
Data burst length               8
Bank-interleaving support       Yes
Scoring function                LTO, LTOT, LTWT
Transaction scheduling policy   FIFS

Table 4-7 Configuration of the simulator in simulation 2

Different scoring functions are applied in simulation 2 and Table 4-7 lists the configuration of the simulator.

Fig. 4-4 Average DRAM bandwidth utilization with different scoring functions

Fig. 4-4 shows the average DRAM bandwidth utilization with different scoring functions. Both LTOT and LTWT work well when fixed priority is applied; the improvements are 17.7% and 17.4%, respectively. However, LTOT and LTWT perform poorly when round-robin is applied; the deteriorations are 0.4% and 0.8%, respectively.

Since the fixed priority bus arbitration scheme provides less parallelism for bank-interleaving, hiding the row-miss latency with LTOT or LTWT improves the bandwidth utilization considerably. On the contrary, round-robin already provides sufficient parallelism for bank-interleaving, so LTOT and LTWT bring little additional benefit.

Because LTOT performs slightly better, we set the scoring function to LTOT.

C. Simulation 3 – Choose a proper transaction scheduling policy

Buffer size                     8 entries
Data burst length               8
Bank-interleaving support       Yes
Scoring function                LTOT
Transaction scheduling policy   FIFS, TLRR, MFIFS

Table 4-8 Configuration of the simulator in simulation 3

Different transaction scheduling policies are evaluated in simulation 3 and Table 4-8 lists the configuration of the simulator.

Fig. 4-5 shows the average DRAM bandwidth utilization with different transaction scheduling policies. When the bus arbitration scheme is fixed priority, TLRR is 8.7% worse and MFIFS is 2.8% better than FIFS. When the bus arbitration scheme is round-robin, TLRR is 7.8% worse and MFIFS is 6.1% better than FIFS.

Since TLRR is designed for a multimedia platform with dedicated channels to the masters, it cannot work well with the limited information available through a single on-chip bus and finite buffer size.

We therefore choose MFIFS as the final transaction scheduling policy. However, the performance of MFIFS may differ with different buffer sizes and thresholds, so an extra simulation is performed.

Fig. 4-5 Average DRAM bandwidth utilization with different transaction scheduling policies

D. Simulation 4 – Choose a proper buffer size and threshold for MFIFS

Buffer size                     4, 8, 12, 16 entries
Data burst length               8
Bank-interleaving support       Yes
Scoring function                LTOT
Transaction scheduling policy   MFIFS
MFIFS threshold                 2, 3, 4

Table 4-9 Configuration of the simulator in simulation 4

In simulation 4, different buffer sizes and thresholds are tested. Table 4-9 lists the configuration of the simulator.

Fig. 4-6 Average DRAM bandwidth utilization with different buffer sizes and thresholds

Fig. 4-6 shows the average DRAM bandwidth utilization with different buffer sizes and thresholds. A buffer size of 4 bounds the bandwidth utilization because the memory controller cannot obtain sufficient information. At buffer size 12 and beyond, the two bus arbitration schemes barely affect the bandwidth utilization.

Fig. 4-7 and Fig. 4-8 present the average transaction latency and the DRAM power consumption, respectively. In Fig. 4-7, a larger buffer size with the same transaction processing ability leads to longer latency. However, a larger threshold does not necessarily increase the average transaction latency, since it may slightly increase the latency of the other masters while significantly decreasing the latency of one master.

Based on Fig. 4-6, and taking Fig. 4-7 and Fig. 4-8 as references, buffer size 12 and threshold 4 are chosen.

Fig. 4-7 Average transaction latency with different buffer sizes and thresholds

Fig. 4-8 Average DRAM power consumption with different buffer sizes and thresholds

E. Summary

Fig. 4-9 Average DRAM bandwidth utilization transition through the simulations

Fig. 4-10 Average transaction latency transition through the simulations

Fig. 4-11 Average DRAM power consumption transition through the simulations

Fig. 4-9, Fig. 4-10, and Fig. 4-11 show the average DRAM bandwidth utilization, transaction latency, and power consumption transitions through the simulations.

In Fig. 4-10, when bank-interleaving is supported, the average transaction latency is reduced by 53.6% with the fixed priority bus arbitration scheme and by 33.9% with round-robin. The significant reduction comes from bank-interleaving efficiently hiding the DRAM operation latencies.

According to Fig. 4-9 and Fig. 4-11, when the bus arbitration scheme is fixed priority, the bandwidth utilization is improved by 72.8% with 36.1% more power consumption. When the bus arbitration scheme is round-robin, the bandwidth utilization is improved by 53.3% with 11.9% more power consumption.

Note that MFIFS with the buffer size and threshold modification slightly increases the bandwidth utilization while decreasing the power consumption by up to 13%.

4.3.2 DRAM Power Consumption First

Since the video phone scenario does not require up to 71.8% bandwidth utilization to finish all tasks, reducing the bandwidth utilization to achieve lower power consumption is more favorable in an embedded system. To reduce power, we analyze the DRAM power consumption.

The DRAM power components listed in Section 2.2.3 can be divided into the background power, activate power, and read/write power.

The background power consists of the precharge power-down power, precharge standby power, active power-down power, active standby power, and refresh power.

In our simulation environment, the DRAM is always in the active standby state. Thus, the effective background power is the sum of the active standby power and the refresh power, and it is fixed at all times.

The activate power is determined by the number of total ACTIVE commands and the task execution time. Thus, fewer ACTIVE commands or longer task execution time can lower activate power.

The read/write power is composed of the read power, write power, and I/O power. The read power and I/O power are determined by the total number of data reads and the task execution time; likewise, the write power is determined by the total number of data writes and the task execution time. To lower the read/write power, one can reduce the read or write data count or stretch the task execution time. However, with the same access pattern, the number of data reads or writes is determined by the data burst length. Since a memory controller with a shorter data burst length reads or writes less extra data, it consumes less power.

According to the above analysis, to reduce the DRAM power consumption, we should reduce the number of ACTIVE commands and the data burst length, and stretch the task execution time within the timing constraints.
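The decomposition above can be sketched numerically. The structure (a fixed background term plus activate and read/write terms averaged over the task execution time) follows the analysis; all numeric coefficients below are placeholder assumptions, not values from the Micron MT46V8M16 datasheet.

```python
# Sketch of the DRAM average-power decomposition: background + activate + read/write.
# All default coefficients are illustrative placeholders, not datasheet values.

def average_dram_power(n_activate, n_read_words, n_write_words, exec_time_s,
                       p_background_mw=100.0,  # active standby + refresh (fixed)
                       e_activate_nj=5.0,      # assumed energy per ACTIVE command
                       e_read_nj=2.0,          # assumed energy per word read (incl. I/O)
                       e_write_nj=1.5):        # assumed energy per word written
    """Average DRAM power in mW; energies in nJ, execution time in seconds."""
    # Activate power: total ACTIVE energy spread over the execution time.
    activate_mw = n_activate * e_activate_nj * 1e-9 / exec_time_s * 1e3
    # Read/write power: total data-transfer energy spread over the execution time.
    rw_mw = ((n_read_words * e_read_nj + n_write_words * e_write_nj)
             * 1e-9 / exec_time_s * 1e3)
    return p_background_mw + activate_mw + rw_mw
```

With these placeholder numbers, doubling the execution time leaves the background term untouched while halving the activate and read/write terms, which is exactly why fewer ACTIVE commands, shorter bursts (fewer extra words transferred), and a stretched execution time reduce the total power.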

Taking Table 4-5, Table 4-6, Fig. 4-9, and Fig. 4-11 into consideration, we choose the memory controller with data burst length 4, the MFIFS transaction scheduling policy, and without bank-interleaving support. Though it may consume less power with data burst length 2, the probability of a timing violation is also larger. Taking Fig. 4-7 as a reference, we set all buffer sizes to 4 and the MFIFS threshold to 2 to suppress any increase in bandwidth utilization, which stretches the task execution time.

                                     Fixed priority   Round-robin
Average DRAM bandwidth utilization   42.9%            36.8%
Average DRAM power consumption       356.9 mW         423.4 mW

Table 4-10 Average DRAM bandwidth utilization and power consumption with different bus arbitration schemes

Table 4-10 lists the simulation result and there is no timing violation. Compared with the result in Section 4.3.1, the average DRAM power consumption is reduced by 26.3% with fixed priority bus arbitration scheme and 13.2% with round-robin.

Fig. 4-12 Average DRAM power consumption with different optimization policies

Fig. 4-12 shows each power component of the average DRAM power consumption under the bandwidth-utilization-first and power-consumption-first policies. The background power is always fixed and accounts for about 30% of the total power consumption.

When the fixed priority bus arbitration scheme is applied with the power-consumption-first policy, the activate and read/write power are reduced by 24.4% and 40.0%, respectively. When the round-robin bus arbitration scheme is applied with the power-consumption-first policy, although the activate power increases by 116.2%, the 48.5% reduction of the read/write power still lowers the total power consumption.

This result is not optimal. However, it is hard to find an optimal configuration without exhaustive simulations, since the related factors are not independent and affect one another.

Chapter 5

Hardware Implementation

There are two sections in this chapter. Section 5.1 describes the hardware design of the memory controller. In Section 5.2, the implementation result is shown.

5.1 Hardware Design

Fig. 5-1 Hardware block diagram of the memory controller

In Section 4.3.1, we developed a high-performance memory controller architecture. In Section 4.3.2, a lower-power but lower-performance architecture, obtained by reducing the applied techniques, was presented. In real applications, the latter should be implemented, since it completes all tasks at less cost. However, because we want to know the hardware cost when all techniques are applied, the former is implemented.

Fig. 5-1 shows the hardware block diagram of the memory controller. The memory controller consists of five parts which are AXI interface, input and output buffer, transaction scheduler, command translator, and command controller.

The AXI interface handles VALID/READY channel handshaking. In addition, the AXI interface combines write and read address channels together for the single input address buffer with round-robin scheme.

The input and output buffer is composed of two input buffers and two output queues, each with 12 entries.

The transaction scheduler reorders input transactions according to the MFIFS transaction scheduling policy. We use two components to implement it. The transaction reorder unit records the IDs of input transactions and reorders them. The transaction issue unit fetches the corresponding transaction address and data for the command translator according to the ID output by the transaction reorder unit.

Fig. 5-2 Block diagram of transaction reorder unit

Fig. 5-2 shows the block diagram of the transaction reorder unit. The ID issue unit receives transaction IDs and addresses from the AXI interface and routes each ID to the corresponding ID buffer according to its bank address. All ID buffers contain 12 entries. The threshold counter provides the successive access count to the ID selector for the output decision.
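The structure above can be sketched behaviorally. The per-bank ID buffers and the threshold counter follow the text; the exact decision rule of the ID selector is an assumption here (stay on the current bank until the threshold of successive accesses is reached, then fall back to the bank holding the oldest pending transaction), since the MFIFS policy itself is defined in Chapter 3 and not reproduced in this section.

```python
from collections import deque

# Behavioral sketch of the transaction reorder unit: an ID issue unit routing
# IDs into per-bank ID buffers, and an ID selector bounded by a threshold
# counter. The selection rule is an assumed interpretation of MFIFS.

NUM_BANKS = 4

class TransactionReorderUnit:
    def __init__(self, threshold=4, depth=12):
        self.threshold = threshold
        self.depth = depth                              # 12 entries per ID buffer
        self.id_buffers = [deque() for _ in range(NUM_BANKS)]
        self.arrival = 0                                # global arrival-order stamp
        self.current_bank = None
        self.count = 0                                  # successive same-bank issues

    def push(self, txn_id, bank):
        """ID issue unit: route an incoming ID to its bank's ID buffer."""
        if len(self.id_buffers[bank]) >= self.depth:
            raise RuntimeError("ID buffer full")
        self.id_buffers[bank].append((self.arrival, txn_id))
        self.arrival += 1

    def pop(self):
        """ID selector: keep serving the current bank while under the
        threshold, otherwise pick the bank holding the oldest transaction."""
        pending = [b for b in range(NUM_BANKS) if self.id_buffers[b]]
        if not pending:
            return None
        if self.current_bank in pending and self.count < self.threshold:
            bank = self.current_bank
        else:
            bank = min(pending, key=lambda b: self.id_buffers[b][0][0])
            self.current_bank, self.count = bank, 0
        self.count += 1
        return self.id_buffers[bank].popleft()[1]
```

With threshold 2, for example, at most two successive same-bank IDs are emitted before the selector reverts to first-in-first-served order, which matches the observation in simulation 4 that a larger threshold favors one master's latency at a slight cost to the others.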

The command translator consists of four bank controllers which behave as stated in Section 3.5.1 for bank-interleaving support.

There are three states in the command controller: one for DRAM power-up initialization, one for the SELF REFRESH power-down mode, and one for normal DRAM operations. During normal operation, the scoring function in the command scheduler determines which of the commands and addresses provided by the command translator is issued. The read and write unit translates data between single data rate and double data rate.

5.2 Implementation Result

Design                              Proposed   Kun-Bin Lee's   ARM PL340
Clock Rate                          166 MHz    100 MHz         166 MHz
Gate Count  Transaction Scheduler   6688       12003           N/A
            Command Translator      4257       N/A             N/A
            Command Controller      11331      5362            N/A
            Total                   47590      17365           About 60K

Table 5-1 Implementation result and comparison

Table 5-1 lists the implementation result and the comparison with other designs. It is obvious that in the proposed design, over 50% of the gate count is used for data buffering.

