Organization - An Adjustable Two-Layer External Memory Manager for H.264 Decoders

Chapter 1 Introduction

1.3 Organization

This thesis is organized as follows. Chapter 2 gives an overview of external memory organization. Besides the DRAM architecture and its basic operation, previous DRAM controllers are described. DRAM development and trends are also discussed there.

My memory controller is applied in an H.264 decoder, which is designed as a transaction level model (TLM). The concept of TLM and the TLM model of our H.264 decoder are explained thoroughly in Chapter 3, together with the role of my controller in this decoder. The memory controller has two layers: Layer 0 controls the memory, and Layer 1 improves the performance of video processing.

Chapter 4 presents Layer 0 of my memory controller, which is called the external memory interface (EMI). The system communicates with external memory through this EMI. The flexibility of the EMI helps users integrate it into their designs easily. Different EMI settings influence the performance, and the resulting improvements in latency and power are compared here. The experimental results for the different settings are also listed in this chapter.

The whole memory subsystem and Layer 1 of my memory controller, which is called the address translation machine, are proposed in Chapter 5. The H.264 decoder needs a large amount of data transfer, so the efficiency of the memory controller is very important. To deal with the bandwidth loss that degrades performance, Layer 1 is combined into my memory controller. We propose a memory mapping method that exploits the characteristics of the H.264 process, and Layer 1 performs the address translation. This address translation machine not only reduces the power consumption on the bus lines but also increases the bandwidth utilization. Finally, the conclusions and future work are discussed in Chapter 6.

Chapter 2 Overview of External Memory Organization

This chapter gives an overview of external memory organization for video processing. First, the DRAM characteristics are described in Section 2.1. Then, Section 2.2 discusses the techniques that have been proposed for memory controllers in the past. In addition, the design trends of modern DRAM are presented in Section 2.3.

2.1 DRAM characteristic

2.1.1 DRAM architecture

A DRAM is usually composed of data memories, address decoders, a row buffer, a mode register, and a data buffer. Fig. 2.1 shows a simplified block diagram.

Fig. 2.1 Simplified architecture of a DRAM.

Four banks share the address bus and the command bus. Each bank has its own row decoder, column decoder, and sense amplifier. The mode register stores the DRAM operation mode, including the burst length (BL), the column address strobe latency (CL), and the burst type. Users can set the value of the mode register through the address bus with the proper command.
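As an illustration of how these fields can be packed into one register value, the following sketch encodes a mode register word. The bit layout (burst length in A2-A0, burst type in A3, CAS latency in A6-A4) follows the common SDR SDRAM convention and is an assumption here, since the layout is not given in this thesis. The function name encode_mode_register is hypothetical, and the actual encoding should be taken from the data sheet of the device in use.

```python
# Assumed field layout (common SDR SDRAM convention, not taken from this thesis):
#   A2-A0: burst length, A3: burst type, A6-A4: CAS latency.
BURST_LENGTH_CODES = {1: 0b000, 2: 0b001, 4: 0b010, 8: 0b011}
BURST_TYPE_CODES = {"sequential": 0, "interleaved": 1}
CAS_LATENCY_CODES = {2: 0b010, 3: 0b011}


def encode_mode_register(burst_length, cas_latency, burst_type="sequential"):
    """Return the value driven on the address bus when the mode register is set."""
    bl = BURST_LENGTH_CODES[burst_length]
    bt = BURST_TYPE_CODES[burst_type]
    cl = CAS_LATENCY_CODES[cas_latency]
    return (cl << 4) | (bt << 3) | bl


if __name__ == "__main__":
    value = encode_mode_register(burst_length=4, cas_latency=3)
    print(f"mode register = 0b{value:07b}")
```

Unsupported settings raise a KeyError, which keeps an invalid combination from silently reaching the device.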

2.1.2 DRAM command and operation

The common DRAM commands and their operations are introduced as follows.

NO OPERATION (NOP):

The NOP command can prevent unwanted commands from being registered during idle or wait states. Operations already in progress are not affected.

ACTIVE:

This command is used to open a row in a particular bank. The row remains open for accesses until a PRECHARGE command is issued to that bank.

READ/WRITE:

The read/write command is used to initiate a read/write access to an active row. If auto precharge is selected, the row being accessed will be closed at the end of the access.

PRECHARGE:

The precharge command is used to deactivate the open row in a particular bank. The bank will be available for a subsequent row access a specified time (tRP) after the precharge command is issued.

REFRESH:

The refresh command can be used to retain data in the DRAM.

A memory access operation, whose simplified state diagram is depicted in Fig. 2.2, contains three operations: row activation (ACTIVE), column access (READ/WRITE), and precharge (PRECHARGE).

Fig. 2.2 Bank state diagram.

The active command opens a particular row in one of the banks and copies the row data into the row buffer. The active command needs a latency period called tRCD to accomplish this operation. After the tRCD delay, a column access command (read/write) can be issued to access sequential data or a single datum, according to the burst length and burst type set in the mode register. During the tRCD time, no other commands can be issued to the bank. However, commands to other banks are permissible because each bank can operate in parallel. For a read operation, the valid data-out from the starting column address becomes available after the CAS latency following the read command, as shown in Fig. 2.3. For a write command in SDRAM, the first data-in is coincident with the write command, as shown in Fig. 2.4.

For a write command in DDR DRAM, the first data-in is sent to DQ [1] strobe after a DQS [1] delay, as shown in Fig. 2.5. Finally, a precharge command must be issued before opening a different row in the same bank.

Fig. 2.3 Read command CAS latency.

Fig. 2.4 Write command (SDRAM).

Fig. 2.5 Write command (DDR).
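The access sequence described above (activate, column access, precharge) can be sketched as a small bank model that returns the commands needed for a read and the cycles until the first data appears. The class name Bank and the timing values of 3 cycles for tRCD, CL, and tRP are hypothetical and only serve as an example; real values depend on the device and clock frequency.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Bank:
    """Minimal single-bank model; timing values are illustrative assumptions."""
    t_rcd: int = 3        # ACTIVE to READ/WRITE delay, in cycles
    cas_latency: int = 3  # READ to first data-out, in cycles
    t_rp: int = 3         # PRECHARGE to next ACTIVE, in cycles
    open_row: Optional[int] = None

    def read(self, row):
        """Return (commands, cycles until first data) for a read to `row`."""
        commands = []
        cycles = 0
        if self.open_row is not None and self.open_row != row:
            # Row miss: the open row must be closed first.
            commands.append("PRECHARGE")
            cycles += self.t_rp
            self.open_row = None
        if self.open_row is None:
            # Bank idle: open the requested row.
            commands.append("ACTIVE")
            cycles += self.t_rcd
            self.open_row = row
        commands.append("READ")
        cycles += self.cas_latency
        return commands, cycles


if __name__ == "__main__":
    bank = Bank()
    for row in (5, 5, 7):
        print(row, bank.read(row))
```

With these assumed values, a read to an idle bank costs tRCD + CL, a row hit costs only CL, and a row miss costs tRP + tRCD + CL, which is why the techniques in the next section try to raise the row-hit rate.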

2.2 Past DRAM controller techniques

In the design of systems-on-a-chip (SoC) such as portable wireless devices and multimedia systems, several factors, such as increased system complexity, time-to-market pressure, cost effectiveness, and various functionality requirements, have made the trend toward SoC design indispensable. In general, SoC devices are connected to off-chip memory that stores data to be transferred between functional blocks. As an SoC integrates more functional blocks and needs higher performance, high data bandwidth is required to meet a given system specification.

Similarly, because of the tremendous data transfer and storage in video processing, software or hardware must provide high data bandwidth to meet the real-time requirements of multimedia systems. External memory has the largest data capacity, so it is often used as the frame memory in video processing. Nevertheless, data transfer to off-chip memory is bound by the limited bandwidth.

Many external memory controllers have been proposed to improve the memory bandwidth utilization and achieve efficient memory access. In this section, some important techniques used in memory controller will be introduced.

2.2.1 Techniques and Improvement

According to the environment, the controllers can be categorized into two classes: single-channel and multiple-channel environments. For the single-channel environment, Rixner's memory access scheduler [2] reorders the access addresses from each bank controller and sends commands to the DRAM through an arbiter. Reordering the addresses reduces the latency. However, because the output commands may be out of order, many command FIFOs and extra circuits are required to reorder commands and addresses. Miura's dynamic-SDRAM-mode-control scheme [3] eliminates this disadvantage and reduces both the operating current and the latency of an SDRAM. Nevertheless, it only supports single-channel scheduling. For the multi-channel environment, Lee proposes a quality-aware memory controller [4], which supports different scheduling policies according to the current channel situation. These memory controllers mainly focus on general-purpose environments.
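To illustrate why reordering accesses can reduce latency, the following sketch applies a simplified row-hit-first policy to a queue of (bank, row) requests. This is only an assumed illustration of the general idea and not the exact scheduler of [2]; the function names are hypothetical.

```python
def schedule_row_hit_first(requests, open_rows=None):
    """Reorder (bank, row) requests so that row hits are served first.

    At each step the oldest request that hits an open row is chosen;
    if there is none, the oldest pending request is chosen.
    """
    open_rows = dict(open_rows or {})
    pending = list(requests)
    order = []
    while pending:
        pick = next((r for r in pending if open_rows.get(r[0]) == r[1]), pending[0])
        pending.remove(pick)
        open_rows[pick[0]] = pick[1]
        order.append(pick)
    return order


def count_row_misses(order, open_rows=None):
    """Count accesses that need a new row activation when served in `order`."""
    open_rows = dict(open_rows or {})
    misses = 0
    for bank, row in order:
        if open_rows.get(bank) != row:
            misses += 1
            open_rows[bank] = row
    return misses


if __name__ == "__main__":
    requests = [(0, 1), (0, 2), (0, 1), (0, 2)]
    reordered = schedule_row_hit_first(requests)
    print("in order :", count_row_misses(requests), "row misses")
    print("reordered:", count_row_misses(reordered), "row misses")
```

The results returned to the requesters are then out of order, which is the source of the extra FIFOs and reordering circuits mentioned above.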

Concerning special-purpose memory controllers for video codec applications, several papers have proposed performance improvements. They focus on power consumption, latency (speed), and bandwidth utilization.

Kim's memory interface architecture [5] reorganizes the data arrangement in SDRAM. The interface reduces energy consumption and increases memory bandwidth by placing the data of the same block in the same row. Thus, when a block is accessed, the row-hit rate increases, and the time and power wasted on precharging rows decrease. Fig. 2.6 (a) shows the original placement, and Fig. 2.6 (b) shows the new placement after data arrangement.

Fig. 2.6 Data placement
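The effect of such a data arrangement can be sketched by counting row activations when one block is read under two address mappings. The frame width of 1024 pixels, the row size of 1024 bytes, the 16x16 block, and the specific mapping functions are assumptions chosen for illustration; the exact mapping of [5] is not reproduced here.

```python
WIDTH = 1024     # assumed frame width in pixels (one byte per pixel)
ROW_SIZE = 1024  # assumed DRAM row size in bytes
BLOCK = 16       # assumed block size (16x16 pixels)


def raster_address(x, y, width=WIDTH):
    """Line-by-line placement: consecutive lines lie one frame width apart."""
    return y * width + x


def block_address(x, y, width=WIDTH, block=BLOCK):
    """Block-based placement: all pixels of one block are stored contiguously."""
    blocks_per_line = width // block
    block_index = (y // block) * blocks_per_line + (x // block)
    return block_index * block * block + (y % block) * block + (x % block)


def row_activations(addresses, row_size=ROW_SIZE):
    """Count how often the accessed DRAM row changes (single-bank view)."""
    activations = 0
    open_row = None
    for addr in addresses:
        row = addr // row_size
        if row != open_row:
            activations += 1
            open_row = row
    return activations


def block_pixels(x0, y0, block=BLOCK):
    """Pixel coordinates of one block, visited line by line."""
    return [(x, y) for y in range(y0, y0 + block) for x in range(x0, x0 + block)]


if __name__ == "__main__":
    pixels = block_pixels(32, 16)
    print("raster:", row_activations([raster_address(x, y) for x, y in pixels]))
    print("block :", row_activations([block_address(x, y) for x, y in pixels]))
```

Under these assumptions every line of the block falls in a different DRAM row with the raster placement, while the block-based placement keeps the whole block inside one row.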

Park's history-based memory mode control [6] reduces the row-miss rate to achieve a 23.3 % reduction in energy consumption and an 18.8 % reduction in memory latency. It uses history-based prediction to predict whether the next command will be a row hit or a row miss. The prediction is implemented by a finite state machine, as shown in Fig. 2.7. The controller keeps the current row in the active state if it predicts that the next command is a row hit. The proposed memory controller is shown in Fig. 2.8.

Fig. 2.7 Mode control prediction

Fig. 2.8 Mode control memory controller
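A history-based predictor of this kind can be sketched with a small state machine. The state machine of Fig. 2.7 is not reproduced in this text, so the 2-bit saturating counter below is an assumed stand-in, and the names RowHitPredictor and choose_policy are hypothetical.

```python
class RowHitPredictor:
    """2-bit saturating counter: states 0-1 predict a miss, states 2-3 a hit.

    This is an assumed stand-in for the FSM of [6], not its actual design.
    """

    def __init__(self, state=2):
        self.state = state

    def predict_hit(self):
        return self.state >= 2

    def update(self, was_hit):
        """Move one step toward 3 on a row hit and one step toward 0 on a miss."""
        if was_hit:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)


def choose_policy(predictor):
    """Keep the row open when a hit is predicted, otherwise close it early."""
    return "keep row open" if predictor.predict_hit() else "close row"


if __name__ == "__main__":
    predictor = RowHitPredictor()
    for was_hit in (False, False, True, True):
        predictor.update(was_hit)
        print(was_hit, predictor.state, choose_policy(predictor))
```

Closing the row early after a predicted miss hides the precharge time, while keeping it open after a predicted hit avoids a needless activation.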

Zhu's SDRAM controller for an H.264 HDTV decoder [7] focuses on memory mapping and data arrangement in SDRAM to reduce the row-active cycles; it also improves throughput and lowers power consumption. However, it adopts auto precharge rather than manual precharge, which leads to some loss of bus bandwidth and increases the access latency.

Heithecker proposed a mixed QoS SDRAM controller [8]. The controller uses two types of schedulers for memory access. It improves the memory bandwidth and latency for multiple access streams with different access sequence types running in parallel. Fig. 2.9 shows the overall controller structure with the two scheduler variants.

Fig. 2.9 2-stage scheduler memory controller

Kang proposed a memory controller for the MPEG-4 AVC/H.264 decoder [9]. It uses a dual memory controller and a dual bus to improve the memory bandwidth.

The memory control techniques above concentrate on three areas: the memory scheduler, mode control, and data arrangement in SDRAM. The related controllers and their techniques are summarized in Table 2.1.

Besides the techniques mentioned above, other techniques, such as frame recompression and frame memory reorganization, can also be applied to video processing. Frame recompression compresses data before storing it in memory and decompresses it when it is read back, as shown in Fig. 2.10.

Fig. 2.10 recompression method

In this respect, many algorithms have been proposed, such as Tajime's 2-D adaptive DPCM in the pixel domain [10] and Lee's modified Hadamard transform with Golomb-Rice (GR) coding [11]. However, frame recompression needs extra circuits and requires additional execution cycles to compress data, so the throughput of the video decoder degrades. Memory reorganization is described by De Greef [12]. In addition, the Interuniversity Microelectronics Centre (IMEC) has widely applied this idea to an H.264 decoder system [13], MPEG-4 motion estimation [14], and video decoding [15]. The concept of a memory hierarchy [16] combined with a merged structured frame memory can achieve data reuse and reduce redundant data accesses. However, these works only focus on C-level simulation. If we want to implement them in an ASIC design, many issues still have to be overcome. As a further development, Chang's combined frame memory architecture [17] can reduce the frame memory size by up to 57 % and the average latency by up to 83 %.

Table 2.1 Related works of SDRAM memory controller

Related work | Application | Improvement | Technique
Rixner | General-purpose | |
Lee | Multimedia SoC, multiple-channel | |
Park | HDTV decoder, multiple-channel | |
Heithecker | Image process, multiple-channel | Bandwidth utilization, latency | Schedule
Kang | H.264 decoder, multiple-channel | Bandwidth utilization | Dual controller & dual bus

2.3 Modern DRAM Development

Since DRAM was invented, the demands on its performance have grown very quickly. The most important issues are bandwidth, latency, and power. This section introduces the DRAM developments that improve performance and the future trends of DRAM.

2.3.1 Bandwidth

The improvement of DRAM bandwidth has never satisfied increasingly complicated applications such as multimedia and 3D processing. To fulfill the demand for high bandwidth, DRAM manufacturers have announced various new DRAM specifications. The SDRAM standards supported by JEDEC [18] have become the mainstream of the DRAM market. Several techniques have been applied in the latest standards announced by JEDEC to provide users with higher bandwidth.

Fig. 2.11 Operating frequency of SDR, DDR-1, DDR-2

Fig. 2.11 shows the operating frequency of each component of SDR, DDR-1, and DDR-2. As we can see, in SDR the core and the I/O run at the same frequency, and data is transferred at the positive edge of the clock. In DDR-1, a 2-bit data prefetch technique is applied, and data is transferred at both the positive and the negative edge of the clock, so the data rate of DDR-1 is twice that of SDR. In DDR-2, the prefetch is further extended to 4 bits. The I/O frequency is twice that of DDR-1, so DDR-2 can provide twice the bandwidth of DDR-1. The prefetch technique thus enables DRAM to provide four times the bandwidth of SDR while the core frequency remains unchanged.
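The relation between prefetch depth and data rate can be checked with a short calculation. The core frequency of 133 MHz is an assumed example value, SDR is treated as a 1-bit prefetch for the purpose of the comparison, and the function name is hypothetical.

```python
def data_rate_mbps(core_mhz, prefetch_bits):
    """Per-pin data rate when `prefetch_bits` are fetched per core cycle."""
    return core_mhz * prefetch_bits


if __name__ == "__main__":
    core_mhz = 133  # assumed core frequency, identical for all three generations
    for name, prefetch in (("SDR", 1), ("DDR-1", 2), ("DDR-2", 4)):
        print(f"{name}: {data_rate_mbps(core_mhz, prefetch)} Mbit/s per pin")
```

With the same core frequency, the per-pin data rate scales directly with the prefetch depth, giving the factor of two for DDR-1 and four for DDR-2 relative to SDR.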

In SDR and DDR-1, row activation takes a period of time called tRCD before a column access command can be issued. This may induce overhead due to bus conflicts. For example, Fig. 2.12 shows the difference between DDR and DDR-2. The top part is the timing diagram of DDR. Because of a bus conflict, the third active command cannot be issued at the 4th cycle. It must wait one cycle, which creates a bubble on the data bus. The bottom part of Fig. 2.12 is the timing diagram of DDR-2. DDR-2 allows users to issue the subsequent read command right after an active command. The read command is buffered inside the DDR-2 device and is not processed until the tRCD latency has elapsed. Thus the bus conflict is prevented, the bubble is eliminated, and the bandwidth utilization improves.

Fig. 2.12 Timing diagram of DDR & DDR-2

2.3.2 Latency

The DRAM response latency directly influences the speed of the whole system, and system speed is essential for multimedia processing to meet real-time requirements. A shorter DRAM latency therefore boosts the performance of the whole system. However, the situation is not as good as expected. Fig. 2.13 compares the performance trends of CPU and DRAM. While the CPU clock speed increased 7.65 times, DRAM latency improved only 4.6 times. CPUs improve much faster than DRAM, so a long DRAM response latency makes the CPU waste its processing power on waiting, and the performance is limited.

Fig. 2.13 CPU VS DRAM performance

Since 2000, DRAM manufacturers have begun to pay attention to the impact of DRAM latency and have proposed several solutions, including RLDRAM (Reduced Latency DRAM) [19], FCRAM (Fast Cycle RAM) [20], and Network DRAM [21]. In the following, we introduce RLDRAM as an example.

RLDRAM-1, which mainly targets applications requiring short latency and high bandwidth, was announced by Micron in 2001. RLDRAM-2 was announced in 2003 and can be regarded as an enhanced version of RLDRAM-1. Some features of RLDRAM-2 are listed as follows.

Traditional DRAM is divided into 4 banks, while RLDRAM can be divided into 8 banks. Successive accesses to the same bank cost more latency than successive accesses to different banks, as shown in Fig. 2.14 and Fig. 2.15. If the number of banks increases, the proportion of accesses that fall in different banks increases, so the latency of RLDRAM is shorter than that of traditional DRAM. In addition, the row cycle time (tRC) defines the shortest time interval needed between two successive ACTIVE commands addressed to the same bank. The row cycle time is 20 ns in RLDRAM and 72 ns in RDRAM. When a row miss happens, RLDRAM can therefore save much access latency.

RLDRAM has another latency advantage. By separating the data buses for read and write data, RLDRAM-2 can effectively reduce the latency caused by bus turnaround cycles.

Fig. 2.14 Accesses addressed to same bank

Fig. 2.15 Accesses addressed to different bank
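The benefit of a larger bank count can be estimated with a simple expected-value calculation. The sketch assumes that successive accesses are independent and uniformly distributed over the banks, and the gap values of 8 and 2 cycles are hypothetical numbers chosen only to show the trend.

```python
def expected_access_gap(n_banks, same_bank_gap, diff_bank_gap):
    """Expected gap between two successive, uniformly random bank accesses."""
    p_same = 1.0 / n_banks  # probability that both accesses hit the same bank
    return p_same * same_bank_gap + (1.0 - p_same) * diff_bank_gap


if __name__ == "__main__":
    for n_banks in (4, 8):
        gap = expected_access_gap(n_banks, same_bank_gap=8, diff_bank_gap=2)
        print(f"{n_banks} banks: expected gap = {gap} cycles")
```

Real video access patterns are not uniformly random, so the numbers only indicate the direction of the effect: more banks make a same-bank conflict less likely.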

2.3.3 Power

In many applications of portable wireless devices such as mobile phones and PDAs, power consumption is a significant issue because battery life is limited. As multimedia applications become popular, the required memory size grows. Accordingly, designers often select DRAM as the main memory component. To reduce DRAM power, many low-power products have been developed, such as BAT-RAM from Micron [22] and Mobile-RAM from Infineon [23]. Low-power DRAM has several special features.

Low Operating voltage

Compared with SDR SDRAM, the operating voltage of low-power DRAM is lowered from 3.3 V to 1.8 V. Thus, the power consumption decreases significantly.
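The size of this saving can be estimated under the usual first-order assumption that dynamic power is proportional to the square of the supply voltage at fixed capacitance and frequency. This model is an assumption added for illustration and is not taken from the data sheets cited here; the function name is hypothetical.

```python
def dynamic_power_ratio(v_new, v_old):
    """Ratio of dynamic power after a supply change, assuming P ~ V^2."""
    return (v_new / v_old) ** 2


if __name__ == "__main__":
    ratio = dynamic_power_ratio(1.8, 3.3)
    print(f"dynamic power at 1.8 V is about {ratio:.0%} of that at 3.3 V")
```

Static and I/O power are ignored in this estimate, so the measured saving of a real device can differ.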

Output Driver Strength

Because low-power DRAM is designed for use in smaller systems that typically use point-to-point connections, an option to control the drive strength of the output buffers is provided. The drive strength should be selected based on the expected loading of the memory bus. There are four allowable settings for the output drivers: full strength, half strength, quarter strength, and one-eighth strength.

Temperature Compensated Self Refresh (TCSR)

Most of the time, mobile devices stay in standby mode, and the DRAM can enter self-refresh mode to save unnecessary power consumption. In self-refresh mode, the DRAM refreshes the data stored in its cells. The required refresh period is inversely proportional to temperature, but traditional DRAM supports only a single refresh period, which is chosen for the worst condition. In low-power DRAM, a temperature sensor is implemented to automatically control the self-refresh oscillator on the device. Therefore, the refresh current decreases when the temperature is low, as shown in Table 2.2.

Table 2.2 Current of different condition

Partial Array Self Refresh

For further power savings during SELF REFRESH, the partial array self refresh (PASR) feature enables the controller to select the amount of memory that will be refreshed during SELF REFRESH. The options are shown in Table 2.3. The current can be reduced as shown in Table 2.2.

Table 2.3 Amount of memory to be refreshed

Stopping the external clock

One method of controlling the power efficiency in applications is to throttle the clock that controls the SDRAM. There are two basic ways to control the clock:

1. Change the clock frequency when the data transfers require a different rate of speed.

2. Stop the clock altogether.

Both of these are specific to the application and its requirements, and both allow power savings due to possibly fewer transitions on the clock path.

The clock can also be stopped altogether if there are no data accesses in progress, either WRITE or READ, that would be affected by this change; i.e., if a WRITE or a READ is in progress, the entire data burst must be through the pipeline prior to stopping the clock.

For the full duration of the clock stop mode. One clock cycle and at least one NOP is required after the clock is restarted before a valid command can be issued. Fig. 2.16 illustrates the clock stop mode.

It is recommended that the DRAM be in a pre-charged state if any changes to the clock frequency are expected. This will eliminate timing violations that may otherwise occur during normal operations.

Fig. 2.16 Clock stop mode

Power-Down

Power-down can occur when all banks are idle; this mode is referred to as precharge power-down. If power-down occurs when there is a row active in any bank, this mode is referred to as active power-down. Entering power-down mode deactivates all input and output buffers, so power is saved.

Deep Power-Down

Deep power-down (DPD) is an operating mode used to achieve maximum power reduction by eliminating the power of the memory array. Data will not be retained when the device enters deep power-down mode. Since DRAM is often used as a temporary data buffer, entering DPD mode while the device is in standby mode does not cause any loss.

2.4 Future trend of DRAM

DDR-3

Besides DDR-2, the draft of the DDR-3 industry standard was announced by JEDEC in Oct 2002. DDR-3 SDRAM, which is due to enter volume production in 2006, operates from a 1.5-volt supply and can transfer data at up to 1,066 Mbits per second. That's twice as fast as DDR2, which is just coming into the market in volume,
