適用於H.264解碼器的可調式雙層外部記憶體管理器

(1)

國

立

交

通

大

學

電子工程學系電子研究所碩士班

碩

士

論

文

適用於 H.264 解碼器的可調式雙層外部記憶體管理器

A Flexible Two-Layer External Memory Management for

H.264/AVC Decoder

研究生：張長軒

指導教授：黃威教授

(2)

適用於 H.264 解碼器的可調式雙層外部記憶體管理器

A Flexible Two-Layer External Memory Management for H.264/AVC Decoder

研究生：張長軒 Student：Chang-Hsuan Chang

指導教授：黃威教授 Advisor：Prof. Wei Hwang

國立交通大學

電子工程學系電子研究所

碩士論文

A Thesis

Submitted to Department of Electronics Engineering & Institute of Electronics College of Electrical Engineering and Computer Engineering

National Chiao Tung University in partial Fulfillment of the Requirements

for the Degree of Master

in

Electronics Engineering

June 2006

Hsinchu, Taiwan, Republic of China

(3)

適用於

H.264 解碼器的可調式雙層記憶體管理器

學生：張長軒指導教授：黃威教授

國立交通大學電子工程學系電子研究所

摘要

在 H.264 解碼器中有大量的資料需要去存取記憶體,外部記憶體資料存取所需的時間和功率消耗對於整個 H.264 系統的效能影響很大,因為外部記憶體受到頻寬的限制所以外部記憶體的效能對於系統仍然是個瓶頸.在 H.264 解碼器中為了達到即時播放的效果,如何設計一個良好的記憶體子系統變的十分重要. 在這篇論文中,我門針對 H.264 解碼器提出了一個計憶體子系統,這個子系統包含了一個可調變的記憶體控制器,靜態隨機存取記憶體,雙倍數同步動態隨機存取記憶體,和一條匯流排. 子系統裡面的記憶體控制器除了要確保資料傳輸的正確性以外還要能改善外部記憶體的效能,我們所提出來的記憶體控制器共分成兩層,第一層是用來做位址轉換,這一層提供了一個存取記憶體的位址產生方法,這個方法可以就由減少記憶體中誤失的數目來達到節省功率的效果. 第二層是外部記憶體介面,除了產生適當的命令給記憶體外還可以降低因為記憶體誤失所耗費的時間,這個記憶體介面可以用來控制單倍數和雙倍數同步動態隨機存取記憶體,因為這個外部記憶體介面的可調變性,使用者可以快速的把這個介面隨著不同型態的外部記憶體整合到系統裡. 這個 H.264 解碼器中的記憶體控制器可以減少資料存取延遲時間到大約 30%,而且頻寬的使用率可以從 42%提升到 51%左右.

(4)

A Two-Layer Flexible External Memory Management for

H.264/AVC Decoder

Student : Chang-Hsuan Chang

Advisors : Prof. Wei Hwang

Department of Electronics Engineering & Institute of Electronics

National Chiao-Tung University

ABSTRACT

In the H.264/AVC decoder there are large amount of data need to be fetched to/from the off-chip memory. The latency of accessing data and power consumption in the off-chip memory greatly affect the performance of the whole system. The performance of the off-chip memory is still the bottleneck of the video process due to the limited bandwidth. The real-time requirement in H.264/AVC decoder at level 4 result in the request of a well designed memory sub-system.

In this thesis, we proposed a memory sub-system for the H.264/AVC decoder. This memory sub-system contains a flexible memory controller, SRAM, AHB-bus, and DDR SDRAM.

This memory controller inside the memory sub-system not only keeps the correctness of the data transfer between the module and DRAM but is also responsible to improve the performance of the external memory. The proposed memory controller consists of two layers. The first layer is address translation which provide an efficient memory map method. This layer is designed to decrease the number of row-miss status and bank-miss status. By this way the power consumption can be saved.

The second layer is external memory interface. Besides generate appropriate commands to the off-chip memory, this layer is used to increase bandwidth utilization and decrease the latency induced by the row-miss status and bank-miss status. This external memory interface is design to control the SDR SDRM and DDR SDRAM. Due to the flexibility of this EMI, the users can rapidly integrate this EMI into their system design with various kind of external memory.

The experimental results of a H.264 decoder show that the proposed controller can further reduce access latency by approximately 30% and the memory utilization is

(5)

Acknowledgements

我要感謝我的老師黃威教授這兩年對我的指導和鼓勵,在研究過程中提供了很多方向和指引,才讓我的研究可以順利完成,特別感謝老師能讓我同時結合記憶體和多媒體這個研究領域,讓我這兩年的研究雖然辛苦但是充滿了樂趣. 另外要特別的感謝就是跟我同一個團隊的老師,學長和同學,要謝謝蔣迪豪教授讓我加入他們研究的 H.264 (TLM)團隊.使我有機會可以接觸多媒體這一塊領域.在團隊工作期間,成功大學的李國君教授給了我門團隊很多的指導,我們的團隊的主導人彭文孝學長,日前已經到資工系當教授了,更是提供了很多不同的方向讓我們來思考,從學長的身上學到很多東西. 另外要特別由衷感謝給我最多幫助的同一個團隊的博班學長李志鴻和陳治傑同學,在研究過程給了我相當多的支援與協助,才使得我的論文可以順利完成接下來要感謝同一個實驗室的黃柏蒼學長,華重憲學長,和張銘宏學長,在我的研究過程幫助了我很多也教導了我很多,從學長身上得到很多寶貴的建議. 最後要感謝我的家人和朋友在研究過程給我的打氣與鼓勵以及關心,讓我的研究過程雖然辛苦但還是充滿歡樂.

(6)

Chapter 1 Introduction...1

1.1 Background ...1

1.2 Motivation...2

1.3 Organization...3

Chapter 2 Overview of External Memory Organization ...5

2.1 DRAM characteristic ...5

2.1.1 DRAM architecture...5

2.1.2 DRAM command and operation ...6

2.2 Past DRAM controller techniques ...8

2.2.1 Techniques and Improvement ...9

2.3 Modern DRAM Development ...15

2.3.1 Bandwidth ...15

2.3.2 Latency...17

2.3.3 Power ...19

2.4 Future trend of DRAM...22

Chapter 3 Transaction Level Modeling of H.264/AVC Decoder ...25

3.1 Overview of H.264...25

3.1.1 Introduction...25

3.1.2 Characteristic of H.264 ...27

3.2 Transaction Level Modeling ...29

3.2.1 Introduction of TLM ...29

3.2.2 Design flow with TLM ...31

3.3 Transaction Level Modeling of H.264 Decoder ...34

3.3.1 Specification ...34

3.3.2 System Architecture of H.264 decoder ...36

3.3.2.1 Video Pipe...37

3.3.2.2 System Schedule ...39

3.3.2.3 System Modeling ...41

3.4 Memory Controller in H.264 Decoder...42

Chapter 4 External Memory Interface ...45

4.1 Concept of EMI...46

4.2 Architecture of EMI ...47

4.2.1 FIFO...48

4.2.2 Mode Control ...52

4.2.3 Finite States Machine and schedule block ...54

(7)

4.2.3.2 Schedule block ...56

4.2.4 Counter and Timing Checker ...57

4.3 The external memory in H.264 decoder...59

4.4 Analysis...61

4.4.1 Access Latency...61

4.4.2 Power Estimation ...61

4.4.3 Auto-pre-charge Method Effect ...63

4.4.4 FIFO Size Effect ...66

4.5 Summary ...69

Chapter 5 Address Translation Machine & Memory Subsystem for

H.264 Decoder ...71

5.1 Introduction...71

5.2 Memory Subsystem Architecture...74

5.3 Data Arrangement ...76

5.3.1 Memory Mapping Method...76

5.3.2 Latency Estimation ...80

5.3.3 Data Bus Schdule...84

5.4 Address Translation...86

5.5 Analysis & Simulation Result...87

5.5.1 Design for worst case...87

5.5.2 Design for average case ...96

5.6 Summary ...99

Chapter 6 Conclusions and Future work ...101

6.1 Conclusions...101

6.2 Future Work ...102

(8)

List of Figures

Fig. 2.1 Simplified architecture of a DRAM. ...5

Fig. 2.2 Bank state diagram. ...7

Fig. 2.3 Read command CAS latency. ...8

Fig. 2.4 Write command (SDRAM)...8

Fig. 2.5 Write command (DDR). ...8

Fig. 2.6 Data placement ...10

Fig. 2.7 Mode control prediction ... 11

Fig. 2.8 Mode control memory controller... 11

Fig. 2.9 2-stage scheduler memory controller ...12

Fig. 2.10 recompression method ...13

Fig. 2.11 Operating frequency of SDR, DDR-1, DDR-2...15

Fig. 2.12 Timing diagram of DDR & DDR-2...16

Fig. 2.13 CPU VS DRAM performance ...17

Fig. 2.14 Accesses addressed to same bank...18

Fig. 2.15 Accesses addressed to different bank ...18

Fig. 2.16 Clock stop mode ...21

Fig. 2.17 The required data rate in 3DG engine...24

Fig. 3.1 Scope of video coding standardization...26

Fig. 3.2 System model at different levels of abstraction...30

Fig. 3.3 Design floe with transaction level modeling ...32

Fig. 3.4 System architecture diagram ...36

Fig. 3.5 De-blocking process order...38

Fig. 3.6 Transaction level modeling at system level...41

Fig. 3.7 Relationship between H.264 decider and external memory ...42

Fig. 3.8 Architecture of the memory controller in H.264 decoder...43

Fig. 4.1 Architecture of my memory controller ...45

Fig. 4.2 Connection of EMI ...46

Fig. 4.3 Architecture of EMI...47

Fig. 4.4 Consecutive accesses to same row (burst length=4) ...48

Fig. 4.5 Consecutive accesses to different row (burst length=4) ...49

Fig. 4.6 Consecutive accesses to different row with auto-pre-charge ...49

Fig. 4.7 Command FIFO...50

Fig. 4.8 State diagram of bank-miss status ...52

Fig. 4.9 State diagram of row-hit and row-miss status ...53

Fig. 4.10 Mode control block...53

(9)

Fig. 4.12 Finite State Machine ...55

Fig. 4.13 Command with schedule and without schedule (two bank-miss) ...56

Fig. 4.14 Command with schedule and without schedule (two row-miss) ...57

Fig. 4.15 Read operation of SDR and DDR...60

Fig. 4.16 Write operation of SDR and DDR ...60

Fig. 4.17 Latency of different method for video process...64

Fig. 4.18 Latency of different method for random process ...65

Fig. 4.19 Latency of different method for video process_1...65

Fig. 4.20 Dynamic method 1 versus row-open method in the video process ...66

Fig. 4.21 Architecture of stall machine...67

Fig. 4.22 Respond latency v.s read-data-FIFO Size...67

Fig. 4.23 Respond latency v.s write-data-FIFO Size ...68

Fig. 5.1 The memory controller in H.264 decoder ...73

Fig. 5.2 Memory subsystem Architecture ...74

Fig. 5.3 Architecture of memory organization...76

Fig. 5.4 Frame map to memory...77

Fig. 5.5 Memory map to frame ...78

Fig. 5.6 Frame map to memory (1 or 2)...79

Fig. 5.7 Worst case of accessing a 4 x 4 luma block (P frame)...81

Fig. 5.8 Worst case of accessing a 8 x 8 luma block (B frame) ...82

Fig. 5.9 Read operation for motion compensation...82

Fig. 5.11 Layer 1 architecture ...86

Fig. 5.13 Display order & Decode order...89

Fig. 5.14 Latency of different frame ...93

Fig. 5.15 Active number of different frame ...94

Fig. 5.16 Pre-charge number of different frame ...94

Fig. 5.17 Bus utilization of different frame ...95

Fig. 5.18 The efficiency of DRAM access for MC...97

Fig. 5.19 The efficiency of DRAM access for MC,DB, and DEI...97

Fig. 5.20 Average cycle count of DRAM access per MB ...98

Fig. 5.21 The distribution of DRAM commands per MB...98

(10)

List of Tables

Table 2.1 Related works of SDRAM memory controller ...14

Table 2.2 Current of different condition ...20

Table 2.3 Amount of memory will be refreshed ...20

Table 3.1 Characteristics of different models ...31

Table 3.2 Level limits ...35

Table 3.3 System schdule...40

Table 4.1 Different method for receiving a picture...51

Table 4.2 Timing parameters of Micron Mobile DDR DRAM ...58

Table 4.3 Status of different process ...61

Table 4.4 Power of each state ...62

Table 4.5 State transition of different process...63

Table 5.1 Dram requirement for real time requirement in design of worst case...75

Table 5.2 Different memory mapping method for different memory configuration77 Table 5.3 Cycle count of each module in worst case ...83

Table 5.4 EMI setting...88

Table 5.5 Key parameters of micron Mobile-DDR SDRAM ...89

Table 5.6 Number of commands in each frame (Luma) ...90

Table 5.7 Some cases in the memory (Luma)...91

Table 5.8 Some cases in the memory (Luma)...92

(11)

Chapter 1 Introduction

1.1 Background

Multimedia processing technologies have been widely applied in many systems.

These technologies have not only provided existing applications like desktop

video/audio but also spawned brand new industries and services like digital video

recording, video-on-demand services, high-definition TV, etc. To support complex

multimedia applications, architectures of multimedia systems must provide high

computing power and high data bandwidth. Furthermore, a multimedia operation

system should support real-time scheduling.

Previous researches have shown that data transfer and storage of

multidimensional array signals dominate the performance and power consumption in

system-on-a-chip (SoC) designs. In particular, the data transfer to off-chip memory is

especially important due to the scarce resource of off-chip bandwidth. In fact,

although the tremendous progress in VLSI technology provides an ever-increasing

number of transistors and routing resource on a single chip, and hence allows

integrating heterogeneous control and computing functions to realize SoCs, the

improvement on off-chip communication is limited due to the number of available

input/output (I/O) pins and the physical design issues of these pins. As many recent

studies have shown, the off-chip memory system is one of the primary performance

(12)

1.2 Motivation

As the resolution of video-processing applications becomes high, video signal

processors should deal with a large amount of data within a tightly bounded time. Due

to the large quantities, video data are stored in off-chip memories that are usually slow,

and thus the system performance strongly depends on the memory bandwidth between

processors and external memories. For example, in decoding MP@HL video streams,

an MPEG-2 video decoder must feature 10-MB memory as frame buffers and should

access about 370 million data/s.

H.264/AVC is the newest international video coding standard. Relative to prior

video coding methods such as MPEG-2 video, H.264/AVC has the higher coding

efficiency. With an increasing number of services and growing popularity of high

definition TV are creating greater request for higher coding efficiency. Moreover,

other transmission media such as Cable Modem, XDSL, or UMTS offer much lower

data rates than broadcast channels, and enhanced coding efficiency can enable the

transmission of more video channels or higher quality video representations within

existing digital transmission capacities.

In the H.264 decoder, it process 245760 MB/s in maximum at level 4. In order to

support the complex mode in the H.264 decoder, there are even large data transfer

between H.264 decoder and external memory.

The issue of latency, bandwidth utilization, and power consumption are very

important in the video process. These issues in the video process are greatly

influenced by the memory subsystem. Hence a memory controller that can efficiently

communicate with external memory is the most significant part over the entire video

(13)

dominates the total amount of data transmission especially when SDRAM is adopted

as external frame memories. How to manage the data transfer and utilize the limited

bandwidth is the most important issue of the memory controller.

1.3 Organization

The organization of this thesis is as follows. An overview of external memory

organization is introduced in Chapter 2. Besides the DRAM architecture and basic

operation the past DRAM controller will be described. The DRAM development and

DRAM trend are also discussed here.

My memory controller is applied in the H.264 decoder. This H.264 decoder is

designed in transaction level model (TLM). The concept of TLM and the TLM model

of our H.264 decoder will be thoroughly explained in Chapter 3. The role of my

controller in this H.264 is mentioned here. There are two layers in my memory

controller. Layer 0 is used to control the memory. Layer 1 is used to improve the

performance in video process.

Chapter 4 presents layer 0 of my memory controller. This layer is called external

memory interface (EMI). The system can communicate with external memory by

using this EMI. The flexibility if this EMI can help users to integrate this EMI to their

design easily. The different setting of EMI will influence the performance. The

improvement on performance including latency and power will be compared here.

The experimental results of different setting will also be listed in this chapter.

The whole memory subsystem and the layer 1 of my memory controller which is

(14)

needs large amount of data transfer. The efficiency of the memory controller is very

important. To deal with the bandwidth loss which degrades the performance, the layer

1 is combined into my memory controller. We have proposed a memory map method

by using the characteristic of the H.264 process, and Layer 1 is used to do the address

translation. This address translation machine not only reduces the power consumption

on the bus line but also increases the bandwidth utilization. Finally, the conclusions

(15)

Chapter 2 Overview of External Memory

Organization

In this chapter, it introduces the overview of external memory organization for video process. Firstly, DRAM characteristic is described in section 2.1. Then, section 2.2 discussed the techniques which were proposed in the past used in memory controller. In addition, the design trends of modern DRAM is presented in section 2.3.

2.1 DRAM characteristic

2.1.1 DRAM architecture

DRAM architecture is usually composed of the data memories, address decoders,

row buffer, mode register, data buffer. Fig. 2.1 shows a simplified block diagram

(16)

Four bank share the address bus and command bus. Each bank has its own row

decoder, column decoder, and sense amplifier. The mode register stores the DRAM

operation mode, including burst length (BL), column address strobe latency (CL), and

burst type, etc. Users can set the value of the mode register through address bus with

proper command.

2.1.2 DRAM command and operation

The normal commands and its operation used in DRAM will be introduced as

follow.

NO OPERATION (NOP):

The NOP command can prevent unwanted commands from being registered

during idle or wait states. Operations already in progress are not affected.

ACTIVE:

This command is used to open a row in a particular bank. The row remains open

for accesses until a PRECHARGE command is issued to that bank.

READ/WRITE:

The read/write command is used to initiate a read/write access to an active row,

if auto precharge is selected, the row being accessed will be closed at the end of read.

PRECHARGE:

The precharge command is used to deactivate the open row in a particular bamk.

The bank will be available for a subsequent row access a specified time (tRP)

REFRESH:

The refresh command can be used to retain data in the DRAM.

A memory access operation, which simplified state diagram is depicted in Fig.

(17)

Fig. 2.2 Bank state diagram.

The active command opens a particular row in one of the bank, and copies the

row data into the row buffer. The active command needs a latency period called tRCD

to accomplish this operation. Then, after tRCD delay a column access command (read

/ write) can be issued to sequential access data or single data according to the burst

length and burst type set in the mode register. During the tRCD time, no other

commands can be issued to the bank. However, commands to other banks are

permissible due to the parallel processing capability of each bank. For read operation

the valid data-out from the starting column address will be available following the

CAS latency after the read command, as shown in Fig. 2.3. For write command in

SDRAM, the first data-in is coincident with the write command, as shown in Fig. 2.4.

For write command in DDR DRAM, the first data-in is sent to DQ [1] strobe after a

DQS [1] delay, as shown in Fig. 2.5. Finally a precharge command must be issued

(18)

Fig. 2.3 Read command CAS latency.

Fig. 2.4 Write command (SDRAM).

Fig. 2.5 Write command (DDR).

2.2 Past DRAM controller techniques

In the design of system-on-a-chip such as portable wireless device and

multimedia system, several factor such as increased system complexity,

time-to-market pressure, cost effectiveness, and various functionality requirement

have made the trend of system-on-a-chip design indispensable. In general, SoC device

(19)

functional blocks. As the SoC integrate more functional block and need high

performance, high data bandwidth is required to meet a given system specification.

Similarly Because of the tremendous data transfer and storage in video process,

software or hardware must provide high data bandwidth to achieve the real-time

request in multimedia system. External memory has the largest data capacity so it is

often used as the frame memory in the video process. Nevertheless, the data transfer

to off-chip memory is bound to the limited bandwidth.

Many external memory controllers have been proposed to improve the memory

bandwidth utilization and achieve efficient memory access. In this section, some

important techniques used in memory controller will be introduced.

2.2.1 Techniques and Improvement

According to the environment, the controllers can be categorized into two classes:

single channel and multiple channel environments. For single channel environment,

Rixner memory access scheduler [2] reorder the access addresses from each bank

controller and sends command to DRAM through arbiter. It can reduce the latency

after reorder the address. However, the output command may be out-of-order, many

command FIFOs and extra circuits are required to reorder commands and addresses.

Miura dynamic-SDRAM-mode-control scheme [3] eliminates the above

disadvantage and it can both reduce operating current and the latency of an SDRAM.

Nevertheless, it only supports scheduling of single-channel. For multi-channel

environment, Lee advances a quality-aware memory controller [4]. It supports

different scheduling policies according to the current channel situation. These

(20)

Concerning the particular-purpose memory controllers for the video codec

application, several papers have been proposed on improvement of performance. They

focus on the power consumption, latency (speed), and bandwidth utilization.

Kim memory interface architecture [5] reorganizes data arrangement in SDRAM

The interface reduces energy consumption and increase memory bandwidth by

placing the data in the same block in the same row. Thus, when accessing the block,

the row-hit rate increases. The time and power waste on pre-charging the rows are

decreased. Fig. 2.6 (a) is the original placement, and Fig. 2.6 (b) is the new placement

after data arrangement.

Fig. 2.6 Data placement

Park history-based memory mode control [6] reduce row-miss rate to achieve

23.3 % reduced energy consumption and 18.8 % reduced memory latency. It uses

history-based prediction to predict the next command is row-hit or row-miss. The

(21)

the current row stay in the active state if it predict the next command is row-hit. The

proposed memory controller is shown in Fig. 2.8.

Fig. 2.7 Mode control prediction

Fig. 2.8 Mode control memory controller

Zhu SDRAM controller in H.264 HDTV Decoder [7] focus on memory mapping

and data arrangement in SDRAM to reduce the row active cycle, it also improves

throughput and provide less power consumption. However, it adopts auto precharge

(22)

access latency.

Heithecker proposed a mixed QOS SDRAM controller [8]. The controller uses

two types of scheduler to do the memory access. It can improve the memory

bandwidth and latency for multiple access streams with different access sequence

types running in parallel. Fig. 2.9 shows the overall controller structure with two

scheduler variants.

Fig. 2.9 2-stage scheduler memory controller

Kang proposed a memory controller in the MPEG-4 AVC/H.264 decoder [9]. It

uses a dual memory controller and dual bus to improve the memory bandwidth

The above memory control technique concentrate on three part, including

memory scheduler, mode control, and data arrangement in SDRAM. The above

discussion of related controllers and its techniques is summarized in Table 2.1.

Besides the techniques mentioned above, there are still other techniques can

apply on the video process such as frame recompression and frame memory

reorganization. Frame recompression is recompressing data before storing to memory,

and decompression is required when reading data from memory, as shown in Fig.

(23)

Fig. 2.10 recompression method

In this respect, many algorithms, such as Tajime [10] 2-D adaptive DPCM in

pixel domain, and Lee [11] modified Hadamard transform and Golomb-Rice (GR)

coding, etc have been proposed. However, frame recompression method need extra

circuit and require additional execution cycles to compress data such that the

throughput of video decoder degrades. For the memory reorganization can be found in

De Greef [12]. Beside, Interuniversity MicroElectronic Center (IMEC) widely

exploited this idea to H.264 decoder system [13], MPEG-4 motion estimation [14]

and video decode [15]. The concept of memory hierarchy [16] combined with

merging structured frame memory can achieve data reuse and reduce the redundancy

if data access. However, they only focus on C level simulation. If we want implement

on ASIC design, many issues still have to be overcome. For advanced development,

Chang combined frame memory architecture [17] can reduce frame memory size up

(24)

Table 2.1 Related works of SDRAM memory controller

Related Application Improve Technique

Rixner General-purpose single-channel

Bandwidth

Latency

Memory schedule

Miura 32 bit RISC CPU

single channel STB

Latency

Power

Memory

mode control

Lee Multimedia SoC

multiple-channel STB Bandwidth Latency Memory schedule Kim Video single channel Power Bandwidth Data arrangement Park HDTV decoder Multiple-channel Latency power Memory mode control Zhu H.264 decoder Multiple-channel Bandwidth utilization Decoding throughput Data arrangement

Heither Image process

Multiple-channel Bandwidth utilization Latency schedule Kang H.264 decoder Multiple-channel Bandwidth utilization Dual controller

(25)

2.3 Modern DRAM Development

From the day DRAM has been invented, the requests of performance accelerate

very fast. The most important issues are bandwidth, latency, and power. This section

will introduce the development of DRAM that improve the performance and the

future trend of DRAM.

2.3.1 Bandwidth

The improvement of DRAM bandwidth has never satisfied the increasingly

complicated application such as multimedia and 3D processing. To fulfill the demand

for high bandwidth, various new DRAM specifications have been announced by

DRAM manufacturers. The SDRAM standards supported by JEDEC [18] have

become the mainstream of DRAM market. Several techniques have been applied on

the latest standards announced by JEDED to provide users higher bandwidth.

Fig. 2.11 Operating frequency of SDR, DDR-1, DDR-2

Fig. 2.11 shows the operating frequency of each component of SDR, DDR-1, and

DDR-2. As we can see, in SDR, the core and the I/O are running at the same

(26)

data PREFETCH technique is applied. Data is transferred at positive and negative

edge of clock. The data rate of DDR-1 is twice as SDR. In DDR-2, the PREFETCH

technique is further promoted up to 4-bit. The I/O frequency is twice as DDR-1, so

DDR-2 can provide bandwidth twice as DDR-1 can. The PREFETCH technique

makes DRAM be able to provide quadruple bandwidth than SDR with core frequency

remains unchanged.

In SDR and DDR-1, row activation takes a period of time called tRCD before

column access command can be issued. This may induce overhead due to the bus

confliction. For example, Fig. 2.12 shows the difference between DDR and DDR-2.

The top part is the timing diagram of DDR. Because of the bus confliction The third

active command can not be issued at 4th cycle. It must wait a cycle therefore make a

bubble on the data bus. The bottom part of Fig. 2.12 is the timing diagram of DDR-2.

DDR-2 allows users to issue subsequent read command right after an active command.

The read command will be buffered inside DDR-2 and will not be processed until the

tRCD latency is reached. Thus the bus confliction is prevented, the bubble is

eliminated and the bandwidth utilization becomes better.

(27)

2.3.2 Latency

The DRAM response latency can directly influence the speed of the whole

system. The speed of the system for the multimedia process is very essential to

achieve the real-time request. So if the DRAM latency is shorter, the whole system

can boost its performance. However, the situation is not as we expected. Fig. 2.13

compares the performance trend of CPU and DRAM. While CPU clock speed

increases 7.65 times, DRAM latency also has a 4.6 times increase. The improvement

of CPU is much faster than the improvement of DRAM. Long response latency waste

its processing power on waiting and the performance is limited.

Fig. 2.13 CPU VS DRAM performance

Since Year 2000, DRAM manufacturers have begun to pay attention to the

impact of DRAM latency and proposed several solution including RLDRAM (reduce

latency DRAM) [19], FCRAM (Fast cycle RAM) [20], and Network DRAM [21] etc.

In the following we will introduce RLDRAM for example.

RLDRAM-1 which mainly aimed at application requiring short latency and high

bandwidth was announced by micron in 2001. RLDRAM-2 was announced in 2003

and can be taken as an enhanced version of RLDRAM-1. Some features if

RLDRAM-2 are listed as followed.

(28)

banks. The successive accesses to the same bank cost more latency than the

successive accesses to the different banks, shows in Fig. 2.14 and Fig. 2.15. So if the

number of banks increases, the rate of accessing different banks increases. So the

latency of RLDRAM is shorter than the traditional DRAM. On the other hand, Row

cycle time (tRC) defined the shortest time interval needed for two successive ‘active’

commands addressed to the same bank. The row cycle time is 20ns in RLDRAM and

72ns in RDRAM respectively. When the row miss happens, RLDRAM can apparently

save much access latency.

RLDRAM has another advantage of saving latency. It separates data bus for read

and write data, RLDRAM-2 can effectively reduce the latency caused by the

turnaround cycles.

Fig. 2.14 Accesses addressed to same bank

(29)

2.3.3 Power

In many application of portable wireless devices such as mobile and PDA, power

consumption is the significant issue because of battery life is limited. With the

application of multimedia becomes popular, the request of memory size is larger.

Accordingly, the designers often select DRAM to be the body memory component.

In order to reduce the power of DRAM, many products have been invented for

low power such as BAT-RAM from micron [22] and Mobile-RAM from Infineon [23].

The low-power DRAM has some special features inside.

Low Operating voltage

Compare with SDR SDRAM, the operating voltage of low-power DRAM is

lowered from 3.3v to 1.8v. Thus, the power consumption can significantly decreases.

Output Driver Strength

Because the low-power DRAM is designed for use in smaller systems that are

typically point-to-point connection, an option to control the drive strength of the

output buffers is provided. Drive strength should be selected based on expected

loading of the memory bus. There are four allowable setting for the output drivers,

including full strength driver, half strength driver, quarter strength driver, and

one-eighth strength driver.

Temperature Compensated Self Refresh (TCSR)

Most of the time mobile devices stay in standby mode and DRAM can enter sef

refresh mode to save unnecessary power consumption. In the self-refresh mode,

DRAM will refresh the data stored in the DRAM cell. The refresh period is inversely

proportional to temperature, traditional DRAM can only support single refresh period

(30)

implemented for auto control of the self refresh oscillator on the device. Therefore,

the refresh current is decreasing while the temperature is low, as shown in table 2.2.

Table 2.2 Current of different condition

Partial Array Self Refresh

For further power savings during SELF REFRESH, the PASR feature enables the

control to select the amount of memory that will be refreshed during SELF REFRESH.

The option is shown in Table 2.3. The current can be reduced as shown in Table 2.1.

(31)

Stopping the external clock

One method of controlling the power efficiency in applications is to throttle the

clock that controls the SDRAM. There are two basic ways to control the clock:

1. Change the clock frequency, when the data transfers require a different rate of

speed.

2. Stopping the clock altogether.

Both of these are specific to the application and its requirements and both allow

power savings due to possible fewer transitions on the clock path.

The clock can also be stopped altogether if there are no data accesses in progress,

either WRITE or READ, that would be affected by this change; i.e., if a WRITE or a

READ is in progress, the entire data burst must be through the pipeline prior to

stopping the clock.

For the full duration of the clock stop mode. One clock cycle and at least one

NOP is required after the clock is restarted before a valid command can be issued. Fig.

2.16 illustrates the clock stop mode.

It is recommended that the DRAM be in a pre-charged state if any changes to the

clock frequency are expected. This will eliminate timing violations that may

otherwise occur during normal operations.

(32)

Power-Down

Power down can occurs when all banks idle, this mode is referred to as precharge

power-down. If power down occurs when there is a row active in the bank, this mode

is referred as active power-down. Entering power-down mode deactivates all input

and output buffers, therefore the power is saved.

Deep Power-Down

Deep power down is an operating mode used to achieve maximum power

reduction by eliminating the power of the memory array. Data will not be retained

when the device enters power-down mode. Since DRAM is often used as temporary

data buffers, enter DPD mode while the device is in standby mode won’t cause any

loss.

2.4 Future trend of DRAM

DDR-3

Beside DDR-2, the draft of DDR-3 industry standard has been announced by

JEDEC in Oct 2002. DDR-3 SDRAM which is due to enter volume production in

2006, operates from a 1.5 volts supply and can transfer data at up to 1,066-Mbits per

second. That's twice as fast as DDR2, which is just coming into the market in volume,

and four times the speed of DDR. Samsung said it plans to make the chips using an 80

nanometer manufacturing process and, citing data from research house IDC, believes

(33)

DDR3 has some features that can improve the performance and more

cost-effective [24]. In the 8b-prefetch data-path with 3-stage pipelining, a newly

devised hybrid-type latency control scheme and a 2-step multiplexing can proficiently

handle maximum 128b parallel data in the case of x16 configuration. An efficient

protocol for the temperature read-out is proposed, supporting CPU while it controls

the heat in high-speed operations. Per-bank-refresh is another experimental feature of

this prototype, virtually removing the loss of the memory bandwidth due to the

unavoidable requirement of refresh operation common in all DRAM.

Embedded DRAM

Embedded DRAM (EDRAM) technology has significant advantage in terms of

performance, area, bandwidth, and power consumption by combining a high

bandwidth DRAM macro with logic/analog circuits on the same chip.

Embedded DRAM (EDRAM) macros have been proposed as a way to achieve

the low power and wide bandwidth required by graphic controllers, network systems,

and mobile systems. To give an example; Advanced 3D graphics (3DG) technology

will be used in console game machines [25], and it is desired to develop a rendering

controller chip which can handle real time 3D animation with true colors. Embedded

DRAM (EDRAM) [25] technology attracts attention of the 3DG systems, because

only EDRAM can satisfy the required data rate. Fig 2.17 shows the trend of 3DG

controller performance.

In the portable device such as mobile and PDA, the power consumption is a very

important issue. In off-chip design, the power consumption of off-chip

interconnection is larger one or two orders of magnitude than that of standalone

(34)

on-chip interconnection is reduced to the same order of that of EDRAM because

on-chip I/O capacitance is tremendously reduced. Therefore the power consumption

of connection can be saved.

Although embedded DRAM technology has significant advantage in SOC design

because of its performance such as low-power consumption and high bandwidth.

However, process gap between commodity DRAM and logic make it difficulty to

make highly reliable EDRAM product with low cost, and high yield. There are still

other problems need to overcome in the design of EDRAM. But the implementation

of EDRAM technology would be a trend in the design of SOC.

(35)

Chapter 3 Transaction Level Modeling of

H.264/AVC Decoder

The proposed controller is designed for the video process which need large

amount of data storage. We have a project to design a hardware architecture for the

H.264 decoder that confirms to high profile at Level 4 (HP@L4). As a result of the

complexity of the H.264 decoder, it needs a memory controller to communicate with

external memory efficiently. We have a cooperation to integrate my memory

controller into the H.264 decoder system design. The characteristic of H.264 and the

system design of this H.264 decoder will be introduced in this chapter. The role of my

memory control in this H.264 decoder system will also be described here.

3.1 Overview of H.264

3.1.1 Introduction

H.264/AVC is the newest international video coding standard. Relative to prior

video coding methods such as MPEG-2 video, H.264/AVC has the higher coding

efficiency. With an increasing number of services and growing popularity of high

definition TV are creating greater needs for higher coding efficiency. Moreover, other

transmission media such as Cable Modem, XDSL, or UMTS offer much lower data

rates than broadcast channels, and enhanced coding efficiency can enable the

(36)

existing digital transmission capacities.

The scope of the standardization is illustrated in Fig. 3.1, which shows the

typical video coding/decoding chain (excluding the transport or storage of the video

signal). Only the central decoder is standardized. By imposing restrictions on the

bit-stream and syntax, and defining the decoding process of the syntax elements such

that every decoder conforming to the standard will produce similar output when given

an encoded bit-stream that conforms to the constraints of the standard.

The new standard is designed for many application areas such as the following

example.

• Broadcast over cable, satellite, cable modem, DSL, terrestrial, etc.

• Interactive or serial storage on optical and magnetic devices, DVD, etc.

•Conversational services over ISDN, Ethernet, LAN, DSL, wireless and mobile

networks, modems, etc.

• Video-on-demand or multimedia streaming services over ISDN, cable modem, DSL,

LAN, wireless networks, etc.

• Multimedia messaging services (MMS) over ISDN, DSL, LAN, wireless and mobile

networks, etc.

(37)

3.1.2 Characteristic of H.264

The coding gain of H.264 is achieved by efficiently exploiting spatial and

temporal redundancy. For better temporal prediction, new coding tools such as

long-term prediction, multiple reference frames, motion compensation with variable

block size, in-loop filter, and 1/4-pel motion compensation are developed. In addition,

for exploiting spatial redundancy, an intra prediction technique is adopted. Further, to

reduce bit rate, a context-adaptive entropy coder is deployed. The following briefly

summarizes the features of each coding tool.

• Long-Term Prediction: The prediction of a picture can refer to a prior coded picture that is not right before the current one. For sequences with periodic content,

long-term prediction offers coding gain by having more flexibility on the selection of

reference picture.

• Motion Compensation with Variable Block Size and Multiple Reference

Frames: Motion compensation can be done by partitioning a macroblock into a few

number of sub-blocks and each sub-block can refer to a larger number of pictures that

have been coded and stored. The features of variable block size and multiple reference

frames offer better trade-off between texture and motion information as well as better

adaptation for macroblocks with varying characteristics.

• 1/4-pel and 1/8-pel Motion Compensation: The prediction can come from 1/4-pel samples (or 1/8-pel for chroma) that are generated by using the interpolation with

full-pel samples as input. The sub-pel motion compensation with higher accuracy

improves the prediction efficiency by reducing the aliasing from sampling.

• Intra Prediction: An intra-coded block can be predicted from the edges of the adjacent and previously-coded blocks. Particularly, the prediction can come from

(38)

different directions.

• Transform with Variable Block Size: The 4x4 integer transform and 8x8 DCT transform can be adaptively selected for a macroblock. The 4x4 integer transform can

remove ringing artifact while the 8x8 DCT provides higher coding efficiency for

smooth area. In addition, a double transform could be applied for the DC coefficients

belonging to the 16 4x4 blocks within a macroblock.

• Context-Adaptive Entropy Coding: The entropy coding is done in a context-adaptive manner. The value of prior coded syntax elements (or bins) could be

used to select the probability model or table for the coding of following syntax

elements (or bins). Higher coding efficiency is achieved by using conditional

probability models.

• In-loop De-Blocking Filter: A de-blocking filter is placed in the prediction loop to remove the blocking artifact for the reference picture so as to improve the quality of

the reference picture and prediction efficiency.

While more correlations are used for coding, it suggests that stronger data

dependency exhibit between successive computations and more buffers are required.

Moreover, the very different types of predictors imply that intensive computations are

inevitable. Also, the heterogeneous building blocks and operations bring new

challenges to a system design such as synchronization, data flow control, error

handling, buffering, software/hardware concurrency, and so on. With these design

challenges, the SoC implementation for H.264 codec becomes much more difficult

than prior coding standards. Due to the complexity of H.264, a proper top-level

(39)

regression would be time-consuming and the loss of cost is significant if the system

architecture has any errors. The following introduces a new SoC design philosophy,

transaction level modeling, which allows us to explore the design spaces at system

level by providing trade-off between implementation details and simulation accuracy.

3.2 Transaction Level Modeling

3.2.1 Introduction of TLM

As described in the previous section, the traditional design methodology can not

satisfy the need for the design of complex system. The reason is that many

unnecessary implementation details are captured for the system-level modeling. Thus,

the simulation speed could be so slow that the verification at system level may not be

done thoroughly. Recently, a modeling technique called transaction level modeling

(TLM) is proposed to achieve the system-level modeling. The idea is to make another

level of abstraction between the system specification and its RTL implementation so

that unnecessary implementation details can be hid from the system-level modeling.

As far as the system is concerned, the implementation details for each component is

not the most important in the early development phase. Instead, the system parameters,

such as the partition of the tasks, the functionality of each component, the topology

that connects different components, the communication protocol between components,

the memory hierarchy, and so on, are of more interest.

There are four types of TLM including (1) the PE-assembly model, (2) the

bus-arbitration model, (3) the cycle-accurate computation model, and (4) the

timing-accurate communication model.

(40)

level is the specification model and the bottom level is the implementation model. The

marked level between specification model and implementation model are transaction

levels. According to the modeling accuracy in computation and communication, each

model represents an operating point in Fig. 3.2 (b), where the bottom-left corner

stands for the specification of the system while the top-right corner denotes the

detailed implementation at register-transfer level. Particularly, only the four modules,

PE-assembly model, bus-arbitration model, time-accurate communication model, and

cycle-accurate computation model, are considered as the TLM.

specification model PE-assembly model Bu s-arbitration model Time-accurate communication model Cycle-accurate computation model Implementation model More Accurate Computation C o m m u n icat io n More Accurate Un-Timed Un-Timed Approxiate -Timed Approxiate -Timed Cycle-Timed Cycle-Timed SM TLM TLM TLM TLM RTL (1) (2) (3) (4) (1) (2) (3) (4)

(a)

(b)

Fig. 3.2 System model at different levels of abstraction

Table 3.1 indicates the characteristics of different system models. As shown,

different models capture different degrees of accuracy in computation and

communication. The specification model and the implementation model represent the

(41)

The models in between are the four types of TLM, which offers the flexibility on

selecting the simulation accuracy and speed.

Table 3.1 Characteristics of different models

Models communication time computation time

communicatoin

schene PE interface

added Implementation detail

Specificaton model no no variable (no PE)

-PE-assembly model no approximate message-passing

channel abstract

PE allocation, process PE mapping

Bus-transaction model approximate approximate abstract bus

channel abstract

bus topology, bus arbitration

Time-accurate

communication model time/cycle accurate approximate

detailed bus

channel abstract detailed bus protocol

Cycle-accurate

computation model approximate cycle accurate

abstract bus

channel pin accurate RTL/ISS PEs

Implementation model cycle accurate cycle accurate wire pin accurate detailed bus protocol or RTL/ISSPEs

3.2.2 Design flow with TLM

Traditional design flow can not ensure the quality of the design when the system

complexity increases dramatically. This section presents a new SoC design flow with

TLM as the platform for concurrent software and hardware development. The new

design flow mainly comprises two parts, which are (1) the new system-to-RTL

extension and (2) the traditional RTL-to-layout flow. The first part is different from

(42)

Requirement Definition Specification Development

Specification Model

System Architecture and TLM Development TLM RTL HW Refinement SW Design and Development HW Verification Environment Development

Fig. 3.3 Design floe with transaction level modeling

Fig. 3.3 indicates the new system-to-RTL extension. As shown, after the

specification is defined, the system architecture is developed and verified by using

TLM. Upon the completeness of the TLM model, it is used as a unique reference to

both software and hardware teams. For the software team, the embedded software is

developed and verified based on the TLM model. For the hardware team, the TLM

serves as the golden model for the detailed implementation. Along with the

development of software and hardware, the TLM model can be annotated with more

accurate timing information. Consequentially, not only the functionality but also the

timing can be verified together. Differs from the traditional design flow, the new

(43)

is the key for ensuring the quality of the design. The following summarizes the

functionality of TLM in the SoC design flow:

1. Verification model for design space exploration.

2. Platform for early software development.

3. Specification and golden model for hardware development.

Nowadays, EDA tools are still not capable of automatically converting TLM to

detailed hardware implementation. The hardware refinement is still done through a

traditional paper specification and RTL coding. TLM appears to be an extra workload

and unnecessary task. However, it still brings many benefits that significantly reduce

the time to market:

1. System integration at the early stages so that the potential problems can be found

and solved earlier.

2. Faster simulation speed while maintaining the accuracy of simulation.

3. Concurrent software and hardware development.

4. Platform for software/hardware co-design and co-verification.

5. Incremental hardware refinement and implementation details by means of hybrid

abstraction level modeling.

For the implementation of TLM, we present the System C library. In the System

C, there are 3 major components, which are process, channel, and interface function.

The processes define the operations of a component and can be triggered by a set of

predefined events. In addition, the channel specifies the connections among different

components and the interface function provides the means for a component to

communicate with the others. Within a component, the processes can call the interface

(44)

interface functions. Consequentially, by well defining the interface functions, the

communication part and the computation part can be developed and refined

independently.

Due to the benefits of the TLM, it will become a critical step in the SoC design

flow. In the next section, the system architecture of our H.264 video decoder that is

designed by using will be introduced.

3.3 Transaction Level Modeling of H.264 Decoder

In most video coding standard, different profile level will support different

coding tools such as transform8x8, supporting frame size, and MBAFF. As a result,

the profile level definition is very important to the complexity and cost of the overall

system architecture. This section will describe the profile level of our H.264 decoder

first. The system architecture designed in TLM model will be introduced next.

3.3.1 Specification

In H.264 standard, there are many different profile levels which contain different

coding tools to improve the coding efficiency. Thus, different decoder design

supporting for differnet profile will be different in performance and cost. This section

presents a design specification of decoder conforming to high profile at level 4. Any

bit-streams conforming to main/high profile with a level lower than or equal to 4 shall

be decoded. Specifically, the decoder supports the decoding throughput up to

1920x1080i@60Hz. In the following, some properties of high profile and level limits are listed.

(45)

2. No data partition.

3. Arbitrary slice order is not allowed.

4. No slice group and no redundant picture.

5. chroma_format_idc in the range of 0 to 1.

6. bit_depth_luma_minus8/bit_depth_chroma_minus8 equal to 0 only.

7. qpprime_y_zero_transform_bypass_flag equal to 0 only.

8. Up to 16 reference frames. (32 reference fields).

9. Vertical motion vector range does not exceed MaxVmvR as in Table 3.1.

10. Horizontal motion vector range does not exceed the range of -2048 to 2047.75

11. Up to 32MVs perMB.

12. Number of bits per macroblock is not greater than 3200.

Moreover, Table 3.2 shows more constraints of different profiles. With

summarizing these profile limits, we can start to design our micro-architecture of

H.264 decoder to satisfy all functionalities while minimizing the cost.

(46)

3.3.2 System Architecture of H.264 decoder

Fig. 3.4 shows the overall architecture of this system, which is developed based

on the ARM platform. For the chip I/O, the compressed bit-stream is input via a

hardware interface, which communicates with the host by a bridge, and the decoded

frames are output to the monitor via HDMI interface. The reference pictures, decoded

pictures, and motion vectors for each reference picture are stored in the external

memory. All the data access to external memory will go through the memory

controller.

32-bit AHB Control Bus

Memory Controller S SDRAM 0 CABAC CAVLC S

128-bit AHB Data Bus

Bit-stream FIFO ARM 9 CPU M Instruction Memory Data Memory IQ/IDCT S MB Texture Buffer MB Motion Buffer Data Fetch S Intra/Inter Prediction S Subblock Reconstruct Buffer DeBlocking S,M IIP FIFO DB FIFO DeInterlacer S,M DI FIFO

SDRAM 1 SDRAM 2 SDRAM 3

Harddware Input Interface

M,M Sync FIFO HDMI Interface S u b b l o c k P r o c e s s i n g U n i t NAL Parsing

Fig. 3.4 System architecture diagram

Inside the chip, there is an embedded CPU and two AHB buses, which are

(47)

data bus is used by Data Fetch, De-blocking and De-interlacer for data transfer

between these modules and external memory. In addition to the AHB buses, there are

backdoor-to-backdoor connections between modules. The modules connected by

backdoor channel make up a video pipe, where its input comes from bit-stream FIFO

and its output is drive to the HDMI interface. Particularly, the data between modules

are exchanged on block by block basis with block size being 8x8 except for CABAC.

In this system, the headers of sequence, picture, and slice are parsed by CPU.

The data below slice level are processed in the video pipe. According to the

information in sequence, picture, and slice headers, CPU can configure the modules in

the video pipe through the control bus for various decoding modes. Each module can

also be independently tested by CPU. The following briefly describes our video pipe

and system schedule.

3.3.2.1 Video Pipe

In our H.264 decoder, our video pipe contains seven modules which are CABAC,

IQ/IDCT, Data Fetch (DF), Intra-Inter prediction (IIP), De-Blocking, and

De-Interlacer. CABAC is the first module in video pipe and the functionality of

CABAC is decoding the bit-stream syntax below slice header. Because our decoder

processes luma and chroma components in parallel and CABAC can not decode

chroma components until all luma coefficients have decoded in one macroblock, it is

more efficient to make CABAC operate in macroblock level. For saving the buffer

size, all other modules operate in 8x8 block.

After CABAC the pipeline process to IQ/IDCT module and DF module. The

IQ/IDCT module does the inverse quantization and inverse discrete cosine transform

(48)

reference block for Inter block and intra prediction type decoding for Intra block.

Following after the DF is IIP which produces the value of prediction block for Intra

and inter prediction and adds the results with the residuals from IQ/IDCT. Because we

let IQ/IDCT start earlier one eight by eight block than IIP, it can make sure that IIP

always has the corresponding residuals.

After that, de-blocking is performed for reducing the blocking effect. For

keeping the data order correct, there are three eight by eight block buffers between

IQ/IDCT, IIP, and De-Block. For example, when IQ/IDCT is writing the third buffer

and IIP is writing the second buffer, De-Block is reading the first block buffer. In

addition, because de-blocking has specific process order in macroblock level but our

process unit is eight by eight block, we must change the process order shown in

Figure 3.2 but still keep the rule in specification. In Figure 3.2, the number means the

process order in one block. Fig. 3.5 present the eight by eight block order in zig-zag

scan of one macroblock.

3

1

5

6

2

4

2

1

5

3

8

4

6

9

7

10

3

4

1

2

5

6

4

8

1

2

5

3

6

9

7

10

Fig. 3.5 De-blocking process order

The last module is De-Interlacer which will work only when the source sequence

is field. The functionality of De-Interlacer is to translate a field picture to a frame

picture. Because the algorithm of De-Interlacer will use the previous, current, and

next one field in display order, it will also cost bus bandwidth. Besides, the fields for

(49)

size of external memory to store the data.

3.3.2.2 System Schedule

Without a system schedule to control data flow in video pipe, it is possible that

the functionality of whole system is wrong even if every module is well verified. This

section describes briefly our system schedule shown in Table 3.3. In the beginning of

decoding, we need an initial period to decoding bit-stream above slice header and to

set the control registers of every hardware modules by ARM CPU. After that, all

hardware modules start to decode one by one. When CABAC starts to decode the

second macroblock in current slice, the first macroblock is fed into the following

modules in eight by eight block unit. As a result, IQ/IDCT and DF, IIP, and DeBlock

process the different eight plus eight block.

During the slice changes, the CPU will detect the NAL unit first and start to

decode the slice header. When CPU is decoding the slice, the hardware may stall

depending on the time the NAL unit is detected, slice header decoding speed of CPU,

and block decoding speed of hardware modules. Table 3.3 shows the condition that

hardware modules will stall. In addition, when the picture changes, there is same

condition as slice.

Looking into the hardware pipe, every module has different process time. As a

result, when a module finish current block, it does not mean that the module can

process next block right now because the input data may be not ready and it may over

write the data which next stage is using. In our decoder design, if one module is

finished, it will not start again until all hardware modules are finished. Thus, for

supporting real-time decoding, all modules must make sure that they can finish their

(50)

(51)

3.3.2.3 System Modeling

For transaction design at transaction level, Fig. 3.6 shows the concept about how

we implement all modules with sc_thread. First, we define the input and output ports

connecting with other modules. All data transmission is through channel by calling

the interface functions of ports. In the program part, the module will read input data

through input channel in the beginning. Then, the finite state machine decides which

path th program will go through. In the main program, the functionality can be

composed of one or more functions and we can decide to use wait function after the

end of every function or only use once after all functions are finished. The more wait

function we use, the more program switches happen. That will influence the

simulation speed. After all work is done, the module will send a finish signal to

synchronization channel and stall until all modules finish their work.

DeBlockPP CABAC IQIDCT IIP0 IIP1 DeBlock DeInterlace AHB AHB DeInterlacePP CH2 CH3 CH4 CH5 CH 6 CH 7 ARM

(52)

3.4 Memory Controller in H.264 Decoder

There are three main modules in the H.264 decoder will access external memory

to receive or give data. These three modules are inter prediction (motion

compensation) module, de-blocking module, and de-interlace module. The

relationship between these modules and external memory is shown in Fig. 3.7.

AMBA BUS M emory Controller BUFFE R Inter block(MC ) BUFFE R De-block BUFFE R De-interlace DRAM pad

External memory

Fig. 3.7 Relationship between H.264 decider and external memory

The functionality of de-block module is writing the re-construct block into the

external memory, all the decoded block will be stored in the external memory through

de-blocking module.

The inter prediction module is used to do the motion compensation for the

inter-block. The data fetch module obtains the motion vectors from CABAC block

(53)

first-in-first-out (FIFO) buffer before entering the inter prediction module.

In order to display the picture in field mode, a de-interlace module is need to

read the decoded field stored in the external memory and de-interlace the fields.

The H.264 decoder needs a memory controller to communicate with external

memory. Due the complexity of H.264/AVC decoder at level 4 we need a memory

controller that not only keep the correctness of data but also can help the whole

system to achieve better performance.

Fig. 3.8 Architecture of the memory controller in H.264 decoder

Fig. 3.8 shows the architecture of my memory controller inside the H.264

decoder. This memory controller combines two layers together. The Layer 0 is the

external memory interface (EMI) which is used to control the external memory. The

(54)

decoder. The Layer 0 part (EMI) will receive the addresses translated from Layer 1

(ATM) part and generate the appropriate commands to external memory.

The purpose of layer 0 (EMI) is to communicate with external memory. The EMI

is designed for various kind of system. As a result, the EMI is independent of the

system but dependent of the external memory. In order to make this EMI can fit

variable type of external memory. The configurability of this EMI is very essential.

The detail of EMI will be introduced in the Chapter 4.

Layer 1 is an address translation machine (ATM). It exploits the characteristics

of the video process to make the accesses become more regular. The Layer 1 is

designed for the memory subsystem. The detail of Layer 1 and the relationship

(55)

Chapter 4 External Memory Interface

In the design of SOC system, it usually needs an off-chip memory to storage the

large amount of data. The external memory interface is used to communicate with the

external memory for the system as shown in Fig. 4.1. To deal with tremendous data

transfer and storage in H.264 decoder, the external memory must provide high data

bandwidth to achieve the real time request. The bandwidth of the external memory is

limited due to the pin number of I/O is finite. Accordingly the external memory

interface must provide high data bandwidth utilization by using some techniques. The

proposed external memory interface will be introduced in this chapter.

(56)

4.1 Concept of EMI

The external memory interface (EMI) is one of the Layers in this memory

controller, that is, layer 0. EMI will receive the physical addresses from the address

translation machine or the functional block, and use the addresses to access the

DRAM. This external memory interface is designed to control the external memory,

and it will directly connect to the DRAM as show in Fig. 4.2.

Address & Write

data Layer 0 : EMI

Addr/cmd

DRAM

Data

Fig. 4.2 Connection of EMI

The external memory interface will generate the appropriate commands to

external memory (DRAM). The external memory can only accept the commands that

it defined in the specification, and the commands have been introduced in chapter2.

The design of my EMI is configurable for different kinds for external memory.

Users only have to set the specific parameter of the external memory that they use,

and this EMI will translate the accesses into the commands. Thus the users can

conveniently use this EMI to access the external memory (DRAM).

(57)

restriction on the design of SOC. In order to increase the performance of the whole

system this EMI is designed to achieve higher performance. The whole architecture

will be introduced in next section.

4.2 Architecture of EMI

The detail architecture is shown in Fig. 4.3. It contains five fundamental parts

inside the EMI. The input data and address will stored in the FIFO and generate the

appropriate command through the mode control block and the FSM block. Then the

commands will be send to DRAM. The detail of these parts will be described in the

following. Command FIFO W-Data FIFO (BA,ROW,COL,BL) Write Data (BA,ROW.COL,BL)

FSM

WRITE Data Mode control Timing

Checker NOP Counter

Command DRA M INTE RFA CE

External

Memory

ADDR DATA WRITE Data Scheduler R-Data FIFO

Read data Data

適用於H.264解碼器的可調式雙層外部記憶體管理器

國

立

交

通

大

學

電子工程學系 電子研究所碩士班

碩

士

論

文

適用於 H.264 解碼器的可調式雙層外部記憶體管理器

A Flexible Two-Layer External Memory Management for

H.264/AVC Decoder

研 究 生：張長軒

指導教授：黃 威 教授

適用於 H.264 解碼器的可調式雙層外部記憶體管理器

研 究 生：張長軒 Student：Chang-Hsuan Chang

指導教授：黃 威 教授 Advisor：Prof. Wei Hwang

國 立 交 通 大 學

電 子 工 程 學 系 電 子 研 究 所

碩 士 論 文

適用於

H.264 解碼器的可調式雙層記憶體管理器

學生：張長軒 指導教授：黃 威 教授

國立交通大學電子工程學系電子研究所

摘 要

A Two-Layer Flexible External Memory Management for

H.264/AVC Decoder

Student : Chang-Hsuan Chang

Advisors : Prof. Wei Hwang

Department of Electronics Engineering & Institute of Electronics

National Chiao-Tung University

ABSTRACT

Acknowledgements

Contents

Chapter 1 Introduction...1

Chapter 2 Overview of External Memory Organization ...5

Chapter 3 Transaction Level Modeling of H.264/AVC Decoder ...25

Chapter 4 External Memory Interface ...45

Chapter 5 Address Translation Machine & Memory Subsystem for

H.264 Decoder ...71

Chapter 6 Conclusions and Future work ...101

List of Figures

List of Tables

Chapter 1

Introduction

1.1 Background

1.2 Motivation

1.3 Organization

Chapter 2

Overview of External Memory

Organization

2.1 DRAM characteristic

2.1.1 DRAM architecture

2.1.2 DRAM command and operation

2.2 Past DRAM controller techniques

2.2.1 Techniques and Improvement

2.3 Modern DRAM Development

2.3.1 Bandwidth

2.3.2 Latency

2.3.3 Power

2.4 Future trend of DRAM

Chapter 3

Transaction Level Modeling of

H.264/AVC Decoder

3.1 Overview of H.264

3.1.1 Introduction

3.1.2 Characteristic of H.264

3.2 Transaction Level Modeling

3.2.1 Introduction of TLM

(a)

(b)

3.2.2 Design flow with TLM

3.3 Transaction Level Modeling of H.264 Decoder

3.3.1 Specification

3.3.2 System Architecture of H.264 decoder

3.3.2.1 Video Pipe

3

電子工程學系電子研究所碩士班

研究生：張長軒

指導教授：黃威教授

研究生：張長軒 Student：Chang-Hsuan Chang

指導教授：黃威教授 Advisor：Prof. Wei Hwang

國立交通大學

電子工程學系電子研究所

碩士論文

學生：張長軒指導教授：黃威教授

摘要