
Chapter 2 Overview of External Memory Organization

2.4 Future trend of DRAM

DDR-3

Besides DDR-2, JEDEC announced the draft of the DDR-3 industry standard in October 2002. DDR-3 SDRAM, which is due to enter volume production in 2006, operates from a 1.5 V supply and can transfer data at up to 1,066 Mbit/s. That is twice as fast as DDR-2, which is just coming onto the market in volume, and four times the speed of DDR. Samsung said it plans to make the chips using an 80 nm manufacturing process and, citing data from research house IDC, believes DDR-3 will account for 65 percent of the market in 2009.

DDR-3 has several features that improve performance and cost-effectiveness [24]. In the 8b-prefetch data path with 3-stage pipelining, a newly devised hybrid-type latency control scheme and 2-step multiplexing efficiently handle up to 128 bits of parallel data in the x16 configuration. An efficient protocol for temperature read-out is proposed, which helps the CPU control heat during high-speed operation. Per-bank refresh is another experimental feature of this prototype; it virtually removes the loss of memory bandwidth caused by the refresh operations that all DRAM requires.
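The 128-bit figure follows from multiplying the prefetch depth by the I/O width. The short Python sketch below reproduces that arithmetic; the function name and the x4 and x8 widths are illustrative assumptions, and only the 8b prefetch and the x16 case come from the text.

```python
# Width of the internal parallel data path implied by an n-bit prefetch.
# PREFETCH_DEPTH = 8 is the 8b-prefetch stated in the text; the function
# name and the x4/x8 widths are illustrative assumptions.
PREFETCH_DEPTH = 8


def internal_data_width(io_width_bits, prefetch=PREFETCH_DEPTH):
    """Bits moved in parallel inside the DRAM for one column access."""
    return io_width_bits * prefetch


for io_width in (4, 8, 16):
    print(f"x{io_width}: {internal_data_width(io_width)} bits in parallel")
```

For the x16 configuration this gives 8 x 16 = 128 bits, which matches the parallel data width quoted above.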

Embedded DRAM

Embedded DRAM (EDRAM) technology offers significant advantages in performance, area, bandwidth, and power consumption by combining a high-bandwidth DRAM macro with logic/analog circuits on the same chip.

EDRAM macros have been proposed as a way to achieve the low power and wide bandwidth required by graphics controllers, network systems, and mobile systems. For example, advanced 3D graphics (3DG) technology will be used in console game machines [25], and a rendering controller chip that can handle real-time 3D animation in true color is desired. EDRAM technology [25] attracts the attention of 3DG system designers because only EDRAM can satisfy the required data rate. Fig. 2.17 shows the trend of 3DG controller performance.

In portable devices such as mobile phones and PDAs, power consumption is a very important issue. In an off-chip design, the power consumed by the off-chip interconnection is one or two orders of magnitude larger than that of the standalone DRAM itself. In an SOC design the situation changes: the power consumed by the on-chip interconnection falls to the same order as that of the EDRAM, because the on-chip I/O capacitance is greatly reduced. The power spent on interconnection is therefore saved.

Embedded DRAM technology thus has significant advantages in SOC design, such as low power consumption and high bandwidth. However, the process gap between commodity DRAM and logic makes it difficult to build highly reliable EDRAM products at low cost and high yield, and other problems in EDRAM design still need to be overcome. Nevertheless, the adoption of EDRAM is likely to become a trend in SOC design.

Fig. 2.17 The required data rate in 3DG engine

Chapter 3 Transaction Level Modeling of H.264/AVC Decoder

The proposed controller is designed for video processing, which needs a large amount of data storage. We have a project to design a hardware architecture for an H.264 decoder that conforms to High Profile at Level 4 (HP@L4). Because of the complexity of the H.264 decoder, it needs a memory controller to communicate with external memory efficiently, and my memory controller is integrated into the H.264 decoder system design in cooperation with that project. This chapter introduces the characteristics of H.264 and the system design of this H.264 decoder, and describes the role of my memory controller in the decoder system.

3.1 Overview of H.264

3.1.1 Introduction

H.264/AVC is the newest international video coding standard. Relative to prior video coding methods such as MPEG-2 video, H.264/AVC has higher coding efficiency. An increasing number of services and the growing popularity of high-definition TV are creating greater needs for higher coding efficiency. Moreover, other transmission media such as cable modem, xDSL, or UMTS offer much lower data rates than broadcast channels, and enhanced coding efficiency enables the transmission of more video channels or higher-quality video representations within existing digital transmission capacities.

The scope of the standardization is illustrated in Fig. 3.1, which shows the typical video coding/decoding chain (excluding the transport or storage of the video signal). Only the central decoder is standardized. The standard imposes restrictions on the bit-stream and syntax and defines the decoding process of the syntax elements, so that every decoder conforming to the standard produces similar output when given an encoded bit-stream that conforms to the constraints of the standard.

The new standard is designed for many application areas, such as the following:

• Broadcast over cable, satellite, cable modem, DSL, terrestrial, etc.

• Interactive or serial storage on optical and magnetic devices, DVD, etc.

• Conversational services over ISDN, Ethernet, LAN, DSL, wireless and mobile networks, modems, etc.

• Video-on-demand or multimedia streaming services over ISDN, cable modem, DSL, LAN, wireless networks, etc.

• Multimedia messaging services (MMS) over ISDN, DSL, LAN, wireless and mobile networks, etc.

Fig. 3.1 Scope of video coding standardization

3.1.2 Characteristic of H.264

The coding gain of H.264 is achieved by efficiently exploiting spatial and temporal redundancy. For better temporal prediction, new coding tools such as long-term prediction, multiple reference frames, motion compensation with variable block size, in-loop filter, and 1/4-pel motion compensation are developed. In addition, for exploiting spatial redundancy, an intra prediction technique is adopted. Further, to reduce bit rate, a context-adaptive entropy coder is deployed. The following briefly summarizes the features of each coding tool.

• Long-Term Prediction: The prediction of a picture can refer to a prior coded picture that is not right before the current one. For sequences with periodic content, long-term prediction offers coding gain by having more flexibility on the selection of reference picture.

• Motion Compensation with Variable Block Size and Multiple Reference Frames: Motion compensation can be done by partitioning a macroblock into a number of sub-blocks, and each sub-block can refer to any of a larger number of pictures that have been coded and stored. Variable block size and multiple reference frames offer a better trade-off between texture and motion information as well as better adaptation to macroblocks with varying characteristics.

• 1/4-pel and 1/8-pel Motion Compensation: The prediction can come from 1/4-pel samples (or 1/8-pel for chroma) that are generated by using the interpolation with full-pel samples as input. The sub-pel motion compensation with higher accuracy improves the prediction efficiency by reducing the aliasing from sampling.

• Intra Prediction: An intra-coded block can be predicted from the edges of the adjacent and previously-coded blocks. Particularly, the prediction can come from different directions.

• Transform with Variable Block Size: The 4x4 integer transform and the 8x8 DCT can be adaptively selected for a macroblock. The 4x4 integer transform can remove ringing artifacts, while the 8x8 DCT provides higher coding efficiency for smooth areas. In addition, a double transform can be applied to the DC coefficients belonging to the 16 4x4 blocks within a macroblock.

• Context-Adaptive Entropy Coding: The entropy coding is done in a context-adaptive manner. The value of prior coded syntax elements (or bins) could be used to select the probability model or table for the coding of following syntax elements (or bins). Higher coding efficiency is achieved by using conditional probability models.

• In-loop De-Blocking Filter: A de-blocking filter is placed in the prediction loop to remove the blocking artifact for the reference picture so as to improve the quality of the reference picture and prediction efficiency.
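As an illustration of the sub-pel motion compensation item above, the following Python sketch interpolates luma samples along one row. It assumes the 6-tap half-sample filter (1, -5, 20, 20, -5, 1) and the rounded-average quarter-sample rule of the H.264 luma interpolation process, which the text above does not spell out; the function names are hypothetical and the 1-D case is a simplification of the full 2-D process.

```python
def clip255(value):
    """Clamp a result to the 8-bit sample range."""
    return max(0, min(255, value))


def half_pel(e, f, g, h, i, j):
    """Half-sample value between g and h from six full-pel neighbours.

    Applies the 6-tap filter (1, -5, 20, 20, -5, 1), then rounds and
    divides by 32.
    """
    return clip255((e - 5 * f + 20 * g + 20 * h - 5 * i + j + 16) >> 5)


def quarter_pel(a, b):
    """Quarter-sample value: rounded average of two neighbouring samples."""
    return (a + b + 1) >> 1


# One row of full-pel luma samples containing a sharp edge.
row = [10, 10, 10, 200, 200, 200]
b = half_pel(*row)           # half-pel position between row[2] and row[3]
q = quarter_pel(row[2], b)   # quarter-pel position between row[2] and b
print(b, q)                  # prints: 105 58
```

A decoder needs the five extra full-pel samples around each block to evaluate this filter, which is one reason why reference-block fetching from external memory reads more data than the block itself.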

Because more correlations are exploited for coding, stronger data dependencies exist between successive computations and more buffers are required.

Moreover, the very different types of predictors imply that intensive computations are inevitable. Also, the heterogeneous building blocks and operations bring new challenges to a system design such as synchronization, data flow control, error handling, buffering, software/hardware concurrency, and so on. With these design challenges, the SoC implementation for H.264 codec becomes much more difficult than prior coding standards. Due to the complexity of H.264, a proper top-level architecture is the key to shorten design cycle and increase chances of first-time

regression would be time-consuming and the cost would be significant if the system architecture has any errors. The following introduces a new SoC design philosophy, transaction level modeling, which allows us to explore the design space at system level by providing a trade-off between implementation details and simulation accuracy.

3.2 Transaction Level Modeling

3.2.1 Introduction of TLM

As described in the previous section, the traditional design methodology cannot satisfy the needs of complex system design. The reason is that the system-level model captures many unnecessary implementation details, so simulation can be so slow that system-level verification cannot be done thoroughly. Recently, a modeling technique called transaction level modeling (TLM) has been proposed for system-level modeling. The idea is to add another level of abstraction between the system specification and its RTL implementation so that unnecessary implementation details can be hidden from the system-level model.

As far as the system is concerned, the implementation details of each component are not the most important concern in the early development phase. Instead, system parameters such as the partition of the tasks, the functionality of each component, the topology that connects different components, the communication protocol between components, and the memory hierarchy are of more interest.

There are four types of TLM: (1) the PE-assembly model, (2) the bus-arbitration model, (3) the cycle-accurate computation model, and (4) the time-accurate communication model.

Fig. 3.2 (a) shows the system models at different levels of abstraction. The top level is the specification model and the bottom level is the implementation model. The marked levels between the specification model and the implementation model are the transaction levels. According to its modeling accuracy in computation and communication, each model represents an operating point in Fig. 3.2 (b), where the bottom-left corner stands for the specification of the system while the top-right corner denotes the detailed implementation at register-transfer level. Only four models, namely the PE-assembly model, the bus-arbitration model, the time-accurate communication model, and the cycle-accurate computation model, are considered TLM.

Fig. 3.2 System model at different levels of abstraction

Table 3.1 indicates the characteristics of different system models. As shown, different models capture different degrees of accuracy in computation and communication. The specification model and the implementation model represent the two extreme cases, where the system model specifies the functionality of the system

The models in between are the four types of TLM, which offer flexibility in selecting the simulation accuracy and speed.

Table 3.1 Characteristics of different models

| Models | Communication time | Computation time | Communication scheme | PE interface | Added implementation detail |
|---|---|---|---|---|---|
| Specification model | no | no | variable | (no PE) | |
| PE-assembly model | no | approximate | message-passing channel | abstract | PE allocation, process PE mapping |
| Bus-arbitration model | approximate | approximate | abstract bus channel | abstract | bus topology, bus arbitration |
| Time-accurate communication model | time/cycle accurate | approximate | detailed bus channel | abstract | detailed bus protocol |
| Cycle-accurate computation model | approximate | cycle accurate | abstract bus channel | pin accurate | RTL/ISS PEs |
| Implementation model | cycle accurate | cycle accurate | wire | pin accurate | detailed bus protocol or RTL/ISS PEs |

3.2.2 Design flow with TLM

The traditional design flow cannot ensure the quality of the design when system complexity increases dramatically. This section presents a new SoC design flow with TLM as the platform for concurrent software and hardware development. The new design flow comprises two main parts: (1) the new system-to-RTL extension and (2) the traditional RTL-to-layout flow. The first part differs from what was used in the past, while the second part remains the same.

Fig. 3.3 Design flow with transaction level modeling

Fig. 3.3 shows the new system-to-RTL extension. After the specification is defined, the system architecture is developed and verified by using TLM. Once the TLM model is complete, it serves as the single reference for both the software and hardware teams. The software team develops and verifies the embedded software on the TLM model. For the hardware team, the TLM model serves as the golden model for the detailed implementation. As software and hardware development proceeds, the TLM model can be annotated with more accurate timing information. Consequently, not only the functionality but also the timing can be verified together. Unlike the traditional design flow, the new design flow performs system integration and verification from the very beginning, which is the key to ensuring the quality of the design. The following summarizes the functions of TLM in the SoC design flow:

1. Verification model for design space exploration.

2. Platform for early software development.

3. Specification and golden model for hardware development.

At present, EDA tools are still not capable of automatically converting a TLM model into a detailed hardware implementation. Hardware refinement is still done through a traditional paper specification and RTL coding, so TLM may appear to be extra and unnecessary work. However, it brings many benefits that significantly reduce the time to market:

1. System integration at the early stages so that the potential problems can be found and solved earlier.

2. Faster simulation speed while maintaining the accuracy of simulation.

3. Concurrent software and hardware development.

4. Platform for software/hardware co-design and co-verification.

5. Incremental hardware refinement and implementation details by means of hybrid abstraction level modeling.

For the implementation of TLM, we adopt the SystemC library. SystemC has three major components: the process, the channel, and the interface function. The processes define the operations of a component and can be triggered by a set of predefined events. The channel specifies the connections among different components, and the interface function provides the means for a component to communicate with the others. Within a component, the processes can call the interface functions for data transactions without knowing how the interface functions are implemented. Consequently, with well-defined interface functions, the communication part and the computation part can be developed and refined independently.
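The separation of process, channel, and interface function can be illustrated with a minimal sketch. It is written in plain Python rather than SystemC, and every name in it (TransferIf, FifoChannel, producer, consumer) is hypothetical; it only shows that a process written against the interface functions does not depend on how the channel is implemented.

```python
from abc import ABC, abstractmethod
from collections import deque


class TransferIf(ABC):
    """Interface functions: the only view a process has of communication."""

    @abstractmethod
    def write(self, data):
        ...

    @abstractmethod
    def read(self):
        ...


class FifoChannel(TransferIf):
    """One possible channel: an untimed FIFO that counts transactions.

    A more detailed channel (for example one that models bus arbitration)
    could replace it without touching the processes below.
    """

    def __init__(self):
        self._queue = deque()
        self.transactions = 0

    def write(self, data):
        self._queue.append(data)
        self.transactions += 1

    def read(self):
        return self._queue.popleft()


def producer(port, blocks):
    """Process: sends blocks through the interface only."""
    for block in blocks:
        port.write(block)


def consumer(port, count):
    """Process: receives blocks and does some computation on them."""
    return [port.read() * 2 for _ in range(count)]


channel = FifoChannel()
producer(channel, [1, 2, 3])
result = consumer(channel, 3)
print(result, channel.transactions)   # prints: [2, 4, 6] 3
```

Because the two processes touch only `write` and `read`, the communication side can be refined toward a timed or pin-accurate channel while the computation side stays unchanged.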

Because of these benefits, TLM will become a critical step in the SoC design flow. The next section introduces the system architecture of our H.264 video decoder, which is designed by using TLM.

3.3 Transaction Level Modeling of H.264 Decoder

In most video coding standards, different profiles and levels support different coding tools and parameters, such as transform8x8, the supported frame size, and MBAFF. As a result, the profile and level definition strongly affects the complexity and cost of the overall system architecture. This section first describes the profile and level of our H.264 decoder and then introduces the system architecture modeled in TLM.

3.3.1 Specification

The H.264 standard defines many profiles and levels, each with a different set of coding tools for improving coding efficiency. Decoders designed for different profiles therefore differ in performance and cost. This section presents the design specification of a decoder conforming to High Profile at Level 4. Any bit-stream conforming to Main or High Profile at a level lower than or equal to 4 shall be decoded. Specifically, the decoder supports a decoding throughput of up to 1920x1080i@60Hz. Some properties of High Profile and the level limits are listed below.

1. Only I, P, and B slice types may be present.

2. No data partition.

3. Arbitrary slice order is not allowed.

4. No slice group and no redundant picture.

5. chroma_format_idc in the range of 0 to 1.

6. bit_depth_luma_minus8/bit_depth_chroma_minus8 equal to 0 only.

7. qpprime_y_zero_transform_bypass_flag equal to 0 only.

8. Up to 16 reference frames. (32 reference fields).

9. Vertical motion vector range does not exceed MaxVmvR as in Table 3.2.

10. Horizontal motion vector range does not exceed the range of -2048 to 2047.75.

11. Up to 32 MVs per MB.

12. Number of bits per macroblock is not greater than 3200.
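The 1920x1080i@60Hz target can be checked against the Level 4 limits with a short calculation. The Python sketch below assumes the Level 4 values MaxFS = 8192 macroblocks and MaxMBPS = 245760 macroblocks per second from the level-limit table of the H.264 standard, which are not quoted in the text above, and it treats 60 fields per second as 30 frames per second. The function names are hypothetical.

```python
import math

MB_SIZE = 16            # a macroblock covers 16x16 luma samples

# Level 4 limits assumed from the H.264 level-limit table.
MAX_FS_MBS = 8192       # maximum frame size in macroblocks
MAX_MBPS = 245760       # maximum macroblock processing rate per second


def macroblocks_per_frame(width, height):
    """Number of macroblocks in a frame, rounding each dimension up."""
    return math.ceil(width / MB_SIZE) * math.ceil(height / MB_SIZE)


def macroblock_rate(width, height, frames_per_second):
    """Macroblocks the decoder has to process per second."""
    return macroblocks_per_frame(width, height) * frames_per_second


# 1080i at 60 fields per second is treated as 30 frames per second.
frame_mbs = macroblocks_per_frame(1920, 1080)   # 120 * 68 = 8160
rate = macroblock_rate(1920, 1080, 30)          # 8160 * 30 = 244800

print(frame_mbs, frame_mbs <= MAX_FS_MBS)       # prints: 8160 True
print(rate, rate <= MAX_MBPS)                   # prints: 244800 True
```

Under these assumptions the target format sits just below both limits, so the video pipe and the memory controller have to sustain close to the maximum macroblock rate allowed at Level 4.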

Moreover, Table 3.2 shows further constraints of the different levels. With these profile and level limits summarized, we can design the micro-architecture of our H.264 decoder to satisfy all the required functionality while minimizing the cost.

Table 3.2 Level limits

3.3.2 System Architecture of H.264 decoder

Fig. 3.4 shows the overall architecture of this system, which is developed on the ARM platform. For the chip I/O, the compressed bit-stream is input via a hardware interface, which communicates with the host through a bridge, and the decoded frames are output to the monitor via the HDMI interface. The reference pictures, decoded pictures, and motion vectors for each reference picture are stored in the external memory. All data accesses to the external memory go through the memory controller.

Fig. 3.4 System architecture diagram

Inside the chip, there are an embedded CPU and two AHB buses, a control bus and a data bus. The control bus is used by the CPU for data flow control, and the data bus is used by Data Fetch, De-blocking, and De-interlacer for data transfer between these modules and the external memory. In addition to the AHB buses, there are backdoor-to-backdoor connections between modules. The modules connected by backdoor channels make up a video pipe, whose input comes from the bit-stream FIFO and whose output is driven to the HDMI interface. In particular, data are exchanged between modules block by block, with a block size of 8x8 except for CABAC.

In this system, the sequence, picture, and slice headers are parsed by the CPU. The data below slice level are processed in the video pipe. According to the information in the sequence, picture, and slice headers, the CPU configures the modules in the video pipe through the control bus for the various decoding modes. Each module can also be tested independently by the CPU. The following briefly describes our video pipe and system schedule.

3.3.2.1 Video Pipe

In our H.264 decoder, the video pipe contains seven modules, which are CABAC, IQ/IDCT, Data Fetch (DF), Intra-Inter Prediction (IIP), De-Blocking, and De-Interlacer. CABAC is the first module in the video pipe; it decodes the bit-stream syntax below the slice header. Because our decoder processes luma and chroma components in parallel, and CABAC cannot decode the chroma components of a macroblock until all of its luma coefficients have been decoded, it is more efficient for CABAC to operate at macroblock level. To save buffer space, all other modules operate on 8x8 blocks.

After CABAC, the pipeline proceeds to the IQ/IDCT module and the DF module. The IQ/IDCT module performs the inverse quantization and inverse discrete cosine transform of the residuals, while the DF module is responsible for motion prediction and reference-block fetching for inter blocks and for intra-prediction-type decoding for intra blocks.

After the DF comes the IIP, which produces the prediction block for intra and inter prediction and adds it to the residuals from IQ/IDCT. Because we
