Architecture of bandwidth-efficient Memory Controller


4.2 Memory Controller Organization

4.2.4 Architecture of bandwidth-efficient Memory Controller

In this section, we show the block diagram of the bandwidth-efficient memory controller architecture for H.264/AVC in Figure 4.15. The dotted area between the video decoder and the SDRAM is our proposed memory controller. It consists of the data buffer, command queue, bank controller, command arbiter, address translator, and memory interface scheduler (MIS). In addition, our design includes a flexible address generator, which serves the different modules of the video decoder. In H.264/AVC, only the motion compensation, de-blocking filter, and direct coding units need to access external memory through the memory controller. The direct coding unit reads co-located motion vectors to perform direct-mode prediction. The motion compensation unit reads reference pixels from SDRAM to interpolate the current pixels. The de-blocking filter writes the completed pixels into SDRAM. The major units are introduced in turn below.

Figure 4.15 Architecture of bandwidth-efficient memory controller for H.264/AVC

➢ Multiple-Channel Address Generator and Scheduler

In an H.264/AVC decoder, three main processing units (PUs) require read/write accesses to SDRAM. Most applications employ a single-channel memory controller, because area and cost pressure leads to a single, shared off-chip SDRAM. How the PUs are connected to the shared SDRAM has to be decided carefully because it strongly affects SDRAM efficiency. Traditional memory controllers are often connected through a shared bus. Although this is economical in area and cost, the shared bus makes it hard for the SDRAM to deliver sufficient performance for increasingly complicated applications. Another issue is how to satisfy the different latency and bandwidth requirements of the PUs in the decoder system. In addition to offering better performance for the H.264/AVC decoder than a single-channel SDRAM controller, a multiple-channel SDRAM controller can schedule memory accesses from different channels to fit the system requirements for SDRAM performance.

Figure 4.16 Multi-channel address generator

Since the memory controller can be applied to different modules in the H.264/AVC decoder, we propose a multiple-channel address generator and scheduler (MCAGS) that connects to the different PUs individually. The MCAGS can serve as many modules as required. In the H.264/AVC decoder, the MCAGS enables three channels, for the motion compensation, de-blocking filter, and direct mode coding modules, as shown in Figure 4.16. The MCAGS produces the logical addresses before they reach the memory controller. Because motion compensation and direct mode coding use different data types, the MCAGS generates two kinds of logical addresses: pixel addresses and motion vector addresses. Each address is calculated according to the output ordering of the module. After scheduling, the output of the address generator is sent to the dynamic logical-to-physical address translator in the memory controller.

➢ Dynamic Logical to Physical Address Translator

The goal of the dynamic logical-to-physical address translator is to convert a logical address into a physical address (Row, Bank, Column) in SDRAM. The motion vectors and the frame pixels are placed in different partitions, and the motion vectors are always allocated in the first partition. According to the physical address, the memory controller reads or writes data at the corresponding location.
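As a minimal sketch of such a translation, the Python fragment below splits a flat word address into (row, bank, column) bit fields. The field widths (2 bank bits, 8 column bits), the bit order, the partition size, and the names `translate` and `MV_PARTITION_WORDS` are assumptions made for illustration and are not taken from the proposed design. The mapping shown is static, whereas the proposed translator is dynamic. The only property carried over from the text is that motion vectors occupy the first partition and pixel data follow it.

```python
# Hypothetical geometry (assumption): 4 banks, 256 columns per row.
BANK_BITS = 2
COL_BITS = 8
# Assumed size of the motion-vector partition, in words.
MV_PARTITION_WORDS = 1 << 12


def translate(logical_addr, is_motion_vector):
    """Map a logical word address to a (row, bank, column) tuple."""
    if is_motion_vector:
        if logical_addr >= MV_PARTITION_WORDS:
            raise ValueError("motion-vector address outside the first partition")
        flat = logical_addr
    else:
        # Pixel data are placed after the motion-vector partition.
        flat = MV_PARTITION_WORDS + logical_addr
    column = flat & ((1 << COL_BITS) - 1)
    bank = (flat >> COL_BITS) & ((1 << BANK_BITS) - 1)
    row = flat >> (COL_BITS + BANK_BITS)
    return row, bank, column
```

Placing the bank bits directly above the column bits means that consecutive rows of data fall into different banks, which is one common way to give a scheduler more opportunities to overlap accesses; whether the proposed design uses this ordering is not stated in the text.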

➢ Command and Address Queue

Because SDRAM accesses have long latency, a module that issues a request may waste many cycles waiting for the data. The design should therefore prevent the decoder from stalling for many cycles. For this reason, we design a command queue that stores the incoming READ and WRITE commands from the decoder. The command queue can hold 7 READ or WRITE commands and issues them sequentially to the memory controller according to their incoming priorities. The command queue is a first-in-first-out structure ordered by incoming priority. The advantage of the command queue is that a module needs only one cycle to issue a command. The module can then continue with other processing instead of wasting additional cycles waiting for data.

Figure 4.17 Command and address queue and access status detection (labels in the figure: Row addr, Column addr, Bank addr, operation Mode, reset)

In addition, the address queue holds the incoming addresses (Bank, Row, Column), while the command queue holds the incoming commands. The decoder sends a request and its address to the memory access controller only when the read address queue is not full. The “full” signal reflects the status of this queue. The proposed address queue also compares the incoming address with the previous one to detect row-hit and bank-hit situations.
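The queue behaviour described above can be sketched as follows. This is a simplified software model, not the hardware design: the class name `CommandAddressQueue` and its methods are hypothetical, the priority handling mentioned in the text is omitted (entries leave in arrival order), and "row-hit" is modelled simply as the row field matching that of the previous command. Only the depth of 7, the "full" signal, and the comparison with the previous address come from the text.

```python
from collections import deque

QUEUE_DEPTH = 7  # the text states that the queue holds 7 READ or WRITE commands


class CommandAddressQueue:
    """FIFO of (op, bank, row, column) entries with hit detection on push."""

    def __init__(self):
        self._entries = deque()
        self._last = None  # (bank, row) of the most recently accepted command

    @property
    def full(self):
        return len(self._entries) >= QUEUE_DEPTH

    def push(self, op, bank, row, column):
        """Accept a command if there is room.

        Returns (bank_hit, row_hit) relative to the previous command,
        or None when the queue is full and the requester must wait.
        """
        if self.full:
            return None
        bank_hit = self._last is not None and self._last[0] == bank
        row_hit = self._last is not None and self._last[1] == row
        self._entries.append((op, bank, row, column))
        self._last = (bank, row)
        return bank_hit, row_hit

    def pop(self):
        """Hand the oldest command to the memory controller."""
        return self._entries.popleft()
```

In this model a requester spends a single call (one cycle in hardware terms) on `push` and then continues with other work, which mirrors the motivation given for the queue.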

➢ Bank Controller

To fully utilize the SDRAM bandwidth and apply memory scheduling, the memory interface must be able to process accesses addressed to different banks in parallel.

This work is performed jointly by the bank controllers and the master bank controller. The bank controller is illustrated in Figure 4.18.


Figure 4.18 The structure of bank controllers, master bank controller and timing unit.

Each internal bank of the SDRAM is assigned an individual bank controller that processes the accesses addressed to that bank. The master bank controller assigns each incoming address command to the appropriate bank controller according to the access status. The scalable timing unit records the command latencies and related parameters, such as the burst length, tRP (precharge period), and tRCD (ACTIVE to READ or WRITE delay). The parameters of the scalable timing unit are defined by the user during initial setup. After accepting an access from the input port, a bank controller generates the sequence of access commands according to the burst length and latencies defined in the scalable timing unit. These access commands are collected by the master bank controller, which issues the proper command to the SDRAM. The read data buffer holds the sequentially received read data for motion compensation. The write data buffer holds one burst length of data. The arbiter directs the write/read data and command flow to/from the external SDRAM according to the access operation.

Unlike traditional SDRAM access controller designs, which contain various “WAIT” states, Lee [30] proposed a configurable shared-state FSM design. This design merges the numerous “WAIT” states into a single NOP state. With the NOP_count and NOP_code status registers, the command latency of the FSM can be parameterized without redesigning the FSM. We design our access FSM based on this concept. The interface between the memory scheduler and the bank controller is depicted in Figure 4.17.
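The idea of a single shared NOP state can be illustrated with a small generator that emits one command per cycle for a single read to a closed row. This is only a sketch in the spirit of the shared-state concept, not the FSM of [30] or of the proposed controller: the function name, the two-state structure, and the simplified timing (CAS latency and auto-precharge are ignored, and each latency counts the cycle in which the command is issued) are all assumptions. The counter `nop_count` mirrors the NOP_count register, and the index of the next command is loosely analogous to NOP_code.

```python
def read_access_commands(t_rcd, burst_length, t_rp):
    """Yield one SDRAM command per cycle for a single closed-row read."""
    # Each entry: (command, cycles that must elapse before the next command).
    plan = [("ACTIVE", t_rcd), ("READ", burst_length), ("PRECHARGE", t_rp)]
    state = "ISSUE"
    step = 0        # index of the next command to issue
    nop_count = 0   # remaining cycles to spend in the shared NOP state
    while step < len(plan) or state == "NOP":
        if state == "ISSUE":
            command, latency = plan[step]
            yield command
            nop_count = latency - 1  # the issue cycle counts toward the latency
            step += 1
            state = "NOP" if nop_count > 0 else "ISSUE"
        else:
            # One NOP state serves every latency; only the counter differs.
            yield "NOP"
            nop_count -= 1
            if nop_count == 0:
                state = "ISSUE"


# Hypothetical timing values, in cycles.
sequence = list(read_access_commands(t_rcd=3, burst_length=2, t_rp=2))
```

Changing a latency only changes the value loaded into the counter, so no state has to be added or removed when the timing parameters change.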

Each bank requires an individual access FSM to control command processing and to wait until the previous access command returns to the IDLE state. In bank-miss situations (whether at the same row or not), the memory interface scheduler collects the access commands for the corresponding bank controllers and sends them to the arbiter at a suitable time. Besides the access FSM, each bank controller needs a row address register to record the activated row. By comparing incoming commands with the row address register of each bank controller, the bank-miss with row-hit and bank-miss with row-miss statuses can be detected.
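The role of the row address register can be sketched as below. The model uses the usual open-row terminology (row-hit, row-miss, idle bank) rather than the status names of the text, and the class `BankController`, the helper `classify`, and the status strings are hypothetical. The command sequences follow standard SDRAM behaviour: an access to the activated row needs only the READ or WRITE, an idle bank needs an ACTIVE first, and a different row needs a PRECHARGE followed by an ACTIVE.

```python
class BankController:
    """Tracks the activated row of one bank (the row address register)."""

    def __init__(self):
        self.open_row = None  # None means no row is currently activated

    def commands_for(self, op, row):
        """Return (status, commands) needed to serve op ("READ" or "WRITE") at row."""
        if self.open_row == row:
            status, commands = "row-hit", [op]
        elif self.open_row is None:
            status, commands = "bank-idle", ["ACTIVE", op]
        else:
            status, commands = "row-miss", ["PRECHARGE", "ACTIVE", op]
        self.open_row = row  # the requested row stays activated afterwards
        return status, commands


def classify(banks, op, bank, row):
    """Route an access to the controller of its bank."""
    return banks[bank].commands_for(op, row)


banks = [BankController() for _ in range(4)]  # assumed 4-bank SDRAM
first = classify(banks, "READ", 0, 7)
second = classify(banks, "READ", 0, 7)
third = classify(banks, "WRITE", 0, 9)
```

Because each bank keeps its own register, an access to another bank is classified independently of what bank 0 is doing, which is what allows accesses to different banks to be overlapped.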

➢ Memory Interface Access Scheduler

The memory interface access scheduler allocates and overlaps successive commands according to the access status produced by the status detection shown in Figure 4.17. It can schedule both READ and WRITE operations. In brief, the double access FSMs of each bank controller handle access conflicts at the same bank, while the master bank controller is responsible for overlapping accesses between different banks. Scheduling the SDRAM access commands raises the bus utilization and thereby improves the throughput of the entire video decoder. The arbiter directs the write/read data and command flow to/from the external SDRAM according to the access operation. Because multiple reference pictures are supported, a field list index controller is required to address the start point of each frame.

4.3 Simulation Results

Figure 4.19 depicts the system-level relation among the decoder, the memory controller, and the external memory. Because the decoder and the memory controller are both operating and transferring data only while reference data are being read, we only need to analyze the data transfer in this period.

Figure 4.19 System level analysis relation

Before going into the details of the analysis, we define the following equations to measure the performance of data transfer on the bus.

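\[
\text{Bus Utilization} =
\frac{\dfrac{\#\text{ of bus cycles required by memory interface}}{4\times4\text{ sub-block}} \times \dfrac{\#\text{ of }4\times4\text{ sub-blocks}}{\text{frame}} \times \dfrac{\#\text{ of frames}}{\text{sec}}}
{\dfrac{\#\text{ of bus cycles available}}{4\times4\text{ sub-block}} \times \dfrac{\#\text{ of }4\times4\text{ sub-blocks}}{\text{frame}} \times \dfrac{\#\text{ of frames}}{\text{sec}}}
\tag{4.1}
\]

\[
\text{Data Usage} =
\frac{\dfrac{\#\text{ of data required by decoder}}{4\times4\text{ sub-block}} \times \dfrac{\#\text{ of }4\times4\text{ sub-blocks}}{\text{frame}} \times \dfrac{\#\text{ of frames}}{\text{sec}}}
{\dfrac{\#\text{ of data available from memory interface}}{4\times4\text{ sub-block}} \times \dfrac{\#\text{ of }4\times4\text{ sub-blocks}}{\text{frame}} \times \dfrac{\#\text{ of frames}}{\text{sec}}}
\tag{4.2}
\]

\[
\text{Data Utilization} = \text{Bus Utilization} \times \text{Data Usage}
\tag{4.3}
\]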

Under the assumption that the data bus serves only the frame memory and the motion vector memory, higher bus utilization yields better throughput for our video decoder. Data usage depends on the burst length and on the window size required between the decoder and the memory controller. Hence, data usage can be treated as the proportion of the data required by the decoder to the data available from the SDRAM controller. In other words, data usage is related to the burst length chosen in the memory setup. To explain data usage clearly, consider the 9 x 9 interpolation window of a 4 x 4 block in H.264 fractional motion compensation; an example of the fetching window for four different burst lengths is illustrated in Figure 4.20. The fetching window is the total set of pixels that must be read from the SDRAM controller. Since the data bus width is limited to 4 pixels (32 bits), the height of the fetching window must be 12 pixels, a multiple of 4 pixels, when the burst length is 4. Similarly, the width of the other fetching windows must be a multiple of the burst length. Accordingly, among these burst length modes, the data usage is the poorest when the selected burst length is 4. From equation (4.3), data utilization is the product of bus utilization and data usage. Therefore, data utilization can be regarded as the proportion of the data required by the decoder to the data that the external bus is able to transfer. Higher data utilization means better throughput and lower latency for the entire video decoder.
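As a worked illustration of these three measures, the short Python fragment below evaluates them for one 4 x 4 block. The per-frame and per-second factors are identical in numerator and denominator, so the ratios are computed per block. The 9 x 9 required window comes from the text; the 12 x 12 fetched window is an assumption (the text gives only the height of 12 pixels for burst length 4, and 12 is also the smallest multiple of the 4-pixel bus width that covers 9 pixels), and the bus utilization of 0.8 is a purely hypothetical value, not a measured result.

```python
def bus_utilization(cycles_required, cycles_available):
    """Ratio of bus cycles required by the memory interface to cycles available."""
    return cycles_required / cycles_available


def data_usage(required_pixels, fetched_pixels):
    """Ratio of pixels the decoder needs to pixels delivered by the memory interface."""
    return required_pixels / fetched_pixels


def data_utilization(bus_util, usage):
    """Product of bus utilization and data usage."""
    return bus_util * usage


required = 9 * 9        # interpolation window of one 4x4 block (from the text)
fetched = 12 * 12       # assumed fetching window for burst length 4
usage = data_usage(required, fetched)         # 81 / 144
example = data_utilization(0.8, usage)        # 0.8 is a hypothetical bus utilization
```

Under these assumptions only 81 of the 144 fetched pixels are used, so even a well-utilized bus spends a large share of its transfers on pixels the decoder discards, which is the trade-off between bus utilization and data usage discussed above.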


Figure 4.20 Fetching windows of 4x4 block between different burst length

Figure 4.21 Unscheduled Bus utilization, Data usage and Data utilization for different burst length in memory

Figure 4.22 Scheduled Bus utilization, Data usage and Data utilization for different burst length in memory.

Figure 4.23 The data utilization between un-scheduling and scheduling

Figure 4.24 Average access cycles per MB between different burst length for access under BUS.

Figures 4.21 and 4.22 show the unscheduled and scheduled system-level analysis of the criteria (4.1) ~ (4.3). Intuitively, a longer burst length provides higher bus utilization because fewer access cycles are required for a larger amount of fetched data.

After scheduling, longer read bursts provide a longer overlapping period for the successive access commands; for instance, burst length = 4 has the highest bus utilization.

Although burst length = 4 gives the highest bus utilization, its lowest data usage makes its data utilization the lowest among these burst modes. The data usage is strongly affected by the different sizes of the fetching windows among the burst length modes.

For better data utilization in the decoder, the burst length = 1 mode is the better choice for a high-throughput video decoding system. Figure 4.23 shows the data utilization without and with scheduling. The bus utilization can be improved by about 90% using memory scheduling, so the data utilization is improved as well.

For the H.264/AVC HDTV decoder, Figure 4.24 depicts the average execution cycles per P_MB and B_MB for a 1080HD sequence at a bit rate of 614 Kbit/s, comparing different data-reuse approaches. After introducing the data reuse technique, the E2CMA method described in Chapter 3, the execution cycles can reach approximately 100 ~ 150 cycles. After memory scheduling, the execution cycles with the E2CMA approach can be reduced by about 150 ~ 200 cycles again. Compared with Ref. [9], the execution cycles per P_MB and B_MB are reduced by up to 55%. In our decoding system, the higher bus utilization and the lower access latency reduce the execution cycles required per P_MB and B_MB. Accordingly, the throughput of the entire video decoder improves, because the computation time of motion compensation dominates the video decoder, especially in an H.264/AVC decoder. The memory access bandwidth at different bit rates is depicted in Figure 4.25. The test sequence is in 1080HD format, and the SDRAM burst length is set to 4. The bandwidth achieved by our proposed data reuse approach is better than that of the other approaches, especially at high bit rates. Furthermore, applying E2CMA together with the memory scheduling technique improves the bandwidth further. Therefore, the memory access bandwidth can be efficiently improved by our proposed data reuse approach. In addition, the throughput of the entire video decoder working at 100 MHz is shown in Figure 4.26. To support high resolutions such as 1080HD, the video decoder has to meet the level 4.0 system specification. The throughput of the decoder that applies E2CMA and memory scheduling is double that of the one applying the previous data-reuse approach.

The decoder that applies the Column Major approach or Ref. [9] may not meet the level 4.0 specification of the H.264/AVC standard, especially in high bit-rate environments. That is, sequences in 1080HD format cannot be decoded in real time. To support higher-resolution sequences such as 1080HD, the E2CMA and memory scheduling techniques are suitable for an HDTV decoder in a real-time system.
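To put the level 4.0 requirement in perspective, the arithmetic below estimates the macroblock rate of a 1080HD stream and the resulting average cycle budget per macroblock at the 100 MHz clock used in this chapter. The 30 frames/s rate is the target stated in the conclusion (1080HD 30fps@L4). Rounding the picture height up to a whole number of 16-line macroblock rows and treating the budget as a plain average over all macroblocks are simplifying assumptions, and the helper names are illustrative only.

```python
CLOCK_HZ = 100_000_000       # operating frequency used in this chapter
MB_SIZE = 16                 # a macroblock covers 16 x 16 luma pixels
LEVEL_4_MAX_MBPS = 245_760   # maximum macroblock rate of level 4 in the H.264 standard


def macroblocks_per_frame(width, height):
    """Number of macroblocks, rounding each dimension up to a multiple of 16."""
    mbs_wide = -(-width // MB_SIZE)    # ceiling division
    mbs_high = -(-height // MB_SIZE)
    return mbs_wide * mbs_high


mbs = macroblocks_per_frame(1920, 1080)   # 120 x 68 macroblocks
mb_rate = mbs * 30                        # macroblocks per second at 30 frames/s
cycle_budget = CLOCK_HZ / mb_rate         # average cycles available per macroblock
```

With these numbers the stream stays just inside the level 4 macroblock rate, and a unit that handles one macroblock at a time has roughly 408 cycles per macroblock on average, which gives a scale for the per-macroblock cycle counts compared above.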

Figure 4.25 The bandwidth of memory access under external BUS among different bit-rate (x-axis: Bit Rate (Mbits/s), from 8.5 to 22.929; y-axis: Bandwidth (MBytes/s); series: Column Major, Ref. [9], E2CMA, E2CMA + scheduling)

Figure 4.26 The throughput of motion compensation for different data-reuse approach when operating frequency is 100 MHz.

4.4 Summary

In applications that require a high-performance SDRAM subsystem, any bandwidth loss may result in a system failure. For an H.264/AVC main profile decoder, this effect is a critical issue. Hence, the memory controller must be carefully designed to prevent any possible bandwidth loss. For this reason, we proposed a bandwidth-efficient memory controller that is built into the video decoder and can serve the different modules of the H.264/AVC decoder. The proposed memory controller can deal with two data types: motion vectors and pixels. Allowing users to configure the access mode for each SDRAM bank also gives more flexibility. We not only use the memory interface scheduler for scheduling but also adopt an efficient data arrangement to reduce the miss rate and to increase the utilization of memory space. From the system-level analysis, we observe that the bus utilization and access latency can be improved to 90%. The bandwidth of memory access between the decoder and the external memory can be improved by approximately 50%. The throughput of the decoder can conform to the level 4.0 system specification, especially at high bit rates.

Chapter 5 Chip Implementation

5.1 Chip Specification

Table 5.1 H.264/AVC main profile decoder specification for motion compensation

Table 5.1 lists the specification of our bandwidth-efficient motion compensation architecture for the H.264 HDTV decoder. After synthesis with Cadence RTL Compiler using UMC 0.13 µm CMOS technology, the total gate count is 557730 (including embedded SRAM), and the gate count of each component of the video decoder is listed in Table 5.2. The die size of the H.264 decoder is 3100 µm x 3100 µm. Table 5.3 lists the on-chip and off-chip memory used by each module in our design. The chip photo of the H.264 decoder is shown in Figure 5.1. The average power consumption of the system is approximately 50 mW. Furthermore, in the synthesis results of our proposed motion compensation and memory controller, the power consumption at 100 MHz is 9.53 mW for motion compensation and 3.9 mW for the memory controller, and the gate counts are 83515 and 8584 for motion compensation and the memory controller, respectively.

Table 5.2 Synthesis results of H.264/AVC’s main profile decoder including SRAM

Table 5.3 On/Off-Chip memory size for different module in H.264 main profile decoder


Figure 5.1 CHIP photo for H.264/AVC main profile decoder

Chapter 6 Conclusion

In this thesis, we present a bandwidth-efficient motion compensation memory controller organization for an H.264 HDTV decoder that supports the 1080HD 30fps@L4 high-quality format.

The proposed motion compensation engine realizes all the advanced features of the H.264/AVC main profile, including MV generators with direct modes, a combined luma/chroma interpolator, and weighted prediction. Concerning the design of the interpolator, the 4-parallel separate 1-D architecture offers the most headroom for a high-throughput video decoder compared with the other architectures proposed. An Extended-2D column-major approach is presented, and the proposed data reuse technique for fractional motion compensation introduces a content buffer, a content-swap operation, and register-file shifting attached to our interpolator design. This design improves the bandwidth over the external data bus by 50%-60% for B-slices. Additionally, a combined luma/chroma interpolator is proposed in order to save area, which achieves a cost reduction of approximately 44%. Altogether, memory usage and bandwidth are optimized by our proposed design.

In addition, the decoder system bottleneck that results from the performance limitation of the off-chip SDRAM subsystem leads system designers to put more effort into SDRAM efficiency. In conventional SDRAM controller designs, though different requirements for SDRAM service of the heterogeneous system components are often considered, high bandwidth utilization can be achieved for special applications such as high-definition TV. For this reason, the proposed memory controller reduces the bandwidth over the external bus using memory scheduling and improves the data access hit rate using data arrangement. To reduce the traffic on the bus, the memory controller architecture is proposed and related approaches are employed as well. The design target of the interpolator and the frame memory access controller is to reduce external memory accesses and improve the throughput of the entire video decoder. The SDRAM memory access controller attached to the video decoder is presented to cope with the tremendous transfer of pixel data to/from the external frame memories. To achieve efficient memory access scheduling, we discuss not only memory scheduling but also data arrangement within the SDRAM. The proposed data arrangement in our scheduling scheme minimizes the miss ratio at the same bank, which contributes the maximum latency among all scheduling cases. We create a system-level, hardware-like C++ model and use data utilization to analyze the system performance. Compared with the unscheduled situation, the experimental results show that the access latency can be reduced by 50% ~ 90% and the bandwidth utilization can be improved up to 90%. Meanwhile, the throughput of the overall video decoder improves by about 50% ~ 60% after combining the extended RSO method and memory scheduling.

In addition, the synthesis results show gate counts of 83515 and 8584 for motion compensation and the memory controller, respectively. The average power consumption of motion compensation and the memory controller is approximately 9.45 mW and 3.9 mW, respectively, at 100 MHz.

Bibliography

[1] “Draft ITU-T Recommendation and Final Draft International Standard of Joint Video Specification,” Joint Video Team (JVT), Int. Telecommun. Union-Telecommun. (ITU-T) and Int. Standards Org./Int. Electrotech. Comm. (ISO/IEC), ITU-T Recommendation H.264 and ISO/IEC 14496-10 AVC, May 2003.

[2] “Information technology-generic coding of moving pictures and associated audio information: Video,” ITU-T H.262, ISO/IEC 13818-2, 1994.

[3] Joint Video Team H.264/AVC Reference Software, Version JM 9.2. http://iphome.hhi.de/suehring/tml/download/

[4] T. Wiegand, G. J. Sullivan, G. Bjøntegaard, and A. Luthra, “Overview of the H.264/AVC
