CHAPTER 3 MOTION COMPENSATION DESIGN FOR MPEG-2/H.264 VIDEO
3.2 Combined Motion Compensation Engine for MPEG-2/H.264 Dual Video
3.2.2 Cost Analysis
Table 3.4 Number of adders for each filter design
Module | Function | # of adders
Horizontal/vertical FIR | 1/2 luma for H.264 | 6
Bilinear (average) | 1/4 luma for H.264; 1/2 luma and chroma for MPEG-2 | 1
1/8 horizontal/vertical FIR | Chroma for H.264 | 5
Combined horizontal/vertical filter | 1/2 luma for H.264; chroma for H.264 | 6
Adders dominate the area cost of the filter design. Table 3.4 lists the number of adders used in each kind of filter design described in the previous subsections. The horizontal/vertical FIR design presented by Chen [9] and the bilinear design are illustrated in Fig. 3.25. The chroma 1/8 horizontal/vertical filter, which converts the multiplier-based design depicted in Fig. 3.19 (b) into the adder-based design shown in Fig. 3.26 (b), requires 5 adders.
Table 3.5 compares our reconfigurable interpolator design with the traditional design. The number of adders and registers is reduced, although some control circuitry must be added to support multi-mode operation. After synthesis in UMC 0.18 um technology, the total gate count is reduced by about 20 %.
Table 3.5 Comparison of requisite modules based on 4-parallel separate 1-D architecture
(Traditional design: separate H.264 luma, H.264 chroma and MPEG-2 bilinear interpolators)
Module | Traditional design | Our reconfigurable design
Horizontal FIR | 9 | 7
Vertical FIR | 4 | 2
Bilinear (average) | 8 | 4
1/8 horizontal bilinear | 2 | 0
1/8 vertical bilinear | 2 | 0
Combined horizontal filter | 0 | 2
Combined vertical filter | 0 | 2
Content buffer | 54 x 8 bits | 54 x 8 bits
Shift register array | (54 + 18 + 4) x 8 bits | 54 x 8 bits
Total adders | 106 | 82
Total registers (# of bits) | 1040 | 864
Gate count | 16376 | 13013
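The totals in Table 3.5 can be cross-checked with a short calculation. The sketch below assumes that each module instance contributes the per-module adder count of Table 3.4, and that the "1/8 horizontal/vertical bilinear" rows of Table 3.5 correspond to the 5-adder 1/8 filter of Table 3.4. The dictionary and function names are illustrative only and do not come from the design itself.

```python
# Adders per module instance, read from Table 3.4 (mapping of names is an assumption).
ADDERS_PER_INSTANCE = {
    "Horizontal FIR": 6,
    "Vertical FIR": 6,
    "Bilinear (average)": 1,
    "1/8 horizontal bilinear": 5,
    "1/8 vertical bilinear": 5,
    "Combined horizontal filter": 6,
    "Combined vertical filter": 6,
}

# Module instance counts, read from Table 3.5.
TRADITIONAL = {
    "Horizontal FIR": 9,
    "Vertical FIR": 4,
    "Bilinear (average)": 8,
    "1/8 horizontal bilinear": 2,
    "1/8 vertical bilinear": 2,
}
RECONFIGURABLE = {
    "Horizontal FIR": 7,
    "Vertical FIR": 2,
    "Bilinear (average)": 4,
    "Combined horizontal filter": 2,
    "Combined vertical filter": 2,
}


def total_adders(design):
    """Sum the adders over all module instances of a design."""
    return sum(ADDERS_PER_INSTANCE[name] * count for name, count in design.items())


def register_bits(*entry_counts, width=8):
    """Total register bits for buffers holding the given numbers of 8-bit entries."""
    return sum(entry_counts) * width


def reduction_percent(before, after):
    """Relative saving of `after` with respect to `before`, in percent."""
    return 100.0 * (before - after) / before


if __name__ == "__main__":
    print("adders:", total_adders(TRADITIONAL), total_adders(RECONFIGURABLE))
    print("register bits:", register_bits(54, 54 + 18 + 4), register_bits(54, 54))
    print("gate count saving (%):", round(reduction_percent(16376, 13013), 1))
```

Under these assumptions the computed adder and register totals agree with the table (106 vs. 82 adders, 1040 vs. 864 bits), and the gate count saving evaluates to roughly 20.5 %, consistent with the "about 20 %" stated in the text.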
3.3 Summary
In this chapter, a motion compensation engine for an MPEG-2/H.264 dual-video decoder is presented. To cope with the heavy data access to frame memories, especially under the high motion precision of the advanced video standard H.264/AVC, the proposed data reuse technique for fractional motion compensation efficiently reduces the requisite reference data. Regarding design sharing across multiple standards, our reconfigurable interpolator saves about 20 % of the gate count compared with the traditional design, and it fully supports standard-compatible fractional interpolation for the MPEG-2/H.264 video decoder. Besides, the 4-parallel separate 1-D architecture is also suitable for high-throughput SDTV/HDTV video decoders.
Chapter 4
Frame Memory Organization
To deal with the tremendous data transfer and storage in multimedia systems, software or hardware technologies must provide high data bandwidth and efficient real-time memory scheduling. In a video decoding system, the irregular data access pattern and the large storage of multidimensional data always dominate system performance, including throughput and power consumption [12]. To flexibly support devices from mobile up to high-definition TV, the frame memories, which are the largest memory storage in the entire video decoder, are located off-chip. Nevertheless, data transfer to off-chip memory is always bound by limited bandwidth. To improve memory bandwidth, modern DRAM families such as synchronous DRAM (SDRAM), reduced latency DRAM (RLDRAM) and double-data-rate SDRAM (DDR SDRAM) are now widely applied in video systems [13]. In this chapter, we choose SDRAM as the external frame memory.
Many SDRAM controllers have been proposed to improve memory bandwidth utilization and achieve efficient memory access. According to the environment, they can be categorized into two classes: single-channel and multi-channel. For the single-channel environment, Rixner's memory access scheduler [14] reorders the access addresses from each bank controller and sends commands to the DRAM through an address arbiter. However, because the output commands may be out of order, many command FIFOs and extra circuits are required to reorder commands and addresses. Miura's dynamic-SDRAM-mode-control scheme [15] eliminates the above disadvantage and it can both reduce operating current and
sequence. For the multi-channel environment, Lee's quality-aware memory controller [15] supports different scheduling policies according to the current channel situation. These memory controllers mainly target general-purpose environments. On the other hand, for application-specific designs, especially video codecs, several papers have proposed improvements in power consumption or memory bandwidth utilization. Kim's memory-interface architecture [17] reorganizes the data arrangement in SDRAM to reduce energy consumption and increase memory bandwidth. Park's history-based memory mode control [18] reduces page misses, achieving 23.3 % lower energy consumption and 18.8 % lower memory latency. Zhu's SDRAM controller for an H.264 HDTV decoder [20] focuses on memory mapping and data arrangement in SDRAM to reduce page active cycles; meanwhile, it also improves throughput and lowers power consumption. However, it does not provide memory scheduling, and its adoption of auto precharge rather than manual precharge also leads to some loss of bus bandwidth. We will show the advantage of manual precharge in subsection 4.1.2. The above memory control techniques concentrate individually on either the memory scheduler or the data arrangement in SDRAM. Both issues should be taken into account carefully in a memory controller, especially for multi-dimensional data oriented systems such as video codecs and graphics processing units (GPUs). To achieve an all-round integration, we consider both memory access scheduling and data arrangement in designing our SDRAM controller. In addition, not only the communication between the SDRAM controller and the data bus has to be analyzed, but also the interface between motion compensation and the SDRAM controller. The above discussion of related works is summarized in Table 4.1, and the application of our dual-channel SDRAM controller focuses on a built-in video decoder. Section 4.1 will give the detailed design of our dual-channel SDRAM controller.
Table 4.1 Related works of SDRAM memory controller
Related work Application Improvement Techniques
Rixner’s [14] General-purpose Single-channel
Bandwidth,
latency Memory scheduling
Miura’s [15]
32-bit RISC CPU Single-channel
STB
Latency,
Power Memory mode control
Lee’s [15]
Latency Memory scheduling
Kim’s [17]
Bandwidth Data arrangement
Park’s [18]
Zhu’s [20] H.264 HDTV decoder Multi-channel
Bandwidth utilization,
Decoding throughput Data arrangement
Our work
Frame memories always dominate the storage size of a video decoder. Generally speaking, at least two frame memories, which store the current and reference frames, are required for an H.264@Baseline video decoder. Several methods have been proposed to reduce the required memory, and they fall mainly into two solutions: frame recompression and frame memory reorganization. The first solution, whose concept is depicted in Fig. 4.1, recompresses video frame data before storing it to frame memory; correspondingly, decompression is required when reading reference frame data from frame memory. This recompression method must support the random access capability demanded by motion compensation and must have low complexity due to the limited memory bandwidth. In this respect, many algorithms have been proposed, such as Tajime's [22] 2-D adaptive DPCM in the pixel domain and Lee's [23] modified Hadamard transform with Golomb-Rice (GR) coding. However, the frame recompression method leads to extra area cost and even requires additional execution cycles to compress data, so that the throughput of the video decoder degrades. The second solution, frame memory reorganization, combines the current frame and the reference frame; the idea can first be found in De Greef's work [24]. Besides, the Interuniversity MicroElectronics Center (IMEC) has widely applied this idea to an H.264 video decoder system [25], MPEG-4 motion estimation [26] and a video encoder [27]. Particularly in Brockmeyer's [26] and Denolf's [27] work, the concept of memory hierarchy [28] combined with merging structured frame memory achieves data reuse and reduces redundant data access. However, they only focus on C-level simulation and target DSP or FPGA platforms. For an ASIC implementation, many issues still have to be overcome. For example, the data copy and update between background memory and intermediate memory must be considered in an ASIC design. As another issue, extra hardware cost is needed to record the update status in the in-place FIFO [25], the intermediate region between the new frame and the old frame. For advanced development,
and reduce up to 83 % average latency and 39 % average power consumption. We will discuss the methodology proposed by Chang [29] and apply it to the H.264/AVC video decoder.
Fig. 4.1 Frame recompression method (blocks: video decoder, recompress, frame memory, decompress)
The rest of this chapter is organized as follows. First, the SDRAM characteristics are described in Section 4.1. Then, Section 4.2 discusses our dual-channel frame memory access controller design. In addition, the merging structured frame memory organization, a novel memory structure that reduces the required frame memory size, is presented in Section 4.3. Finally, a summary is given in Section 4.4.
4.1 SDRAM characteristic
Fig. 4.2 Simplified architecture of a 4-bank SDRAM
Fig. 4.3 Simplified bank state diagram (states: IDLE, ACTIVE; transitions: row active, column access, precharge)
A simplified architecture of a 4-bank SDRAM is shown in Fig. 4.2. The four banks share the address bus and command bus, while each bank has its own row decoder, sense amplifier, and column decoder. The mode register stores several SDRAM operation modes, including burst length (BL), column address strobe (CAS) latency (abbreviated as CL), and burst type (sequential / interleave). The content of the mode register is updated according to the command issued on the address bus. SDRAM can be treated as a 3-D structure with the dimensions of bank, row,
Fig. 4.3, contains three steps: row activation, column access, and precharge. First, a row activation command with the bank address is sent to open (or activate) one row in a particular bank, and the designated row address is issued on the address bus. This command copies the row data into the row buffer of the selected bank, and row activation needs an activation latency called tRCD (ACTIVE to READ or WRITE delay) to accomplish this operation. Then, a column access command is used to access sequential data or a single datum according to the burst length and burst type in the mode register. Read and write data are transferred through the same data bus. For a read operation, the valid data-out element from the starting column address becomes available after the CAS latency following the READ command, as shown in Fig. 4.4. For a write operation, the first valid data-in element is coincident with the WRITE command, as shown in Fig. 4.5. Finally, a precharge command must be issued before opening a different row in the same bank, whereas precharge and active commands need not be issued if the following access is still in the same row and bank. After a precharge command is issued, the selected bank cannot be accessed during the precharge latency named tRP (PRECHARGE command period).
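The three-step access sequence can be illustrated with a minimal bank model. The sketch below is only an illustration under simplifying assumptions: it tracks a single bank, ignores all timing (tRCD, tRP, CL), and uses hypothetical names (`Bank`, `commands_for`) that do not come from the controller design.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Bank:
    """One SDRAM bank: IDLE when no row is open, ACTIVE otherwise."""
    open_row: Optional[int] = None

    def commands_for(self, row: int, op: str = "READ") -> List[str]:
        """Return the command sequence needed to access `row` with `op`."""
        commands = []
        # A different row is open: it must be closed first (ACTIVE -> IDLE).
        if self.open_row is not None and self.open_row != row:
            commands.append("PRECHARGE")
            self.open_row = None
        # No row is open: activate the requested one (IDLE -> ACTIVE).
        if self.open_row is None:
            commands.append("ACTIVE")
            self.open_row = row
        # Column access on the open row.
        commands.append(op)
        return commands


if __name__ == "__main__":
    bank = Bank()
    print(bank.commands_for(0))           # first access to an idle bank
    print(bank.commands_for(0))           # same row again
    print(bank.commands_for(1, "WRITE"))  # different row in the same bank
```

In this model, an access to the already open row needs only the column access command, while an access to a different row in the same bank needs the full precharge, activation and column access sequence.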
Fig. 4.4 Burst read operation with CL=3 and BL=4.
Fig. 4.5 Burst write operation with CL=3 and BL=4.
Table 4.2 CAS latency
CL | 1 | 2 | 3
Allowable operating frequency (MHz) | ≤50 | ≤100 | ≤166
4.1.2 Access Latency
Lee discussed the access latencies of different access statuses in [15]; however, a more detailed classification is required for fine-grained access command scheduling. The memory behavior model used in our design is Micron's MT48LC2M32B2P-5 64Mb SDRAM [21].
Table 4.2 lists the three allowable maximum operating frequencies provided by this SDRAM according to the CAS latency stored in the mode register. When the CAS latency is set to 3, the SDRAM can operate at a higher frequency. However, the larger CAS latency that enables this higher frequency also demands more stall cycles for each read column access. Therefore, the CAS latency should be set carefully for different applications. For example, 50 MHz with
large frame size format such as SDTV or HDTV format.
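The trade-off between operating frequency and CAS latency can be expressed as a simple lookup. The sketch below picks the smallest CL that Table 4.2 allows for a target frequency; the function name `min_cas_latency` is hypothetical, and the limits are copied from Table 4.2 rather than from the device datasheet.

```python
# Maximum allowable operating frequency (MHz) for each CAS latency, from Table 4.2.
CAS_LATENCY_MAX_MHZ = {1: 50, 2: 100, 3: 166}


def min_cas_latency(freq_mhz: float) -> int:
    """Return the smallest CAS latency usable at the given operating frequency."""
    for cl in sorted(CAS_LATENCY_MAX_MHZ):
        if freq_mhz <= CAS_LATENCY_MAX_MHZ[cl]:
            return cl
    raise ValueError(f"{freq_mhz} MHz exceeds the limits listed in Table 4.2")


if __name__ == "__main__":
    for f in (50, 100, 133):
        print(f, "MHz -> CL =", min_cas_latency(f))
```

Choosing the smallest admissible CL keeps the number of stall cycles per read column access as low as the target frequency permits.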
Fig. 4.6 Access latency for CL=2 (a) read access latency, (b) write access latency (latency components: tRP, tRCD, CAS latency)
Fig. 4.6 illustrates the read/write access latency under different statuses when CL=2. Row-miss status means that the activated row in the selected bank is not the row addressed by the incoming access command; it induces (PRECHARGE + ACTIVE + CAS) latency for a read access and (PRECHARGE + ACTIVE) latency for a write access. Bank-miss with row-miss status means that the incoming bank address differs from that of the previous command and the selected row in the incoming bank is not activated. For this status, the required latency is the same as that of the row-miss status. Bank-miss with row-hit status indicates that the incoming row has already been activated, although the incoming bank differs from the previous one. For this status and the row-hit status, the column access can be issued directly and only a read access incurs CAS latency. Based on the above discussion, memory scheduling can overlap sequential access commands and hide the latencies fully or partially.
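The latency rules for the four access statuses can be summarized in a few lines. The sketch below is a simplified model of an unscheduled access, not the controller itself: the default values tRP = tRCD = 2 cycles are assumptions chosen for illustration, and the function name `access_latency` is hypothetical.

```python
ROW_MISS_STATUSES = ("row-miss", "bank-miss with row-miss")
ROW_HIT_STATUSES = ("row-hit", "bank-miss with row-hit")


def access_latency(status: str, op: str, cl: int = 2,
                   t_rp: int = 2, t_rcd: int = 2) -> int:
    """Latency in cycles before data transfer starts, without scheduling.

    t_rp and t_rcd are assumed cycle counts, not values taken from the text.
    """
    if status in ROW_MISS_STATUSES:
        latency = t_rp + t_rcd   # PRECHARGE + ACTIVE
    elif status in ROW_HIT_STATUSES:
        latency = 0              # column access can be issued directly
    else:
        raise ValueError(f"unknown status: {status}")
    if op == "READ":
        latency += cl            # only reads incur CAS latency
    elif op != "WRITE":
        raise ValueError(f"unknown operation: {op}")
    return latency


if __name__ == "__main__":
    for status in ROW_MISS_STATUSES + ROW_HIT_STATUSES:
        print(status, access_latency(status, "READ"), access_latency(status, "WRITE"))
```

The model makes the asymmetry explicit: a write on a row hit has no latency at all, while a read on a row miss pays all three components.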
Fig. 4.7 READ command with auto precharge. In the precharge period (tRP), SDRAM cannot issue another command to the same bank (bank 0).
SDRAM also supports another precharge method called auto precharge, which does not require an explicit PRECHARGE command. When auto precharge is requested together with a READ/WRITE command, the precharge of the addressed bank/row is performed automatically upon completion of the READ/WRITE burst, except in full-page burst mode, where auto precharge does not apply. Auto precharge ensures that the precharge is initiated at the earliest valid stage within a burst. As shown in Fig. 4.7, no other command can be issued to the same bank until the precharge time (tRP) has elapsed. If the following command must access the same bank, overlap scheduling is limited: the following command can be issued only after the tRP period completes, or it must be reordered with another command. As a further disadvantage, READ/WRITE with auto precharge means that the SDRAM always deactivates the selected bank at the end of a burst command. If the following command still addresses the same bank, time is wasted re-activating that bank, which leads to longer latency. Therefore, we select manual precharge rather than auto precharge in our memory access controller design.
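The re-activation cost of auto precharge can be made concrete by counting row activations for consecutive accesses to the same row of one bank. The sketch below is a rough model under stated assumptions: tRP = tRCD = 2 cycles are illustrative values, the penalty is an upper bound that ignores any overlap of the precharge with the tail of a burst, and the function names are hypothetical.

```python
def activations_needed(n_accesses: int, auto_precharge: bool) -> int:
    """Row activations for n consecutive accesses to the same row of one bank."""
    if n_accesses <= 0:
        return 0
    # Auto precharge closes the row after every burst, so each access re-opens it.
    # With manual precharge the row stays open and is activated only once.
    return n_accesses if auto_precharge else 1


def reopen_penalty_cycles(n_accesses: int, t_rp: int = 2, t_rcd: int = 2) -> int:
    """Upper bound on extra cycles spent closing and re-opening the row.

    t_rp and t_rcd are assumed cycle counts, not values taken from the text.
    """
    extra = activations_needed(n_accesses, True) - activations_needed(n_accesses, False)
    return extra * (t_rp + t_rcd)


if __name__ == "__main__":
    for n in (1, 2, 4, 8):
        print(n, "accesses -> up to", reopen_penalty_cycles(n), "extra cycles")
```

The penalty grows linearly with the number of same-row accesses, which is the situation a row-hit oriented data arrangement tries to create and manual precharge is able to exploit.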
4.2 Dual Channel Frame Memory Access Controller
Fig. 4.8 READ/WRITE operation in (a) I frame, (b) P frame, (c) B frame. (Memory controller with Frame 0 and Frame 1: (a) WRITE current; (b) READ reference, WRITE current; (c) READ reference, READ reference.)
For frame memory access in a video decoding system, we only need to concentrate on consecutive read or consecutive write accesses instead of read-to-write or write-to-read accesses, because the read/write operation changes at the frame level. Based on conventional ping-pong structured frame memories [28], in which one memory stores the reference frame and the other stores the current frame, Fig. 4.8 shows the read/write operations for three different frame types. For an I frame, the memory access controller writes reconstructed data to the current frame. For a P frame, the memory access controller reads reference data while writing reconstructed data to the current frame. For a B frame in MPEG-2, the memory access controller reads data from the previous and following reference frames, because a B frame is never referenced. In an H.264/AVC video decoder, however, a B frame may be referenced by other P/B frames, so the frame memory issue becomes more complicated. We can set the nal_ref_idc flag in H.264/AVC such that a B frame is never used as a reference frame. In this section, we only focus on I/P frames for the H.264 and I/P frames for the MPEG-2 video decoder.
Table 4.3 Characteristics of read/write-access
Access | Required density | Influence factors
Read | High | Bitstream, memory scheduling, data arrangement in memory
Write | Low or medium | Bitstream, capability of residual decoder (de-blocking filter only for H.264)
4.2.1 Memory Access Scheduling
The target of memory access scheduling is to overlap or reorder consecutive DRAM commands (PRECHARGE, ACTIVE, CAS) to improve bandwidth utilization and reduce access latency. Because the external access of a video decoder is a bandwidth-sensitive channel [15], the memory access scheduler must compress and even reorder DRAM commands to achieve high bandwidth utilization. Furthermore, considering read access and write access separately, the required density of write access is highly correlated with the capability of the residual decoder and the properties of the decoded bitstream, while the required density of read access is as tight as possible. For a high bit-rate video sequence, the decoded bitstream contains more coefficients and higher-precision decoding tokens, which may require more decoding cycles. In this situation, write access becomes less bandwidth-sensitive and its density need not be very tight. A poorly designed residual decoder or de-blocking filter also affects the bandwidth utilization of write access. Unlike write access, read access needs high-density access scheduling because it is a highly bandwidth-sensitive channel. Read requests are only sent by
memory scheduler design, data arrangement in SDRAM and the handshake command scheme of motion compensation. The characteristics of write/read-access discussed above are summarized in Table 4.3.
Fig. 4.9 Two row-miss unscheduled and scheduled read memory accesses (CL=2, BL=4)
Considering read/write access from/to the frame memories, write access requires low or medium density depending on the capability of the residual decoder, whereas motion compensation requires high-density read access. Therefore, we concentrate on read access and design a high-density scheduler for it that is also suitable for write access. Fig. 4.9 shows an example of two unscheduled and scheduled read memory accesses when row misses occur at different banks. Taking (CL=2, BL=4) as an example, the unscheduled accesses take 20 cycles to read eight burst data, whereas the scheduled accesses require only 14 cycles and the eight burst data are read consecutively. From the access latency discussion in Section 4.1.2, access commands without auto precharge can be classified into two types: the long command (PRE + ACT + CAS) and the short command (CAS), shown in Fig. 4.5. Moreover, we consider the latency after access scheduling for BL=1, 2 and 4, illustrated in Figs. 4.10 to 4.12, and summarize the induced latency for each situation in Table 4.4. The worst latency always occurs in the row-miss situation. To reduce access latency, the command request ordering and data arrangement should therefore aim to minimize the occurrence of row misses.
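The 20-cycle and 14-cycle figures quoted for Fig. 4.9 can be reproduced with a simple cycle count. The sketch below assumes tRP = tRCD = 2 cycles, values inferred here so that the model matches the quoted totals rather than stated in the text, and it assumes that the row misses fall in different banks with bursts long enough to hide the second PRE/ACT completely, as in the BL=4 example. The function names are hypothetical.

```python
def unscheduled_cycles(n_accesses: int, bl: int = 4, cl: int = 2,
                       t_rp: int = 2, t_rcd: int = 2) -> int:
    """Each row-miss read runs PRE, ACT, READ and its burst strictly in sequence."""
    return n_accesses * (t_rp + t_rcd + cl + bl)


def scheduled_cycles(n_accesses: int, bl: int = 4, cl: int = 2,
                     t_rp: int = 2, t_rcd: int = 2) -> int:
    """PRE/ACT of later accesses overlap earlier bursts (different banks assumed).

    Only the first access pays the full latency; later bursts follow back to back.
    This holds only when the burst is long enough to hide PRE + ACT, as for BL=4.
    """
    if n_accesses <= 0:
        return 0
    return t_rp + t_rcd + cl + n_accesses * bl


if __name__ == "__main__":
    print("unscheduled:", unscheduled_cycles(2))
    print("scheduled:  ", scheduled_cycles(2))
```

For shorter bursts the overlap is only partial, which is why Table 4.4 reports residual latency for the bank-miss with row-miss case at BL=1 and BL=2.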
Fig. 4.10 Scheduled consecutive read access for (BL=1, CL=2) when previous command is (a) long command (PRE+ACT+CAS), (b) short command (CAS). (Cases shown: row-miss, bank-miss with row-miss, row-hit.)
Fig. 4.11 Scheduled consecutive read access for (BL=2, CL=2) when previous command is (a) long command (PRE+ACT+CAS), (b) short command (CAS)
Fig. 4.12 Scheduled consecutive read access for (BL=4, CL=2) when previous command is (a) long command (PRE+ACT+CAS), (b) short command (CAS)
Table 4.4 Latency for scheduled consecutive read access when CL=2
BL | Previous command | Incoming command | Latency
1 | (PRE + ACT + CAS) | Row-miss | 4
1 | (PRE + ACT + CAS) | Bank-miss with row-miss | 2
1 | (PRE + ACT + CAS) | Row-hit or bank-miss with row-hit | 0
1 | CAS | Row-miss | 4
1 | CAS | Bank-miss with row-miss | 4
1 | CAS | Row-hit or bank-miss with row-hit | 0
2 | (PRE + ACT + CAS) | Row-miss | 4
2 | (PRE + ACT + CAS) | Bank-miss with row-miss | 1
2 | (PRE + ACT + CAS) | Row-hit or bank-miss with row-hit | 0
2 | CAS | Row-miss | 4
2 | CAS | Bank-miss with row-miss | 3
2 | CAS | Row-hit or bank-miss with row-hit | 0
4 | (PRE + ACT + CAS) | Row-miss | 4
4 | (PRE + ACT + CAS) | Bank-miss with row-miss | 0
4 | (PRE + ACT + CAS) | Row-hit or bank-miss with row-hit | 0
4 | CAS | Row-miss | 4
4 | CAS | Bank-miss with row-miss | 1
Row-hit or
Row-hit or