CHAPTER 4 MEMORY BANDWIDTH REDUCTION
4.1 REDUCTION STRATEGIES OF MEMORY BANDWIDTH
Memory bandwidth dominates the performance of the entire video decoder. Several methods have been proposed to reduce the required memory bandwidth, and they fall into two main directions: frame recompression, and reduction of redundant pixel transmission. Fig 4.2 illustrates the concept of frame recompression. Frame data are compressed before being written to frame memory, and reference frame data are decompressed before being read into the video decoder. However, a frame recompression method must address several issues: the random access capability demanded by motion compensation, low complexity to limit area cost and power, and a minimal number of additional execution cycles for compression and decompression so that the real-time throughput requirement of the video decoder is still met. We do not go into detail here because our system has two dedicated modules for this purpose: an embedded compressor between motion compensation and frame memory, and an embedded decompressor between frame memory and the de-blocking module.
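[Figure: the video decoder and the frame memory communicate over the global bus; data are recompressed on the way to frame memory and decompressed on the way back]
Fig 4.2 Embedded compress/decompress method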
The second solution, reducing the transmission of redundant pixels, can itself be divided into two approaches: reducing the number of data fetches, and reusing data (pixels). The following subsections discuss these bandwidth reduction strategies in detail. Subsection 4.1.1 presents the first strategy for reducing the number of data fetches, and Subsection 4.1.2 presents the second. Subsection 4.1.3 presents the first strategy for data reuse, and Subsection 4.1.4 presents the second.
4.1.1 Exact Fetch Necessary Pixels
Fig 4.3 Fractional sample positions for quarter sample luma interpolation
Fig 4.3 illustrates the luma samples 'a' to 's' at fractional sample positions. The traditional method always fetches a 9 x 9 interpolation window when interpolating a fractional pixel. However, not every fractional sample position requires all of these pixels. For example, the sample at the half sample position labeled b is derived from the nearest integer position samples in the horizontal direction only. Similarly, the sample at the half sample position labeled h is derived from the nearest integer position samples in the vertical direction only. Fig 4.4 shows that interpolating the samples at positions a, b, and c needs only a 9 x 4 interpolation window, and Fig 4.5 shows that interpolating the samples at positions d, h, and n needs only a 4 x 9 interpolation window.
We can use the motion vector value to fetch exactly the necessary pixels instead of always fetching the full 9 x 9 interpolation window. As with luma interpolation, the chroma interpolation window can also be decided from the motion vector. Table 4.1 summarizes the luma interpolation windows, and Table 4.2 summarizes the chroma interpolation windows. This strategy is also used in other designs [10], [11], [14]. The resulting bandwidth reduction is shown later.
[Figure: 9x4 reference pixels are interpolated into 4x4 output pixels]
Fig 4.4 Fractional samples that need only horizontal samples
[Figure: 4x9 reference pixels are interpolated into 4x4 output pixels]
Fig 4.5 Fractional samples that need only vertical samples
Table 4.1 Summary of luma interpolation windows
Pixel position            Interpolation window size (width x height)
G (Integer)               4x4
a, b, c (Horizontal)      9x4
d, h, n (Vertical)        4x9
e, g, p, r                9x4+4x5
others                    9x9
Table 4.2 Summary of chroma interpolation windows
Pixel position            Interpolation window size
Integer                   2x2
Horizontal                3x2
Vertical                  2x3
Others                    3x3
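The window selection summarized in Tables 4.1 and 4.2 can be sketched in a few lines of Python. This is only an illustrative sketch, not the decoder's actual implementation. The function names are hypothetical, sizes are written as (width, height), luma motion vectors are assumed to be in quarter-sample units and chroma motion vectors in eighth-sample units, and the block size is fixed at 4 x 4 luma and 2 x 2 chroma.

```python
def luma_window(mv_x, mv_y):
    """Rectangles (width, height) to fetch for one 4x4 luma block.

    Assumption: mv_x and mv_y are in quarter-sample units, so the two
    low bits give the fractional position.
    """
    fx, fy = mv_x & 3, mv_y & 3
    if fx == 0 and fy == 0:
        return [(4, 4)]              # G: integer position
    if fy == 0:
        return [(9, 4)]              # a, b, c: horizontal samples only
    if fx == 0:
        return [(4, 9)]              # d, h, n: vertical samples only
    if fx != 2 and fy != 2:
        return [(9, 4), (4, 5)]      # e, g, p, r
    return [(9, 9)]                  # remaining positions: full window


def chroma_window(mv_x, mv_y):
    """Rectangle (width, height) to fetch for one 2x2 chroma block.

    Assumption: chroma motion vectors are in eighth-sample units.
    """
    fx, fy = mv_x & 7, mv_y & 7
    return [(3 if fx else 2, 3 if fy else 2)]


def pixel_count(rects):
    """Total number of pixels covered by a list of rectangles."""
    return sum(w * h for w, h in rects)
```

Under these assumptions, a block at position b fetches 36 luma pixels rather than the 81 pixels of the full 9 x 9 window.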
4.1.2 Pre-fetch Mechanism
The second strategy for reducing the number of fetches is the pre-fetch mechanism. The frame memory is by far the largest storage in the entire video decoder, so it is located off-chip. Because the bus interface has a fixed width, each fetch of an interpolation window may also bring in pixels that are not needed at that moment. If we save these pixels, they may be used by a later interpolation window, which further reduces the number of fetches. Fig 4.6 illustrates the mismatch between the interpolation window and the bus interface, together with the pre-fetch mechanism. This strategy is also used in another design [11].
[Figure: a 9x9 pixel window that does not align with the memory boundaries of the 32-bit (4-pixel) bus interface; the extra pixels fetched are kept as pre-fetch pixels]
Fig 4.6 Pre-fetch mechanism
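The effect of keeping the extra pixels can be illustrated with a small model of one pixel row. This is a sketch under stated assumptions rather than the actual hardware behavior: pixels are 8 bits wide, the 32-bit bus delivers 4 pixels per aligned word, and the saved words are modeled as an unbounded set of word indices. The names `words_covering` and `fetch_row` are hypothetical.

```python
PIXELS_PER_WORD = 4  # 32-bit bus, assuming 8-bit pixels


def words_covering(x_start, width):
    """Indices of the aligned bus words that cover pixels
    x_start .. x_start + width - 1 of one row."""
    first = x_start // PIXELS_PER_WORD
    last = (x_start + width - 1) // PIXELS_PER_WORD
    return range(first, last + 1)


def fetch_row(x_start, width, saved_words):
    """Return how many new bus words must be fetched for this row,
    given the set of words already saved, and record the new ones."""
    needed = set(words_covering(x_start, width))
    new_words = needed - saved_words
    saved_words.update(needed)   # keep everything, including unneeded pixels
    return len(new_words)
```

For example, a 9-pixel row starting at x = 2 needs three words (12 pixels). If those words are kept, the next 9-pixel row starting at x = 6 needs only one additional word instead of three.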
4.1.3 Intra MB Pixel Reusing
[Figure: four neighboring 4x4 blocks and their 9x9 interpolation windows; the overlapped regions, each 5 pixels wide or 5 pixels high, are labeled A and B]
Fig 4.7 4x4 block window and the corresponding 9x9 interpolation window and overlapped region for neighboring interpolation window
As with fetch reduction, pixel reuse can be separated into two strategies: intra MB overlap pixel reuse and inter MB overlap pixel reuse. The concept of overlap pixel reuse is as follows. If the motion vectors of two horizontally neighboring 4 x 4 blocks are the same, the 5 x 9 overlap region between their two interpolation windows can be reused. Similarly, if the motion vectors of two vertically neighboring 4 x 4 blocks are the same, the 9 x 5 overlap region between their two interpolation windows can be reused. Fig 4.7 illustrates the case in which the motion vectors of four neighboring 4 x 4 blocks are the same, together with the corresponding 9 x 9 interpolation windows. The two vertical 5 x 9 overlap regions, indicated by "A", and the two horizontal 9 x 5 overlap regions, indicated by "B", can be reused.
The first strategy of overlap pixel reuse is intra MB overlap pixel reuse, illustrated in Fig 4.8. Several methods have been proposed in [14-16]. In [14], Tsai proposed VIDZ to achieve both horizontal and vertical data reuse. In addition, with the VIDZ flow, all vertically overlapped interpolation windows can be reused without additional storage. However, because it violates the inherent double-z-scan order, VIDZ cannot fit into a 4 x 4-block level pipeline. Moreover, from a system point of view, VIDZ requires extra synchronization buffers because its scanning order differs from that of other modules (for example, the residual decoder), which must follow the scanning order in the standard [1].
[Figure: intra MB interpolation window overlap regions between the sixteen 4x4 blocks of an MB, numbered as follows]
0  1  4  5
2  3  6  7
8  9  12 13
10 11 14 15
Fig 4.8 Intra MB overlap pixels reusing
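The sizes of the overlap regions can be checked with a short Python sketch that models each 9 x 9 interpolation window as a set of pixel coordinates. It is an illustration only and assumes that all four 4 x 4 blocks share the same motion vector, so the common displacement can be left out. The helper `window` and the constants are hypothetical names.

```python
WIN = 9  # interpolation window is 9 x 9 pixels
BLK = 4  # block is 4 x 4 pixels


def window(bx, by):
    """Set of (x, y) reference pixels in the interpolation window of the
    block at block coordinates (bx, by). The shared motion vector offset
    is omitted because it is the same for every block."""
    x0, y0 = bx * BLK, by * BLK
    return {(x, y) for x in range(x0, x0 + WIN) for y in range(y0, y0 + WIN)}


# Overlap between horizontally and vertically neighboring windows.
horizontal_overlap = len(window(0, 0) & window(1, 0))   # 5 x 9 region
vertical_overlap = len(window(0, 0) & window(0, 1))     # 9 x 5 region

# Four neighboring blocks: fetch each window separately, or fetch the union once.
blocks = [(0, 0), (1, 0), (0, 1), (1, 1)]
separate_fetch = sum(len(window(bx, by)) for bx, by in blocks)
shared_fetch = len(set().union(*(window(bx, by) for bx, by in blocks)))
```

In this model each overlap region holds 45 pixels, and the four windows together cover 169 distinct pixels, compared with 324 pixels when every window is fetched independently.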
4.1.4 Inter MB Pixel Reusing
The second strategy of overlap pixel reuse is inter MB overlap pixel reuse. Until now, the literature on neighboring-based pixel reuse has focused almost entirely on reusing pixels inside the same MB. However, the overlap regions between interpolation windows located in neighboring MBs can also be reused. Fig 4.9 illustrates the overlapped region for neighboring interpolation windows in horizontally neighboring MBs. We choose to store only the horizontal MB overlap regions. The reason is that reusing the vertical MB overlap regions would require storing overlap regions across the entire frame width, while offering only limited room for improvement. Subsection 4.3 will show the analysis.
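[Figure: overlapped regions between the interpolation windows of 4x4 blocks in horizontally neighboring MBs]
Fig 4.9 Inter MB overlap pixels reusing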
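A rough estimate shows why storing vertical MB overlap regions is costly. The sketch below is a back-of-the-envelope model, not a measured result. It assumes one 45-pixel overlap region per 4 x 4 block on the MB boundary, four such blocks along one MB edge, and a hypothetical frame width of 1920 pixels. The function names are illustrative.

```python
OVERLAP_PIXELS = 5 * 9        # one overlap region (5 x 9 or 9 x 5)
BLOCKS_PER_MB_EDGE = 4        # a 16 x 16 MB has four 4 x 4 blocks per edge


def horizontal_reuse_storage():
    """Pixels to keep for reuse with the next MB to the right: only the
    blocks on the right edge of the current MB are involved."""
    return BLOCKS_PER_MB_EDGE * OVERLAP_PIXELS


def vertical_reuse_storage(frame_width):
    """Pixels to keep for reuse with the MB row below: the bottom-edge
    blocks of every MB across the frame width must be kept until the
    next MB row is decoded."""
    blocks_per_row = frame_width // 4
    return blocks_per_row * OVERLAP_PIXELS
```

With these assumptions, horizontal reuse keeps 180 pixels, while vertical reuse for a 1920-pixel-wide frame keeps 21600 pixels.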
The required content buffers are 5 x 9 pixels and 9 x 5 pixels for the horizontal and vertical overlapped regions of neighboring interpolation windows, respectively. To minimize the content buffer size, we perform a lifetime analysis of the reference data, which shows that only three horizontal and three vertical blocks need to be saved in the worst case. Table 4.3 shows the lifetime analysis. The horizontal axis shows the 4 x 4 partition order, the vertical axis shows the storage used, and each field gives the partition whose horizontal or vertical overlap region is stored. For example, at partition 1, the horizontal overlap region of partition 1 is stored in H0 and the vertical overlap region of partition 1 is stored in V0. The content buffers can be implemented in local registers or in SRAM. However, SRAM needs several cycles to finish a content-swap operation, so we use local registers to minimize the latency of content swapping.
Table 4.3 Storage requirement and lifetime analysis
4 x 4 partition order (time): 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 0 1 2 3 …
Storage H0: 1 1 3 3 5 15 …
Storage H1: 7 1 1 3 …
Storage H2: 9 9 11 11 13 …
Storage V0: 0 1 2 7 12 13 0 1 2 …
Storage V1: 3 8 9 3 …
Storage V2: 4 5 6 11 …
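The total register cost implied by the lifetime analysis follows from simple arithmetic, sketched below. The pixel depth of 8 bits is an assumption, and the variable names are illustrative.

```python
H_SLOTS = 3                 # H0, H1, H2: horizontal overlap regions
V_SLOTS = 3                 # V0, V1, V2: vertical overlap regions
H_REGION_PIXELS = 5 * 9     # 5 x 9 horizontal overlap region
V_REGION_PIXELS = 9 * 5     # 9 x 5 vertical overlap region
BITS_PER_PIXEL = 8          # assumption: 8-bit samples

content_buffer_pixels = H_SLOTS * H_REGION_PIXELS + V_SLOTS * V_REGION_PIXELS
content_buffer_bits = content_buffer_pixels * BITS_PER_PIXEL
```

Under this assumption the six content buffers hold 270 pixels in total, or 2160 bits of local registers.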