CHAPTER 3 MOTION COMPENSATION DESIGN FOR H.264/AVC MAIN/HIGH
3.4 Weighted Prediction
Weighted prediction is the final stage of motion compensation, placed after the interpolator.
Weighted prediction scales motion-compensated samples to improve video quality in H.264/AVC decoding. In this subsection, we propose a weighted predictor architecture that is collocated with the interpolator to eliminate the latency overhead. The weighted prediction architecture proposed by Chen [16] has low complexity, but it has a long critical path and a large memory requirement (1.5 kb). Azevedo's weighted predictor [12] is a direct-mapped implementation of the equation and requires an embedded memory to store the rounding coefficients. In contrast to the direct-mapped design, whose equation is listed in Eq. 3.3, we can use the same predictor twice to generate the predicted value, first for LIST_0 prediction and then for LIST_1 prediction, as shown in Eq. 3.4. The rounding and offset components can be computed in advance and combined in the same stage. Therefore, the predictor can be further modified to shorten the critical path, as shown in Eq. 3.5. In Eq. 3.5, W0 denotes the weight factor, whose value depends on the weight flag in the bit-stream. When the weight flag is equal to 1, the weight factor shall be in the range of -128 to 127, inclusive. When the weight flag is equal to 0, the weight factor shall be in the range of 2^0 to 2^7, inclusive [1].
From the above discussion, by examining the two most significant bits of the weight factor, we can use an 8-bit multiplier and a shifter instead of a 9-bit multiplier. The predictor is shown in Fig 3.17.
Fig 3.17 Weighted predictor design (block labels: MUX, offset, <<7, multiplier, Weight factor[7:0], Weight factor[8:7], predPart, LogWD, Round)
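As a rough illustration of the arithmetic discussed above, the following Python sketch models explicit weighted prediction for a single sample using the sample-level formulas of the H.264/AVC standard [1], followed by a multiplier-narrowing step along the lines of the text. The function names (`clip1`, `weighted_uni`, `weighted_bi`, `mul_weight_8bit`) are hypothetical, 8-bit samples are assumed, and the way the upper weight bits select between the multiplier and the `<<7` shifter is only one possible reading of Fig 3.17, not a description of the actual hardware.

```python
def clip1(x, bit_depth=8):
    """Clip a value to the valid sample range [0, 2^bit_depth - 1]."""
    return max(0, min((1 << bit_depth) - 1, x))


def weighted_uni(pred, w, o, log_wd):
    """Explicit weighted prediction for a sample predicted from one list."""
    if log_wd >= 1:
        return clip1(((pred * w + (1 << (log_wd - 1))) >> log_wd) + o)
    return clip1(pred * w + o)


def weighted_bi(pred0, pred1, w0, w1, o0, o1, log_wd):
    """Explicit weighted prediction for a bi-predicted sample."""
    acc = pred0 * w0 + pred1 * w1 + (1 << log_wd)
    return clip1((acc >> (log_wd + 1)) + ((o0 + o1 + 1) >> 1))


def mul_weight_8bit(pred, w):
    """Assumed narrowing trick: w lies in [-128, 128], and only w == 128
    needs a ninth bit (bits [8:7] == 0b01), so that single case is served
    by a left shift instead of a wider multiplier."""
    if w == 128:
        return pred << 7          # shifter path
    assert -128 <= w <= 127
    return pred * w               # 8-bit signed multiplier path
```

Under this reading, every weight in the combined range -128 to 128 produces the same product as a 9-bit multiplier would, while the multiplier itself only sees 8-bit signed operands.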
Moreover, when a B slice is involved, the predictor hardware is shared and operated twice. In addition, a 4 x 4 storage array is required to hold intermediate results. Fig 3.18 illustrates the complete weighted predictor design. Like the temporal direct mode in the motion vector generator, the weighted predictor has an implicit mode in which the weighting factors are calculated from the relative temporal positions of the LIST_0 and LIST_1 reference pictures. The weighting factors in implicit mode are derived from the temporal direct mode data-path in order to reduce hardware cost. Furthermore, the divider dominates the area cost and computation time of the temporal direct mode design. We can replace the divider with a look-up table (LUT) because the dividend is a constant value. Table 3.2 compares the implementation results. The design of [12] is not included in the comparison because the related details are not available.
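To make the divider replacement concrete, here is a hedged Python sketch of the implicit weight derivation as specified in the H.264/AVC standard [1], with the division by td replaced by a table indexed by td. The names `TX_LUT`, `trunc_div`, and `implicit_weights` are hypothetical, and the table organisation is an assumption for illustration only; the LUT layout of the actual design is not described here. Special cases such as long-term reference pictures are omitted.

```python
def clip3(lo, hi, x):
    """Clamp x to the inclusive range [lo, hi]."""
    return max(lo, min(hi, x))


def trunc_div(a, b):
    """Integer division truncating toward zero, as used in the standard."""
    q = abs(a) // abs(b)
    return q if (a >= 0) == (b >= 0) else -q


# tx = (16384 + |td / 2|) / td depends only on td, which is clipped to
# [-128, 127], so all 255 non-zero cases can be tabulated in advance.
TX_LUT = {td: trunc_div(16384 + abs(trunc_div(td, 2)), td)
          for td in range(-128, 128) if td != 0}


def implicit_weights(poc_cur, poc_ref0, poc_ref1):
    """Return (w0, w1) for implicit weighted prediction (logWD = 5)."""
    tb = clip3(-128, 127, poc_cur - poc_ref0)
    td = clip3(-128, 127, poc_ref1 - poc_ref0)
    if td == 0:
        return 32, 32                      # default equal weights
    dist_scale = clip3(-1024, 1023, (tb * TX_LUT[td] + 32) >> 6)
    w1 = dist_scale >> 2
    if w1 < -64 or w1 > 128:
        return 32, 32                      # out-of-range fallback
    return 64 - w1, w1
```

For a picture midway between its two references, the sketch returns equal weights; a picture closer to the LIST_0 reference receives a larger LIST_0 weight.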
Fig 3.18 Entire weighted predictor architecture (block labels: Predictor 1, Predictor 2, B-L0, B-L1, P slice, MUX, 4x4 Buffer, Average, Clip, Luma/Cb, Luma/Cr)
Table 3.2 Comparison of execution cycles for different architectures
                     ICASSP'06 [16]    Proposed
Multiplier (bits)    9                 8
Technology           .18um             .90um
Gate count           12,960            6,412
Working frequency    87MHz             100MHz
3.5 Summary
In this chapter, a motion compensation engine for an H.264/AVC Main/High Profile decoder is presented. Regarding design sharing across profiles, our MVG uses the same modules and storage to handle P slices and B slices, in both MBAFF and non-MBAFF modes.
Our restructured interpolator is area-efficient compared with the traditional design, and it is suitable for high-throughput applications such as decoding video coded with B slices.
In addition, the weighted predictor achieves area efficiency by sharing hardware with the temporal direct mode and by shortening the critical path. When the weighted predictor is collocated with the interpolator, it requires only one cycle of latency overhead.
Chapter 4
Memory Bandwidth Reduction
Fig 4.1 4 x 4 block window and the corresponding 9 x 9 interpolation window (labels: 4x4 output pixels, 9x9 reference pixels)
For luma interpolation, half-position samples are interpolated with a 6-tap FIR filter and quarter-position samples with a bilinear filter. This means the interpolator needs six reference pixels to produce one interpolated pixel. As Fig 4.1 shows, interpolating the fractional sample values of each 4 x 4 block requires a 9 x 9 interpolation window. Chroma interpolation is similar in concept to luma: interpolating the fractional sample values of each 2 x 2 block requires a 3 x 3 interpolation window. When the frame size is large and the frame rate is high, interpolation places a heavy load on memory bandwidth. Moreover, motion compensation in Main/High Profile supports B slices, in which the reference frames increase from one direction to two. Consequently, Main/High Profile doubles the memory bandwidth requirement. In the worst case, the interpolator requires a memory bandwidth of 398 MB/s in P slices and 796 MB/s in B slices to support 1080 HD at 30 fps. A heavy memory bandwidth load also means large power consumption for bus activity and data operations.
The rest of this chapter is organized as follows. Section 4.1 discusses our memory bandwidth reduction strategies. Section 4.2 presents an analysis of the bandwidth reduction limit. Finally, a summary is given in Section 4.3.
4.1 Reduction strategies of memory bandwidth
Memory bandwidth dominates the performance of the entire video decoder. Several methods have been proposed to reduce the required memory bandwidth, and they can be classified into two directions: frame recompression, and reduction of redundant pixel transmission. Fig 4.2 illustrates the concept of frame recompression. Frame data are compressed before being written to frame memory, and reference frame data are decompressed before being read into the video decoder. However, a frame recompression method must address several issues: the random access capability demanded by motion compensation, low complexity for the sake of area cost and power saving, and a minimal number of additional execution cycles for compressing and decompressing data, so that the real-time throughput requirement of the video decoder is met. We do not go into detail here because our system has two dedicated modules for this purpose: an embedded compressor between motion compensation and frame memory, and an embedded decompressor between frame memory and the de-blocking module.
Fig 4.2 Embedded compress/decompress method (labels: Video Decoder, Frame Memory, recompress, decompress, Global bus)
The second solution, reducing the transmission of redundant pixels, can itself be divided into two approaches: reducing the number of data fetches, and reusing data (pixels). The following subsections discuss these memory bandwidth reduction strategies in detail. Subsection 4.1.1 describes the first strategy for reducing data fetches, and Subsection 4.1.2 the second. Subsection 4.1.3 describes the first strategy for data reuse, and Subsection 4.1.4 the second.
4.1.1 Exact Fetch Necessary Pixels
Fig 4.3 Fractional sample positions for quarter sample luma interpolation
Fig 4.3 illustrates the luma samples 'a' to 's' at fractional sample positions. In the traditional method, a 9 x 9 interpolation window is always fetched when a fractional pixel is interpolated. However, not every fractional sample position requires all of these pixels. For example, the sample at the half-sample position labeled b is derived from the nearest integer-position samples in the horizontal direction. Similarly, the sample at the half-sample position labeled h is derived from the nearest integer-position samples in the vertical direction. Fig 4.4 shows that interpolating the samples at positions a, b, and c needs only a 9 x 4 interpolation window. Fig 4.5 shows that interpolating the samples at positions d, h, and n needs only a 4 x 9 interpolation window.
Based on the motion vector value, we can fetch exactly the necessary pixels instead of the full 9 x 9 interpolation window. As with luma interpolation, the chroma interpolation window can also be decided from the motion vector. Table 4.1 summarizes the luma interpolation windows, and Table 4.2 summarizes the chroma interpolation windows. This strategy is also used in other designs [10], [11], [14]. The bandwidth reduction results are presented later.
Fig 4.4 Fractional samples that need only horizontal samples (labels: 4x4 output pixels, 9x4 reference pixels)
Fig 4.5 Fractional samples that need only vertical samples (labels: 4x9 reference pixels, 4x4 output pixels)
Table 4.1 Summary of luma interpolation windows
Pixel position            Interpolation Window Size
G (Integer)               4x4
a, b, c (Horizontal)      4x9
d, h, n (Vertical)        9x4
e, g, p, r                9x4+4x5
others                    9x9
Table 4.2 Summary of chroma interpolation windows
Pixel position            Interpolation Window Size
Integer                   2x2
Horizontal                3x2
Vertical                  2x3
Others                    3x3
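The window selection summarized in Tables 4.1 and 4.2 can be sketched as a small lookup on the fractional part of the motion vector. The Python below is a minimal illustration under stated assumptions: the function names are hypothetical, luma motion vectors are in quarter-sample units and chroma motion vectors in 1/8-sample units, and the mapping of fractional offsets to the labels a to r follows the position naming of the standard [1]. It returns pixel counts only, so it does not depend on whether a window is written width first or height first.

```python
def luma_window_pixels(mvx, mvy):
    """Reference pixels needed for one 4x4 luma block (quarter-pel MV)."""
    fx, fy = mvx & 3, mvy & 3          # fractional parts, 0..3
    if fx == 0 and fy == 0:
        return 4 * 4                   # G: integer position
    if fy == 0:
        return 9 * 4                   # a, b, c: horizontal filtering only
    if fx == 0:
        return 4 * 9                   # d, h, n: vertical filtering only
    if fx % 2 == 1 and fy % 2 == 1:
        return 9 * 4 + 4 * 5           # e, g, p, r: one row band plus one column band
    return 9 * 9                       # f, i, j, k, q: full window


def chroma_window_pixels(mvx, mvy):
    """Reference pixels needed for one 2x2 chroma block (1/8-pel MV)."""
    fx, fy = mvx & 7, mvy & 7
    width = 2 if fx == 0 else 3
    height = 2 if fy == 0 else 3
    return width * height
```

For example, a luma motion vector of (5, 0) has a horizontal fraction only and needs 36 pixels instead of 81, while (2, 2) corresponds to position j and still needs the full 9 x 9 window.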
4.1.2 Pre-fetch Mechanism
The second strategy for reducing the number of fetches is the pre-fetch mechanism. The frame memory is the largest storage in the entire video decoder, so it is located off-chip. Because the bus interface has a fixed width, each access may fetch unneeded pixels when an interpolation window is fetched. If these pixels are saved, they may be used later, which further reduces the number of fetches. Fig 4.6 illustrates the mismatch between the interpolation window and the bus interface, together with the pre-fetch mechanism. This strategy is also used in another design [11].
Fig 4.6 Pre-fetch mechanism (labels: bus interface is 32 bits = 4 pixels, 9x9 pixels window size, pre-fetch pixels, memory boundary)
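The mismatch between a window row and the bus width can be quantified with a short calculation. The following Python sketch assumes 8-bit pixels on a 32-bit bus (4 pixels per access, as labeled in Fig 4.6) and word-aligned accesses; `row_fetch` is a hypothetical helper, not part of the design.

```python
BUS_PIXELS = 4  # 32-bit bus, 8-bit pixels (assumption taken from Fig 4.6)


def row_fetch(x_start, width=9):
    """Return (bus_words, surplus_pixels) for one window row.

    x_start is the column of the first needed pixel; accesses are assumed
    to be aligned to BUS_PIXELS boundaries.
    """
    first_word = x_start // BUS_PIXELS
    last_word = (x_start + width - 1) // BUS_PIXELS
    words = last_word - first_word + 1
    return words, words * BUS_PIXELS - width
```

Under these assumptions a 9-pixel row always costs three accesses and brings in three surplus pixels, and a misaligned 4-pixel row costs two accesses with four surplus pixels. These surplus pixels are the candidates that a pre-fetch buffer could keep for later blocks.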
4.1.3 Intra MB Pixel Reusing
Fig 4.7 4x4 block window and the corresponding 9x9 interpolation window and overlapped region for neighboring interpolation window (labels: overlap regions A and B, dimensions 4, 4, 5)
Like fetch reduction, pixel reuse can be divided into intra-MB and inter-MB overlap pixel reuse. The idea is that if the motion vectors of two horizontally neighboring 4 x 4 blocks are the same, the 5 x 9 overlap region between their interpolation windows can be reused. Similarly, if the motion vectors of two vertically neighboring 4 x 4 blocks are the same, the 9 x 5 overlap region between their interpolation windows can be reused. Fig 4.7 shows four neighboring 4 x 4 blocks with the same motion vector and the corresponding 9 x 9 interpolation windows. There are two vertical 5 x 9 overlap regions, indicated by "A", and two horizontal 9 x 5 overlap regions, indicated by "B", that can be reused.
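The size of the reusable region follows directly from the window and block dimensions. As a minimal sketch, assuming two adjacent 4 x 4 blocks with identical motion vectors and windows given as width and height in pixels, the hypothetical helper below counts the shared pixels:

```python
BLOCK = 4  # 4x4 block size in pixels


def overlap_pixels(win_w, win_h, direction):
    """Pixels shared by the windows of two adjacent 4x4 blocks with equal MVs."""
    if direction == "horizontal":      # neighbor to the left or right
        return max(0, win_w - BLOCK) * win_h
    if direction == "vertical":        # neighbor above or below
        return win_w * max(0, win_h - BLOCK)
    raise ValueError("direction must be 'horizontal' or 'vertical'")
```

For 9 x 9 windows this gives 45 shared pixels in either direction, matching the 5 x 9 and 9 x 5 regions above, while a 4 x 4 window (integer position) shares nothing with its neighbors.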
The first strategy of overlap pixel reuse is intra-MB overlap pixel reuse, illustrated in Fig 4.8. Several methods have been proposed in [14-16]. In [14], Tsai proposed VIDZ to achieve horizontal and vertical data reuse. With the VIDZ flow, all vertically overlapped interpolation windows can be reused without additional storage. However, because it violates the inherent double-z-scan order, VIDZ cannot fit into a 4 x 4-block-level pipeline. Moreover, from a system view, VIDZ requires extra synchronization buffers because its scanning order differs from that of other modules (for example, the residual decoder), which must follow the scanning order in the standard [1].
Fig 4.8 Intra MB overlap pixels reusing (4 x 4 block order within the MB: 0 1 4 5 / 2 3 6 7 / 8 9 12 13 / 10 11 14 15; label: intra MB interpolation window overlap region)
4.1.4 Inter MB Pixel Reusing
The second strategy of overlap pixel reuse is inter-MB overlap pixel reuse. So far, the literature on neighbor-based pixel reuse has focused almost entirely on reusing pixels inside the same MB. However, the overlap regions between interpolation windows located in neighboring MBs can also be reused. Fig 4.9 illustrates the overlapped region of neighboring interpolation windows in horizontally neighboring MBs. We choose to store only the horizontal MB overlap regions. This is because reusing the vertical MB overlap regions would require storing MB overlap regions for the entire frame width while providing only a limited improvement. Section 4.2 presents the analysis.
Fig 4.9 Inter MB overlap pixels reusing
The required content buffers are 5 x 9 pixels and 9 x 5 pixels for the horizontal and vertical overlapped regions of neighboring interpolation windows, respectively. To minimize the content buffer size, a lifetime analysis of the reference data shows that only three horizontal and three vertical blocks need to be saved in the worst case. Table 4.3 shows the lifetime analysis. The horizontal axis shows the 4 x 4 partition order, the vertical axis shows the storage used, and each field gives the partition whose horizontal or vertical overlap region is stored. For example, at partition 1, the horizontal overlap region of partition 1 is stored in H0 and the vertical overlap region of partition 1 is stored in V0. Content buffers can be implemented in local registers or SRAM. However, SRAM needs several cycles to complete a content-swap operation, so we use local registers to minimize the content-swap latency.
Table 4.3 Storage requirement and lifetime analysis
4 x 4 Storage
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 0 1 2 3 …
H0 1 1 3 3 5 15 …
H1 7 1 1 3 …
H2 9 9 11 11 13 …
V0 0 1 2 7 12 13 0 1 2 …
V1 3 8 9 3 …
V2 4 5 6 11 …
Time
4.2 Limit of Reduced Memory Bandwidth
Our memory bandwidth reduction can be classified into fetch reduction and pixel reuse. For fetch reduction, there is a limit on the memory bandwidth reduction, reached in the ideal condition where all pixels are located at integer positions. Table 4.4 lists all pixel positions and their reduction percentages. The reduction percentage compares the original 9 x 9 interpolation window with the exact interpolation window that fetches only the required pixels. Table 4.4 shows that even when all pixels are located at the integer position G, the limit of memory bandwidth reduction is about 80%. However, it is impossible for all pixels to be at integer positions in a real sequence.
Table 4.4 Summary of luma interpolation windows and reduction percent
Pixel position       Interpolation Window Size    Reduction percent
G                    4x4                          80.25%
a, b, c              4x9                          55.56%
d, h, n              9x4                          55.56%
e, g, p, r           9x4+4x5                      30.86%
f, i, j, k, q        9x9                          0%
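The percentages in Table 4.4 follow from comparing each exact window with the full 9 x 9 window. A minimal Python check, with the hypothetical names `WINDOW_PIXELS` and `reduction_percent`, is:

```python
FULL_WINDOW = 9 * 9  # pixels fetched by the traditional method

# Pixel counts of the exact windows listed in Table 4.4
WINDOW_PIXELS = {
    "G": 4 * 4,
    "a, b, c": 4 * 9,
    "d, h, n": 9 * 4,
    "e, g, p, r": 9 * 4 + 4 * 5,
    "f, i, j, k, q": 9 * 9,
}


def reduction_percent(pixels, full=FULL_WINDOW):
    """Percentage of pixels saved relative to the full 9x9 window."""
    return round(100.0 * (1 - pixels / full), 2)
```

Applying `reduction_percent` to each entry reproduces the right-hand column of the table, for example 16 of 81 pixels for position G.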
As for pixel reuse, Fig 4.10 illustrates the case in which all partitions can use all overlap regions. Table 4.5 summarizes the reduction percentages for different overlap regions. Even if, in the ideal condition, every partition can reuse both its horizontal and vertical overlap regions, the limit of bandwidth reduction is about 80%. However, because of the raster scanning order, reusing the overlap region of the previous upper MB requires that the overlap region of each MB be saved until all following MBs across the frame width have been processed; only then can it be reused and discarded. In other words, upper-MB overlap pixel reuse requires storing the MB overlap regions of an entire row of the frame. This storage depends on the frame width and is often very large; for example, it needs 21.6 KB for 1080 HD. The storage is too large for an improvement of only 6% in memory bandwidth reduction. Hence, we select intra-MB reuse together with left-MB reuse. In the ideal condition, this achieves up to 74% bandwidth saving, which is close to the ideal limit without a large overhead.
Fig 4.10 All overlap regions, including those between the previous upper MB and the left MB (labels: current MB with 4 x 4 block order 0 1 4 5 / 2 3 6 7 / 8 9 12 13 / 10 11 14 15, previous upper MB overlap region, previous left MB overlap region)
Table 4.5 Summary of reduction percent in different overlap region
Overlap region                    Reduction percent
(all) Intra MB                    65.97%
Intra MB + left MB                74.07%
Intra MB + left MB + upper MB     80.25%
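The figures in Table 4.5 can be reproduced with a simple pixel count, assuming the ideal case in which all sixteen 4 x 4 blocks of every MB share one motion vector and need 9 x 9 windows. The Python below is a sketch under that assumption. The storage estimate further assumes one byte per pixel and four 9 x 5 regions kept per MB, which is one accounting that yields the quoted 21.6 KB and may not be the exact one intended.

```python
BLOCK, EXTRA, MB = 4, 5, 16          # block size, extra filter pixels, MB size

# Baseline: sixteen independent 9x9 windows per MB
BASELINE = (MB // BLOCK) ** 2 * (BLOCK + EXTRA) ** 2   # 1296 pixels
SIDE = MB + EXTRA                                      # 21: merged window side


def reduction(fetched):
    """Percentage saved relative to sixteen independent 9x9 fetches."""
    return round(100.0 * (1 - fetched / BASELINE), 2)


intra = reduction(SIDE * SIDE)            # one merged 21x21 window per MB
intra_left = reduction(MB * SIDE)         # 16 new columns x 21 rows
intra_left_upper = reduction(MB * MB)     # only 16x16 new pixels


def upper_row_storage_bytes(frame_width):
    """Bytes needed to keep upper-MB overlap regions for one MB row."""
    mbs_per_row = frame_width // MB
    return mbs_per_row * (MB // BLOCK) * 9 * 5
```

With a 1920-pixel frame width, the storage estimate evaluates to 21,600 bytes, and the gap between the last two reduction values is about 6 percentage points, which is the trade-off discussed above.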
Fetch reduction and data reuse cannot both reach their ideal conditions at the same time. This is because an integer-position pixel needs no other pixels for interpolation. In other words, the reference pixels are simply bypassed, so there are no overlap pixels to be reused. Fig 4.11 shows that even when the motion vectors of two neighboring 4 x 4 blocks are the same, there is no overlap region between the two interpolation windows for data reuse.
Fig 4.11 No overlap region can be reused (labels: block 0, block 1, 4x4 output pixels, 4x4 reference pixels)
4.3 Summary
In this chapter, two directions for reducing the memory bandwidth requirement are adopted, and within them four strategies are used to reduce memory bandwidth efficiently. Finally, the limit of memory bandwidth reduction is analyzed. The simulation results in Chapter 5 show that our strategies are effective, since the achieved reduction is close to this limit.
Chapter 5
Experiment Result
5.1 System Specification
Table 5.1 Video decoder specification in our design
H.264/AVC decoder
  I, P, B slice
  Variable block size: 16x16, 16x8, 8x16, 8x8, 8x4, 4x8, 4x4
  Single reference frame (each direction)
  Search range: [-128, +127.75]
  Fractional motion resolution: quarter for luma, 1/8 for chroma
  Frame/Field coding
Scalable High Profile (future)
Decoding capability:
  H.264/AVC: 1080 HD, 30fps
  SVC: 720 HD – 1080 HD, 30fps (future)
Working Frequency:
  H.264/AVC: 100 MHz
  SVC: 150 MHz (future)
External Memory and Bus
Table 5.1 lists the specification of our H.264 video decoder. Fig 5.1 shows the whole H.264/AVC video decoder, which includes an embedded compressor and an embedded decompressor to further reduce the memory bandwidth requirement. Fig 5.2 shows the simulation results of applying our memory bandwidth reduction strategies [17]. Memory bandwidth is reduced by 71~80%, which is very close to the limit derived in Section 4.2. Fig 5.3 shows the comparison with related work [14]. Looking at the extreme cases, the memory bandwidth reduction is very close to the limit for the Akiyo sequence, while the difference in memory bandwidth reduction is largest for the Stefan sequence. To analyze these two sequences further, Fig 5.4 shows the ratio of pixel positions in the Akiyo and Stefan sequences. The fractional sample positions for luma interpolation are shown in Fig 4.3. In the Akiyo sequence, the ideal condition (integer pixels) accounts for up to 90%, so the memory bandwidth reduction is very close to the limit. In the Stefan sequence, the pixel positions are nearly uniformly distributed; in other words, the ideal condition occurs less often. That is, as the ratio of fractional positions increases, our design improves bandwidth reduction significantly compared with other works (up to 11%).
Fig 5.1 Motion compensation engine for H.264 video decoder (labels: Deblocking Filter, AHB Master/Slave Interface & SVC Arbiter, BUS, Data Fetch, Data Fetch & Operation)
Fig 5.2 Simulation results of bandwidth reduction strategies
Fig 5.3 Comparison with related works
Fig 5.4 Ratio of pixels position in Akiyo and Stefan sequence
Although the above discussion depends on the characteristics of individual sequences, Fig 5.5 and Fig 5.6 [8] show the proportion of luma and chroma integer/fractional motion vectors at different bit-rates for foreman-QCIF. In high bit-rate coding (128 kbps), fractional motion vectors account for about 80%, whereas at a low bit-rate (32 kbps) they account for only 40%. A higher bit-rate gives a higher proportion of fractional MVs and better quality, but fractional motion vectors need more execution time than integer motion vectors to read pixel data from frame memory. This gap may become more obvious when SDRAM is used as frame memory. In other words, our proposal is better suited to high bit-rates than previous works because it achieves a higher reduction of memory bandwidth.
Fig 5.5 Luma integer/fractional motion vector proportion for H.264/AVC (plot: foreman-QCIF; x-axis: bit rate (kbps), 32 to 128; y-axis: proportion; series: integer, fraction)
Fig 5.6 Chroma integer/fractional motion vector proportion for H.264/AVC (plot: foreman-QCIF; x-axis: bit rate (kbps), 32 to 128; y-axis: proportion; series: integer, fraction)
5.2 Comparison with Related Works
Table 5.2 compares our design with related works on motion compensation. We focus only on memory bandwidth reduction and interpolator design, because memory bandwidth is the bottleneck of motion compensation and the interpolator is its key module. In addition, each related work supports a different specification. Our memory bandwidth optimization is better than that of previous works, although our storage is not the smallest; the storage size is the result of a trade-off that yields better performance. As for the interpolator, [10] and [11] share hardware and operate it twice to achieve area efficiency. Although such hardware sharing is suitable for Baseline Profile, its throughput is too low to meet real-time decoding in Main/High Profile. Our interpolator gate count is very close to those of the previous works [10], [11] while providing enough throughput for Main/High Profile.
Table 5.2 H.264 decoder comparison with related work
ISCAS
Interpolator 20,686 15,000 13,027 21,506 11,823 13,201
total 43k 61k 32k 47k N/A 68k
Chapter 6
Conclusion and Future Work
6.1 Conclusion
The motion compensation engine consists of three parts: the motion vector generator, the interpolator, and the weighted predictor. First, the motion vector generator needs to support many tools in Main/High Profile, and its challenge is high complexity. We use hardware sharing to handle double motion vectors, a coordinate mapping method to process direct modes, and an effective merge of the MBAFF-mode LUT and the non-MBAFF-mode LUT to reduce the complexity. For the interpolator, the 4-parallel separate 1-D architecture offers the most headroom for high throughput compared with other proposed architectures. Hence, our interpolator is suitable for B slices, and our restructured design can significantly reduce area