
CHAPTER 3 MOTION COMPENSATION DESIGN FOR H.264/AVC MAIN/HIGH

3.4 Weighted Prediction

Weighted prediction is the final stage of motion compensation, placed after the interpolator. It scales the motion-compensated samples to improve video quality in H.264/AVC decoding. In this subsection, we propose a weighted predictor architecture that is collocated with the interpolator to eliminate the latency overhead.

Chen's [16] weighted prediction architecture has low complexity, but it has a long critical path and a large memory requirement (1.5 kb). Azevedo's [12] weighted predictor is implemented by direct mapping and requires an embedded memory to store the rounding coefficients. In contrast to the direct-mapping design, whose equation is listed in Eq. 3.3, we use the same predictor twice to generate the predicted value: first for the LIST_0 prediction and then for the LIST_1 prediction, as shown in Eq. 3.4. The rounding and offset terms can be moved forward and combined in the same stage, so the predictor can be further modified to shorten the critical path, as shown in Eq. 3.5. In Eq. 3.5, W0 denotes the weight factor, whose value depends on the weight flag in the bit-stream. When the weight flag is equal to 1, the weight factor shall be in the range of -128 to 127, inclusive. When the weight flag is equal to 0, the weight factor shall be in the range of 2^0 to 2^7, inclusive [1].

Consequently, by examining the two most significant bits of the weight factor, we can use an 8-bit multiplier and a shifter instead of a 9-bit multiplier. The predictor is shown in Fig 3.17.

[Figure 3.17 block labels: predPart, Weight factor[7:0], Weight factor[8:7], multiplier, <<7, <<, MUX, offset, Round, LogWD, adders, >>]

Fig 3.17 Weighted predictor design
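The reordering of the rounding and offset terms can be checked numerically. The following Python sketch is an illustration only, not the circuit of Fig 3.17: the function names are hypothetical, the direct form is assumed to be the single-list explicit weighted prediction of the standard [1], and treating a weight of 128 as a left shift by 7 is one assumed reading of how the 8-bit multiplier and shifter cover the 9-bit weight range.

```python
def clip1(x, bit_depth=8):
    """Clip a value to the valid sample range [0, 2^bit_depth - 1]."""
    return max(0, min((1 << bit_depth) - 1, x))


def weighted_pred_direct(pred, w, offset, log_wd):
    """Direct form: multiply, round, shift, then add the offset."""
    if log_wd >= 1:
        return clip1(((pred * w + (1 << (log_wd - 1))) >> log_wd) + offset)
    return clip1(pred * w + offset)


def weighted_pred_merged(pred, w, offset, log_wd):
    """Merged form: rounding and offset are folded into one addend
    that is applied before the final shift (assumed restructuring)."""
    rnd = (1 << (log_wd - 1)) if log_wd >= 1 else 0
    addend = rnd + (offset << log_wd)
    # w == 128 is the only value in [-128, 128] that does not fit a signed
    # 8-bit multiplier operand, so it is handled by a shift instead.
    product = (pred << 7) if w == 128 else pred * w
    return clip1((product + addend) >> log_wd)
```

Both forms give the same result because adding offset * 2^logWD before an arithmetic right shift by logWD is the same as adding offset after it.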

Moreover, when a B slice is involved, the predictor hardware is shared and operated twice, and a 4 x 4 storage array is required to hold the intermediate results. Fig 3.18 illustrates the complete weighted predictor design. Like the temporal direct mode in the motion vector generator, the weighted predictor has an implicit mode in which the weighting factors are calculated from the relative temporal positions of the LIST_0 and LIST_1 reference pictures. To reduce hardware cost, the weighting factors in implicit mode are derived from the temporal direct mode data-path. The divider dominates the area and computation time of the temporal direct mode design; because the dividend is a constant value, we can replace the divider with a look-up table (LUT). Table 3.2 lists the comparison of implementation results. The design in [12] is not included in the comparison because the related details are not available.
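To illustrate how a look-up table can stand in for the divider, the sketch below assumes the temporal scaling defined in the standard [1], where tx = (16384 + |td / 2|) / td and td is clipped to [-128, 127], so the quotient depends only on td. The names TX_LUT, dist_scale_factor and implicit_weights are hypothetical, and the special cases of the standard that involve long-term reference pictures are left out.

```python
def _trunc_div(a, b):
    """Integer division truncating toward zero (Python's // floors instead)."""
    q = abs(a) // abs(b)
    return q if (a >= 0) == (b > 0) else -q


# One entry per non-zero td in [-128, 127]. The table is built once,
# so no divider is needed when a macroblock is decoded.
TX_LUT = {td: _trunc_div(16384 + abs(td) // 2, td)
          for td in range(-128, 128) if td != 0}


def dist_scale_factor(tb, td):
    """Scale factor shared by temporal direct mode and implicit weighting."""
    tx = TX_LUT[td]
    return max(-1024, min(1023, (tb * tx + 32) >> 6))


def implicit_weights(tb, td):
    """Return (w0, w1) for implicit weighted prediction (log2 denominator 5)."""
    w1 = dist_scale_factor(tb, td) >> 2
    if w1 < -64 or w1 > 128:
        return 32, 32  # fall back to equal weights when out of range
    return 64 - w1, w1
```

For example, a current picture midway between its two references (tb = 1, td = 2) yields equal weights of 32 each, while tb = 1, td = 4 gives more weight to the nearer LIST_0 picture.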

[Figure 3.18 block labels: Predictor 1, Predictor 2, B-L0, B-L1, P slice, MUX, Average, Clip, 4x4 Buffer, Luma/Cb, Luma/Cr]

Fig 3.18 Entire weighted predictor architecture

                     ICASSP'06 [16]    Proposed
Multiplier (bits)    9                 8
Technology           0.18 um           0.90 um
Gate count           12,960            6,412
Working frequency    87 MHz            100 MHz

Table 3.2 Comparison of execution cycles for different architectures

3.5 Summary

In this chapter, a motion compensation engine for an H.264/AVC Main/High Profile decoder is presented. To share hardware across profiles, our MVG uses the same module and storage to handle P slices and B slices, in both MBAFF and non-MBAFF modes. Our restructured interpolator is more area-efficient than the traditional design and is suitable for high-throughput applications such as decoding video coded with B slices. In addition, the weighted predictor achieves area efficiency by sharing hardware with the temporal direct mode and by shortening the critical path. When the weighted predictor is collocated with the interpolator, it adds only one cycle of latency.

Chapter 4

Memory Bandwidth Reduction

[Figure 4.1 labels: 4x4 output pixels, 9x9 reference pixels, interpolation]

Fig 4.1 4 x 4 block window and the corresponding 9 x 9 interpolation window

For luma interpolation, half-position samples are interpolated with a 6-tap FIR filter and quarter-position samples with a bilinear filter, so the interpolator needs six reference pixels to produce one interpolated pixel. As Fig 4.1 shows, interpolating the fractional samples of one 4 x 4 block requires a 9 x 9 interpolation window. Chroma interpolation is similar in concept: interpolating the fractional samples of one 2 x 2 block requires a 3 x 3 interpolation window. When the frame size is large and the frame rate is high, interpolation places a heavy load on memory bandwidth. Moreover, Main/High Profile supports B slices, in which the reference frames increase from one direction to two, so Main/High Profile doubles the memory bandwidth requirement. In the worst case, supporting 1080 HD at 30 fps, the interpolator requires 398 MB/s of memory bandwidth in P slices and 796 MB/s in B slices. The heavy memory bandwidth load also means high power consumption for bus activity and data operations.

The rest of this chapter is organized as follows. Section 4.1 discusses our memory bandwidth reduction strategies. Section 4.2 analyzes the limit of bandwidth reduction. Finally, a summary is given in Section 4.3.

4.1 Reduction strategies of memory bandwidth

Memory bandwidth dominates the performance of the entire video decoder. Several methods have been proposed to reduce the required memory bandwidth, and they fall mainly into two directions: frame recompression and reduction of redundant pixel transmission. Fig 4.2 illustrates the concept of frame recompression. Frame data are compressed before being written to frame memory, and reference frame data are decompressed before being read into the video decoder. However, a frame recompression method must address several issues: the random access capability demanded by motion compensation, low complexity for the sake of area and power, and a minimal number of additional cycles for compression and decompression so that the real-time throughput requirement of the video decoder is met. We do not go into detail here because our system has two dedicated modules for this purpose: an embedded compressor between motion compensation and frame memory, and an embedded decompressor between frame memory and the de-blocking module.

[Figure 4.2 labels: Video Decoder, recompress, decompress, Global bus, Frame Memory]

Fig 4.2 Embedded compress/decompress method

The second solution, reducing the transmission of redundant pixels, can itself be divided into two approaches: reducing the number of data fetches and reusing data (pixels). The following subsections discuss these memory bandwidth reduction strategies in detail. Subsection 4.1.1 presents the first fetch-reduction strategy, and Subsection 4.1.2 presents the second. Subsection 4.1.3 presents the first data-reuse strategy, and Subsection 4.1.4 presents the second.

4.1.1 Exact Fetch Necessary Pixels

Fig 4.3 Fractional sample positions for quarter sample luma interpolation

Fig 4.3 illustrates the luma samples 'a' to 's' at fractional sample positions. The traditional method always fetches a 9 x 9 interpolation window when interpolating a fractional pixel. However, not every fractional sample position requires all of these pixels. For example, the sample at half-sample position b is derived from the nearest integer-position samples in the horizontal direction only. Similarly, the sample at half-sample position h is derived from the nearest integer-position samples in the vertical direction only. Fig 4.4 shows that interpolating the samples at positions a, b, and c needs only a 9 x 4 interpolation window, and Fig 4.5 shows that interpolating the samples at positions d, h, and n needs only a 4 x 9 interpolation window.

We can therefore use the motion vector value to fetch exactly the necessary pixels instead of the full 9 x 9 interpolation window. As with luma, the chroma interpolation window can be decided from the motion vector. Table 4.1 summarizes the luma interpolation windows, and Table 4.2 summarizes the chroma interpolation windows. This strategy is also used in other designs [10], [11], [14]. The bandwidth reduction results are shown later.

[Figure 4.4 labels: 9x4 reference pixels, interpolation, 4x4 output pixels]

Fig 4.4 Fractional samples that need only horizontal samples

[Figure 4.5 labels: 4x9 reference pixels, interpolation, 4x4 output pixels]

Fig 4.5 Fractional samples that need only vertical samples

Table 4.1 Summary of luma interpolation windows

Pixel position            Interpolation window size
G (integer)               4x4
a, b, c (horizontal)      9x4
d, h, n (vertical)        4x9
e, g, p, r                9x4+4x5
others                    9x9
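The window selection of Table 4.1 can be sketched as a small function of the motion vector. This is a minimal illustration, assuming motion vectors in quarter-sample units and the standard mapping of the fractional offsets to the labels of Fig 4.3; the name luma_window is hypothetical, and rectangles are given as (width, height), the convention of Figs 4.4 and 4.5.

```python
def luma_window(mv_x, mv_y, block=4, taps=6):
    """Return the (width, height) rectangles to fetch for one luma block."""
    x_frac, y_frac = mv_x & 3, mv_y & 3   # quarter-sample fractional parts
    wide = block + taps - 1               # 9 for a 4x4 block and a 6-tap filter
    if x_frac == 0 and y_frac == 0:
        return [(block, block)]                          # G: integer position
    if y_frac == 0:
        return [(wide, block)]                           # a, b, c: horizontal only
    if x_frac == 0:
        return [(block, wide)]                           # d, h, n: vertical only
    if x_frac % 2 == 1 and y_frac % 2 == 1:
        return [(wide, block), (block, wide - block)]    # e, g, p, r
    return [(wide, wide)]                                # f, i, j, k, q


def window_pixels(mv_x, mv_y):
    """Total number of reference pixels fetched for one 4x4 block."""
    return sum(w * h for w, h in luma_window(mv_x, mv_y))
```

A motion vector of (2, 0), position b, fetches 36 pixels instead of 81, and a diagonal quarter position such as (1, 3) fetches 56.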

Table 4.2 Summary of chroma interpolation windows

Pixel position    Interpolation window size
Integer           2x2
Horizontal        3x2
Vertical          2x3
Others            3x3

4.1.2 Pre-fetch Mechanism

The second fetch-reduction strategy is the pre-fetch mechanism. Frame memory is the largest storage in the entire video decoder, so it is located off-chip. Because the bus interface has a fixed width, each access may bring in pixels that are not needed for the current interpolation window. If we keep these extra pixels, they may be used by later windows, which further reduces the number of fetches. Fig 4.6 illustrates the mismatch between the interpolation window and the bus interface, together with the pre-fetch mechanism. This strategy is also used in another design [11].

[Figure 4.6 labels: bus interface is 32 bits = 4 pixels, 9x9 pixel window, pre-fetch pixels, memory boundary]

Fig 4.6 Pre-fetch mechanism
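The mismatch between window width and bus width can be made concrete with a short sketch. It assumes 8-bit pixels on the 32-bit bus of Fig 4.6 and word-aligned accesses; the name row_fetch is hypothetical and the sketch ignores frame boundaries.

```python
PIXELS_PER_WORD = 4  # 32-bit bus carrying 8-bit pixels


def row_fetch(x_start, width=9):
    """Return (words, spare) for one window row starting at column x_start.

    words: number of aligned bus words that cover the row.
    spare: pixels delivered by those words but not needed by this row,
           i.e. the candidates for pre-fetch storage.
    """
    first_word = x_start // PIXELS_PER_WORD
    last_word = (x_start + width - 1) // PIXELS_PER_WORD
    words = last_word - first_word + 1
    spare = words * PIXELS_PER_WORD - width
    return words, spare
```

A 9-pixel row always costs three words and delivers three spare pixels, while a 4-pixel row costs one word when aligned and two words, with four spare pixels, when it is not.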

4.1.3 Intra MB Pixel Reusing

[Figure 4.7 labels: block dimensions 4, 4 and 5; overlap regions A and B; interpolation]

Fig 4.7 4x4 block window and the corresponding 9x9 interpolation window and overlapped region for neighboring interpolation window

As with fetch reduction, pixel reuse can be separated into intra-MB overlap pixel reuse and inter-MB overlap pixel reuse. The idea is that if two horizontally neighboring 4 x 4 blocks have the same motion vector, the 5 x 9 overlap region between their interpolation windows can be reused. Similarly, if two vertically neighboring 4 x 4 blocks have the same motion vector, the 9 x 5 overlap region between their interpolation windows can be reused. Fig 4.7 shows four neighboring 4 x 4 blocks with the same motion vector and their corresponding 9 x 9 interpolation windows. There are two vertical 5 x 9 overlap regions, marked "A", and two horizontal 9 x 5 overlap regions, marked "B", that can be reused.
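The overlap sizes quoted above follow directly from the window and block dimensions. The sketch below is a geometric illustration only, assuming both blocks share the same motion vector and both need the full 9 x 9 window; the name window_overlap is hypothetical.

```python
def window_overlap(dx_blocks, dy_blocks, block=4, window=9):
    """Overlap (width, height) of the interpolation windows of two blocks
    that lie dx_blocks and dy_blocks apart, measured in 4x4 blocks."""
    w = max(0, window - abs(dx_blocks) * block)
    h = max(0, window - abs(dy_blocks) * block)
    return (w, h) if w and h else (0, 0)
```

Horizontal neighbors give window_overlap(1, 0) = (5, 9) and vertical neighbors give window_overlap(0, 1) = (9, 5), matching regions A and B of Fig 4.7.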

The first pixel-reuse strategy is intra-MB overlap pixel reuse, illustrated in Fig 4.8. Several methods have been proposed in [14-16]. In [14], Tsai proposed VIDZ to achieve horizontal and vertical data reuse. With the VIDZ flow, all vertically overlapped interpolation windows can be reused without additional storage. However, because VIDZ violates the inherent double-z-scan order, it cannot fit into a 4 x 4-block-level pipeline. Moreover, at the system level, VIDZ requires extra synchronization buffers because its scanning order differs from that of other modules (for example, the residual decoder), which must follow the scanning order of the standard [1].

[Figure 4.8 labels: intra-MB interpolation window overlap regions; 4x4 blocks numbered by row as 0 1 4 5 / 2 3 6 7 / 8 9 12 13 / 10 11 14 15]

Fig 4.8 Intra MB overlap pixels reusing

4.1.4 Inter MB Pixel Reusing

The second pixel-reuse strategy is inter-MB overlap pixel reuse. So far, the literature on neighbor-based pixel reuse has focused almost entirely on reusing pixels inside the same MB. However, the overlap regions between interpolation windows located in neighboring MBs can also be reused. Fig 4.9 illustrates the overlap region between interpolation windows of horizontally neighboring MBs. We choose to store only the horizontal MB overlap regions, because reusing the vertical MB overlap regions would require storing the overlap regions of an entire frame width of MBs while offering only a limited improvement. Section 4.2 presents the analysis.

[Figure 4.9 labels: 4x4 blocks 0, 1, 4, 5]

Fig 4.9 Inter MB overlap pixels reusing

The required content buffers are 5 x 9 pixels and 9 x 5 pixels for the horizontal and vertical overlap regions of neighboring interpolation windows, respectively. To minimize the content buffer size, we perform a lifetime analysis of the reference data, which shows that only three horizontal and three vertical blocks need to be saved in the worst case. Table 4.3 shows the lifetime analysis. The horizontal axis gives the 4 x 4 partition order, the vertical axis gives the storage used, and each field gives the partition whose horizontal or vertical overlap region is stored. For example, at partition 1, the horizontal overlap region of partition 1 is stored in H0 and the vertical overlap region of partition 1 is stored in V0. The content buffers can be implemented in local registers or SRAM. However, SRAM needs several cycles to finish a content-swap operation, so we use local registers to minimize the content-swap latency.

Table 4.3 Storage requirement and lifetime analysis

4 x 4 Storage

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 0 1 2 3 …

H0 1 1 3 3 5 15 …

H1 7 1 1 3 …

H2 9 9 11 11 13 …

V0 0 1 2 7 12 13 0 1 2 …

V1 3 8 9 3 …

V2 4 5 6 11 …

Time

4.2 Limit of Reduced Memory Bandwidth

Our memory bandwidth reduction can be classified into fetch reduction and pixel reuse. For fetch reduction, there is a limit on the achievable memory bandwidth reduction, reached in the ideal condition in which all pixels are located at integer positions. Table 4.4 lists all pixel positions and their reduction percentages. The reduction percentage compares the original 9 x 9 interpolation window with the exact interpolation window that fetches only the required pixels.

Table 4.4 shows that even when all pixels are located at the integer position G, the limit of memory bandwidth reduction is about 80%. However, it is impossible for all pixels to be located at integer positions in a real sequence.

Table 4.4 Summary of luma interpolation windows and reduction percent

Pixel position      Interpolation window size    Reduction percent
G                   4x4                          80.25%
a, b, c             9x4                          55.56%
d, h, n             4x9                          55.56%
e, g, p, r          9x4+4x5                      30.86%
f, i, j, k, q       9x9                          0%
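The percentages in Table 4.4 are the fraction of the 81-pixel window that is not fetched. A minimal sketch of that arithmetic follows; the dictionary and function names are hypothetical.

```python
FULL_WINDOW = 9 * 9  # pixels in the traditional 9x9 interpolation window

# Pixels actually required per position group, from the window sizes above.
WINDOW_PIXELS = {
    "G": 4 * 4,
    "a, b, c": 9 * 4,
    "d, h, n": 4 * 9,
    "e, g, p, r": 9 * 4 + 4 * 5,
    "f, i, j, k, q": 9 * 9,
}


def reduction_percent(pixels, full=FULL_WINDOW):
    """Bandwidth saved, in percent, relative to always fetching 9x9."""
    return round(100 * (1 - pixels / full), 2)
```

For instance, reduction_percent(WINDOW_PIXELS["G"]) evaluates 1 - 16/81 and gives 80.25.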

In terms of pixel reuse, Fig 4.10 illustrates the case in which all partitions can reuse all overlap regions. Table 4.5 summarizes the reduction percentage for different overlap regions. Even if all partitions can reuse both their horizontal and vertical overlap regions in the ideal condition, the limit of bandwidth reduction is about 80%. However, because of raster scanning, reusing the overlap region of the previous upper MB requires each MB's overlap region to be kept until all following MBs across the frame width have been processed; only then can it be reused and discarded. In other words, reusing upper-MB overlap pixels requires storing the MB overlap regions of an entire row of the frame. This storage depends on the frame width and is often very large; for example, it needs 21.6 KB for 1080 HD. Such storage is too large for a gain of only about 6% in memory bandwidth reduction. Hence, we choose intra-MB reuse together with left-MB reuse. In the ideal condition, this achieves up to 74% bandwidth saving, which is close to the ideal limit without a large overhead.

[Figure 4.10 labels: current MB with 4x4 blocks 0 to 15, previous upper MB overlap region, previous left MB overlap region]

Fig 4.10 All overlap regions, including those with the previous upper MB and left MB

Table 4.5 Summary of reduction percent in different overlap region

Overlap region                      Reduction percent
(all) Intra MB                      65.97%
Intra MB + left MB                  74.07%
Intra MB + left MB + upper MB       80.25%
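The figures in Table 4.5 can be reproduced by counting the distinct luma pixels one macroblock has to fetch. The sketch assumes the ideal reuse case: all sixteen 4 x 4 blocks, and the neighboring MBs, share one motion vector that needs the full 9 x 9 window. The function name and parameters are hypothetical.

```python
def mb_reduction(reuse_left=False, reuse_upper=False, mb=16, block=4, window=9):
    """Percent of luma bandwidth saved for one MB under ideal pixel reuse."""
    margin = window - block                           # 5 extra rows or columns
    baseline = (mb // block) ** 2 * window * window   # 16 windows of 81 pixels
    # With intra-MB reuse the 16 windows merge into one (mb + margin)^2 area;
    # reusing a neighboring MB removes the margin on that side.
    width = mb if reuse_left else mb + margin
    height = mb if reuse_upper else mb + margin
    return round(100 * (1 - width * height / baseline), 2)
```

The three rows of the table correspond to mb_reduction(), mb_reduction(reuse_left=True) and mb_reduction(reuse_left=True, reuse_upper=True); the last step adds only about six percentage points, which is the gain weighed against the row-wide storage.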

Fetch reduction and data reuse cannot both reach their ideal conditions at the same time, because an integer-position pixel needs no other pixels to be interpolated. In other words, the reference pixels are simply bypassed, so there are no overlap pixels to reuse. Fig 4.11 shows that when two neighboring 4 x 4 blocks have the same integer motion vector, there is no overlap region between the two interpolation windows for data reuse.

[Figure 4.11 labels: block 0, block 1, 4x4 output pixels, 4x4 reference pixels, interpolation]

Fig 4.11 No overlap region can be reused

4.3 Summary

In this chapter, two directions were adopted to reduce the memory bandwidth requirement, and four strategies were presented to reduce memory bandwidth efficiently. Finally, the limit of memory bandwidth reduction was analyzed. The simulation results in Chapter 5 show that our strategies are effective, because the achieved reduction is close to this limit.

Chapter 5

Experiment Result

5.1 System Specification

Table 5.1 Video decoder specification in our design

H.264/AVC decoder
  I, P, B slice
  Variable block size: 16x16, 16x8, 8x16, 8x8, 8x4, 4x8, 4x4
  Single reference frame (each direction)
  Search range: [-128, +127.75]
  Fractional motion resolution: quarter for luma, 1/8 for chroma
  Frame/Field coding
Scalable High Profile (future)
Decoding capability:
  H.264/AVC: 1080 HD, 30fps
  SVC: 720 HD to 1080 HD, 30fps (future)
Working Frequency:
  H.264/AVC: 100 MHz
  SVC: 150 MHz (future)
External Memory and Bus

Table 5.1 lists the specification of our H.264 video decoder, and Fig 5.1 shows the whole H.264/AVC video decoder. It includes an embedded compressor and an embedded decompressor to further reduce the memory bandwidth requirement. Fig 5.2 shows the simulation results obtained by applying our memory bandwidth reduction strategies [17]. Memory bandwidth is reduced by 71% to 80%, which is very close to the limit from the analysis in Section 4.2. Fig 5.3 shows the comparison with related work [14]. Looking at the extreme cases, the memory bandwidth reduction is very close to the limit for the Akiyo sequence, while the difference in memory bandwidth reduction is largest for the Stefan sequence. To analyze these two sequences further, Fig 5.4 shows the ratio of pixel positions in each; the fractional sample positions for luma interpolation are shown in Fig 4.3. In the Akiyo sequence, the ideal condition (integer pixels) accounts for up to 90%, so the memory bandwidth reduction is very close to the limit. In the Stefan sequence, the pixel positions are nearly uniformly distributed, so the ideal condition occurs less often. That is, as the ratio of fractional positions increases, our design improves bandwidth reduction significantly compared with other works (by up to 11%).

[Figure 5.1 labels: Deblocking Filter, AHB Master/Slave Interface & SVC Arbiter, BUS, Data Fetch, Data Fetch & Operation]

Fig 5.1 Motion compensation engine for H.264 video decoder

Fig 5.2 Simulation results of bandwidth reduction strategies

Fig 5.3 Comparison with related works

Fig 5.4 Ratio of pixels position in Akiyo and Stefan sequence

The above discussion depends on the characteristics of individual sequences. Fig 5.5 and Fig 5.6 [8] show the proportion of integer and fractional motion vectors for luma and chroma at different bit rates for foreman-QCIF. In high bit-rate coding (128 kbps), fractional motion vectors account for about 80%, whereas at a low bit rate (32 kbps) they account for only about 40%. A higher bit rate thus brings a higher proportion of fractional MVs, which gives better quality but takes more execution time to read pixel data from frame memory than integer motion vectors do. This gap may become more obvious when SDRAM is used as frame memory. In other words, our proposal is better suited to high bit rates than previous works because it achieves a higher reduction of memory bandwidth.

[Figure 5.5: luma integer/fractional motion vector proportion (foreman-QCIF); x axis: bit rate (kbps), 32 to 128 in steps of 16; y axis: proportion; series: integer, fraction]

Fig 5.5 Luma integer/fractional motion vector proportion for H.264/AVC

[Figure 5.6: chroma integer/fractional motion vector proportion (foreman-QCIF); x axis: bit rate (kbps), 32 to 128 in steps of 16; y axis: proportion; series: integer, fraction]

Fig 5.6 Chroma integer/fractional motion vector proportion for H.264/AVC

5.2 Comparison with Related Works

Table 5.2 lists the comparison with related works on motion compensation. We focus only on memory bandwidth reduction and interpolator design, because memory bandwidth is the bottleneck of motion compensation and the interpolator is its key module, and because each related work supports a different specification. Our memory bandwidth optimization is better than that of previous works, although our storage is not the smallest; the storage size is the result of a trade-off that yields better performance. As for the interpolator, [10] and [11] share hardware and operate it twice to achieve area efficiency. Although such hardware sharing is suitable for Baseline Profile, its low throughput cannot meet real-time decoding in Main/High Profile. Our interpolator has a gate count very close to those of [10] and [11] while providing enough throughput for Main/High Profile.

Table 5.2 H.264 decoder comparison with related work

ISCAS

Interpolator 20,686 15,000 13,027 21,506 11,823 13,201

total 43k 61k 32k 47k N/A 68k

Chapter 6

Conclusion and Future Work

6.1 Conclusion

The motion compensation engine consists of three parts: the motion vector generator, the interpolator, and the weighted predictor. First, the motion vector generator needs to support many tools in Main/High Profile, and its main challenge is high complexity. We use hardware sharing to handle double motion vectors, a coordinate mapping method to process direct modes, and an effective merging of the MBAFF-mode and non-MBAFF-mode LUTs to reduce the complexity. For the interpolator, the 4-parallel separate 1-D architecture offers the most headroom for high throughput compared with other proposed architectures. Hence, our interpolator is suitable for B slices, and our restructured design can significantly reduce area

