Multimedia Application - Multimedia Parallel Processing Technique Based on a Multi-Level Pipeline Model for the Multi-Core PlayStation 3 Platform

The data rates and compression ratios of multimedia processing improve as algorithms grow more complex. In multimedia decoding applications, high-definition (HD) resolution is a basic requirement in many markets, such as DTV, multimedia games, and multimedia playback on monitors. Consumers' demand for ever higher performance pushes engineers to design more powerful devices while keeping the price low.

High-end consumer electronics need to run versatile multimedia applications. For example, audio standards include AAC, MP3, and Dolby Digital (AC3), and video standards include M-JPEG, MPEG-1, MPEG-2, MPEG-4, H.263, and H.264. Implementing multimedia coding in software is therefore a cost-effective solution: processor-based architectures can use software patches to keep up with new multimedia applications. However, conventional single-core processor architectures cannot provide sufficient computing power for advanced real-time multimedia processing. The parallelism in multimedia applications should therefore be exploited by a high-performance processor-based system to meet real-time specifications. We take H.264, the latest multimedia standard available, as our example and target. The H.264 standard is introduced below.

H.264 Standard

H.264 / MPEG-4 Part 10 is the latest video compression standard developed by the ITU-T Video Coding Experts Group (VCEG) together with the ISO/IEC Moving Picture Experts Group (MPEG). The final drafting work on the first version of the standard was completed in May 2003.

H.264/AVC provides high compression efficiency at lower bit rates. Figure 1-1 shows the H.264 decoder block diagram. The decoder receives a compressed bitstream from the Network Abstraction Layer (NAL). The data are entropy decoded and reordered to produce a set of quantized coefficients X. These are rescaled and inverse transformed to give the residual Dn’. Using the header information decoded from the bitstream, the decoder then constructs a prediction macroblock P. P is added to Dn’ to produce the unfiltered macroblock uF’n, which is filtered to create the decoded macroblock F’n. The characteristics of each block are described below.

Figure 1-1 H.264 Decoder Block Diagram
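
The reconstruction step in this decoding flow (adding the prediction P to the residual Dn’ to obtain uF’n) can be sketched as follows. This is only a minimal illustration, assuming 8-bit samples stored in nested lists; the names `clip_pixel` and `reconstruct_block` are hypothetical, and the deblocking filter that turns uF’n into F’n is omitted.

```python
def clip_pixel(value, bit_depth=8):
    """Clamp a sample to the valid range for the given bit depth."""
    return max(0, min((1 << bit_depth) - 1, value))


def reconstruct_block(prediction, residual):
    """Add the decoded residual Dn' to the prediction P, giving the unfiltered uF'n."""
    return [[clip_pixel(p + d) for p, d in zip(p_row, d_row)]
            for p_row, d_row in zip(prediction, residual)]


# Tiny 2x2 example; a real macroblock is 16x16 luma samples.
P = [[100, 120], [250, 5]]
D = [[3, -20], [10, -9]]
unfiltered = reconstruct_block(P, D)
```

The clipping matters at the extremes: 250 + 10 saturates at 255 and 5 - 9 saturates at 0.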

• Entropy Decoding

To eliminate syntax redundancy, entropy coding is applied. Syntax elements above the slice layer are encoded as fixed- or variable-length codes. At the slice layer and below, the H.264 standard specifies two types of entropy coding: elements are coded using either Context-Adaptive Variable Length Coding (CAVLC) or Context-Adaptive Binary Arithmetic Coding (CABAC), depending on the entropy encoding mode.
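
As one illustration of a variable-length code, the sketch below decodes unsigned Exp-Golomb codewords, which H.264 uses for many header syntax elements. It is a simplified example, not a CAVLC or CABAC decoder: the bitstream is assumed to be a string of '0' and '1' characters, and the names `read_ue` and `decode_all` are hypothetical.

```python
def read_ue(bits, pos=0):
    """Decode one unsigned Exp-Golomb codeword starting at bits[pos].

    Returns (value, position of the next codeword).
    """
    leading_zeros = 0
    while bits[pos + leading_zeros] == "0":
        leading_zeros += 1
    start = pos + leading_zeros + 1          # skip the zeros and the marker '1'
    suffix = bits[start:start + leading_zeros]
    value = (1 << leading_zeros) - 1 + (int(suffix, 2) if suffix else 0)
    return value, start + leading_zeros


def decode_all(bits):
    """Decode back-to-back codewords until the string is exhausted."""
    values, pos = [], 0
    while pos < len(bits):
        value, pos = read_ue(bits, pos)
        values.append(value)
    return values


codes = decode_all("1" "010" "011" "00100" "00111")
```

Because each codeword's length is known only after it has been parsed, the start of the next codeword cannot be located in advance, so this kind of decoding is inherently sequential.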

• Quantization and Transformation

H.264/AVC uses three transforms depending on the type of residual data that is to be coded: a Hadamard transform for the 4x4 array of luminance DC coefficients in 16x16 intra-prediction macroblocks, a Hadamard transform for the 2x2 array of chrominance DC coefficients and a DCT-based transform for all other 4x4 blocks in the residual data.
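
The 4x4 Hadamard transform applied to the luminance DC coefficients can be sketched as below. This is a minimal illustration that leaves out the scaling and quantization stages; the helper names `matmul` and `hadamard_4x4` are hypothetical.

```python
# 4x4 Hadamard matrix (symmetric, with mutually orthogonal rows).
H4 = [[1, 1, 1, 1],
      [1, 1, -1, -1],
      [1, -1, -1, 1],
      [1, -1, 1, -1]]


def matmul(a, b):
    """Plain matrix product of two nested-list matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def hadamard_4x4(block):
    """Unscaled 2-D Hadamard transform: H4 * block * H4."""
    return matmul(matmul(H4, block), H4)


# A flat DC array concentrates all of its energy in the top-left coefficient.
flat = hadamard_4x4([[4] * 4 for _ in range(4)])

# Applying the unscaled transform twice returns 16 times the input.
sample = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
round_trip = hadamard_4x4(hadamard_4x4(sample))
```

Since H4 multiplied by itself gives four times the identity, the same routine serves as forward and inverse transform up to a scale factor, which is one reason it is cheap to implement with additions and subtractions only.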

Data within a macroblock are transmitted in the order shown in Figure 1-2. If the macroblock is coded in 16x16 intra-prediction mode, the block labeled ‘-1’, which contains the transformed DC coefficients of the 4x4 luminance blocks, is transmitted first. Next, the luminance residual blocks 0-15 are transmitted in the order shown in Figure 1-2 (the DC coefficients of a macroblock coded in 16x16 intra-prediction mode are not sent in these blocks). Blocks 16 and 17, each containing a 2x2 array of DC coefficients from the Cb and Cr chrominance components, are then sent. Finally, the chrominance residual blocks 18-25 (without DC coefficients) are sent.

Figure 1-2 Scanning Order of Residual Blocks within a Macroblock
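
The transmission order described above can be written down directly. The sketch below only lists block labels; the function name `residual_block_order` is hypothetical, and the spatial position of each label within the macroblock (given by Figure 1-2) is not modeled.

```python
def residual_block_order(intra_16x16):
    """Return the block labels of one macroblock in transmission order."""
    order = [-1] if intra_16x16 else []   # luma DC block, 16x16 intra mode only
    order += list(range(0, 16))           # 4x4 luma residual blocks
    order += [16, 17]                     # 2x2 chroma DC blocks (Cb, Cr)
    order += list(range(18, 26))          # 4x4 chroma residual blocks
    return order


intra_order = residual_block_order(True)
other_order = residual_block_order(False)
```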

• Intra Prediction

In intra mode, a prediction block is formed from previously encoded and reconstructed blocks and is subtracted from the current block prior to encoding. The prediction block is formed for each 4x4 block or for a 16x16 macroblock of luminance samples, and for each 8x8 block of chrominance samples.

There are a total of nine optional prediction modes for each 4x4 luminance block, as shown in Figure 1-3. The arrows indicate the direction of prediction in each mode. For modes 3-8, the predicted samples are formed from a weighted average of the prediction samples A-M. For example, if mode 4 is selected, the top-right sample of the 4x4 block is predicted by:

round(B/4+C/2+D/4).

Figure 1-3 4x4 Luminance Prediction Modes
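
A few of the 4x4 modes are simple enough to sketch in code. The example below assumes the conventional labeling in which A-D are the samples above the block and I-L are the samples to its left, and it covers only modes 0 (vertical), 1 (horizontal), and 2 (DC) plus the single mode-4 sample given by round(B/4+C/2+D/4). The function names are hypothetical, and the handling of unavailable neighbors is omitted.

```python
def predict_4x4(mode, above, left):
    """Form a 4x4 prediction block.

    above = [A, B, C, D] (row above the block), left = [I, J, K, L] (column to the left).
    """
    if mode == 0:                                   # vertical: copy the row above
        return [list(above) for _ in range(4)]
    if mode == 1:                                   # horizontal: copy the left column
        return [[left[row]] * 4 for row in range(4)]
    if mode == 2:                                   # DC: rounded mean of all eight
        dc = (sum(above) + sum(left) + 4) >> 3
        return [[dc] * 4 for _ in range(4)]
    raise ValueError("only modes 0-2 are sketched here")


def mode4_top_right(B, C, D):
    """round(B/4 + C/2 + D/4) in integer arithmetic."""
    return (B + 2 * C + D + 2) >> 2


above, left = [10, 20, 30, 40], [50, 60, 70, 80]
vertical = predict_4x4(0, above, left)
horizontal = predict_4x4(1, above, left)
dc_block = predict_4x4(2, above, left)
top_right = mode4_top_right(20, 30, 40)
```

Adding 2 before the right shift by 2 implements the rounding of the weighted average without floating-point arithmetic.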

As an alternative to the 4x4 luminance prediction modes described above, the entire 16x16 luminance component of a macroblock may be predicted in one operation. Four modes are available, as shown in Figure 1-4.

Figure 1-4 16x16 Luminance Prediction Modes

Each 8x8 chroma component of an intra coded macroblock is predicted from previously encoded chrominance samples above and/or to the left and both chrominance components always use the same prediction mode. The four prediction modes are very similar to the 16x16 luminance prediction modes, except the numbering of the modes is different. The modes are DC (mode 0), horizontal (mode 1), vertical (mode 2) and plane (mode 3).

• Inter Prediction

Inter prediction creates a prediction model from one or more previously encoded video frames. The model is formed by shifting samples in the reference frame(s) (motion-compensated prediction). H.264 uses block-based motion compensation similar to that of previous standards.

H.264 supports motion compensation block sizes ranging from 16x16 to 4x4 luminance samples, with many options between the two. The luminance component of each 16x16 macroblock may be split in four ways: 16x16, 16x8, 8x16, or 8x8. If the 8x8 mode is chosen, each of the four 8x8 macroblock partitions may be split in a further four ways: 8x8, 8x4, 4x8, or 4x4. These partitions and sub-partitions give rise to a large number of possible combinations within each macroblock. This method of partitioning macroblocks into motion-compensated sub-blocks of varying size is known as tree-structured motion compensation.
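
To give a feel for how many combinations this tree structure allows, the sketch below enumerates every partition choice for one macroblock and counts the motion vectors each choice needs. It assumes one motion vector per partition or sub-partition and ignores reference-frame selection; all names are hypothetical.

```python
from itertools import product

MB_PARTITIONS = ["16x16", "16x8", "8x16", "8x8"]
SUB_PARTITIONS = ["8x8", "8x4", "4x8", "4x4"]


def partition_trees():
    """List every macroblock partitioning as a tuple of mode names."""
    trees = []
    for mode in MB_PARTITIONS:
        if mode != "8x8":
            trees.append((mode,))
        else:
            # Each of the four 8x8 partitions picks its own sub-partition mode.
            for subs in product(SUB_PARTITIONS, repeat=4):
                trees.append((mode,) + subs)
    return trees


def motion_vector_count(tree):
    """Number of motion vectors implied by one partitioning."""
    per_mb = {"16x16": 1, "16x8": 2, "8x16": 2}
    per_sub = {"8x8": 1, "8x4": 2, "4x8": 2, "4x4": 4}
    if tree[0] != "8x8":
        return per_mb[tree[0]]
    return sum(per_sub[s] for s in tree[1:])


trees = partition_trees()
counts = [motion_vector_count(t) for t in trees]
```

Under these assumptions there are 3 + 4^4 = 259 partitionings, needing between 1 and 16 motion vectors per macroblock.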

A separate motion vector is required for each partition or sub-partition. Each motion vector must be coded and transmitted, and the choice of partition must also be encoded in the compressed bitstream. Encoding a motion vector for each partition can cost a significant number of bits. Because the motion vectors of neighboring partitions are highly correlated, each motion vector can be predicted from nearby ones: the motion vector prediction is generated from the motion vectors of the adjacent partitions, and only the difference from the prediction needs to be coded.
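
A common way to form this prediction is the component-wise median of three neighboring motion vectors (left, upper, and upper-right). The sketch below assumes that median predictor and leaves out the special cases for 16x8 and 8x16 partitions and for unavailable neighbors; all function names are hypothetical.

```python
def median3(a, b, c):
    """Median of three numbers."""
    return sorted((a, b, c))[1]


def predict_mv(mv_a, mv_b, mv_c):
    """Component-wise median of the left (A), upper (B), and upper-right (C) vectors."""
    return (median3(mv_a[0], mv_b[0], mv_c[0]),
            median3(mv_a[1], mv_b[1], mv_c[1]))


def mv_difference(mv, predicted):
    """Encoder side: only this difference is written to the bitstream."""
    return (mv[0] - predicted[0], mv[1] - predicted[1])


def reconstruct_mv(mvd, predicted):
    """Decoder side: add the transmitted difference back to the prediction."""
    return (mvd[0] + predicted[0], mvd[1] + predicted[1])


pred = predict_mv((4, -2), (6, 0), (5, 8))
mvd = mv_difference((7, 1), pred)
decoded = reconstruct_mv(mvd, pred)
```

The decoder can rebuild a motion vector only after its neighbors' vectors are known, which is one more ordering constraint between adjacent partitions.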

To increase the accuracy of motion compensation, H.264 supports quarter-pixel resolution for luma components and one-eighth-pixel resolution for chroma components. If a sub-pixel position gives a better prediction than the integer-pixel position, the sub-pixel position is chosen.

The half-pixel samples are obtained by applying a six-tap filter with weights (1/32, -5/32, 20/32, 20/32, -5/32, 1/32). For example, the half-pixel sample ‘b’ in Figure 1-5 is obtained from the six horizontal integer neighbors E, F, G, H, I, and J as b = round((E - 5F + 20G + 20H - 5I + J)/32).

Furthermore, the quarter-pixel samples can be calculated once all the half-pixel samples are available. They are produced by linear interpolation between two adjacent samples. For example, the quarter-pixel sample ‘a’ in Figure 1-5 is calculated as a = round((G + b)/2).

Figure 1-5 Inter Prediction of Luminance Integer-Pixel, Half-Pixel and Quarter-Pixel Positions
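
The two luminance interpolation steps can be sketched in integer arithmetic as follows. This is a minimal illustration for 8-bit samples and for the horizontal case only; the function names are hypothetical, and the clipping to the 0-255 range is an assumption about how out-of-range filter outputs are handled.

```python
def clip_pixel(value):
    """Clamp to the 8-bit sample range."""
    return max(0, min(255, value))


def half_pel(E, F, G, H, I, J):
    """Six-tap filter (1, -5, 20, 20, -5, 1)/32 with rounding, for sample 'b'."""
    return clip_pixel((E - 5 * F + 20 * G + 20 * H - 5 * I + J + 16) >> 5)


def quarter_pel(s0, s1):
    """Rounded average of two adjacent samples, for sample 'a'."""
    return (s0 + s1 + 1) >> 1


b_flat = half_pel(100, 100, 100, 100, 100, 100)   # flat area stays flat
b_ramp = half_pel(10, 20, 30, 40, 50, 60)         # G = 30, H = 40
a_ramp = quarter_pel(30, b_ramp)                  # between G and b
b_edge = half_pel(0, 0, 255, 255, 0, 0)           # overshoot is clipped
```

Each half-pixel sample reads six integer samples, so interpolating a block touches reference pixels beyond the block's own footprint.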

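As shown in Figure 1-6, the chrominance samples are calculated by linear interpolation of the four neighboring integer samples A, B, C, and D, where d_x and d_y are the horizontal and vertical offsets in units of one-eighth sample:

round([(8 - d_x)(8 - d_y)A + d_x(8 - d_y)B + (8 - d_x)d_y C + d_x d_y D] / 64)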

Figure 1-6 Inter Prediction of Chrominance samples
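
The chrominance interpolation is a bilinear weighting of four integer samples and can be sketched as below. The function name `chroma_sample` is hypothetical; dx and dy are assumed to be integer offsets from 0 to 7 in one-eighth-sample units, and adding 32 before the shift by 6 implements the rounding of the division by 64.

```python
def chroma_sample(A, B, C, D, dx, dy):
    """Bilinear interpolation between the four neighboring integer chroma samples."""
    total = ((8 - dx) * (8 - dy) * A + dx * (8 - dy) * B
             + (8 - dx) * dy * C + dx * dy * D)
    return (total + 32) >> 6


at_a = chroma_sample(10, 20, 30, 40, 0, 0)      # zero offset returns A itself
centre = chroma_sample(10, 20, 30, 40, 4, 4)    # equal weight on all four
halfway = chroma_sample(10, 20, 30, 40, 4, 0)   # midway between A and B
```

The four weights always sum to 64, so a flat area is reproduced exactly.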

• Deblocking Filter

One drawback of the block-based video compression described above is visible block boundaries, the so-called blocking effect: the lower the bit rate, the more obvious the edges. To reduce the blocking effect, a deblocking filter is applied after the inverse transform in both the encoder and the decoder. As shown in Figure 1-7, it is applied to the vertical and horizontal edges of the 4x4 blocks in a macroblock in the following order: the four vertical boundaries of luma (a, b, c, then d), the four horizontal boundaries of luma (e, f, g, then h), the two vertical boundaries of chroma (i, j), and the two horizontal boundaries of chroma (k, l).

Figure 1-7 Edge Filtering Order in a Macroblock

The filtering is applied adaptively according to the boundary strength and the gradient across the boundary. The boundary strength depends on the coding mode of the macroblock, the quantization parameter, the motion vectors, the frame or field coding decision, and the pixel values. With this filter, subjective quality is significantly improved. The filter also reduces the bit rate by 5%–10% compared with non-filtered video of the same objective quality.
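
The adaptive decision can be sketched roughly as follows. This is a simplified illustration rather than the complete rule set: the frame or field coding decision is ignored, the thresholds `alpha` and `beta` (which in practice are derived from the quantization parameter) are passed in as plain numbers, and the dictionary layout and function names are hypothetical.

```python
def boundary_strength(p, q, mb_edge):
    """Simplified boundary strength for the edge between 4x4 blocks p and q.

    Each block is a dict with keys 'intra', 'coded_coeffs', 'ref', and 'mv'
    (motion vector in quarter-sample units).
    """
    if p["intra"] or q["intra"]:
        return 4 if mb_edge else 3
    if p["coded_coeffs"] or q["coded_coeffs"]:
        return 2
    if p["ref"] != q["ref"]:
        return 1
    if any(abs(a - b) >= 4 for a, b in zip(p["mv"], q["mv"])):
        return 1
    return 0


def should_filter(bs, p1, p0, q0, q1, alpha, beta):
    """Filter only small steps; large gradients are treated as real image edges."""
    return (bs > 0 and abs(p0 - q0) < alpha
            and abs(p1 - p0) < beta and abs(q1 - q0) < beta)


inter = {"intra": False, "coded_coeffs": False, "ref": 0, "mv": (8, 0)}
moved = {"intra": False, "coded_coeffs": False, "ref": 0, "mv": (13, 0)}
intra = {"intra": True, "coded_coeffs": True, "ref": None, "mv": (0, 0)}

bs_intra_edge = boundary_strength(intra, inter, mb_edge=True)
bs_motion = boundary_strength(inter, moved, mb_edge=False)
bs_same = boundary_strength(inter, inter, mb_edge=False)
blocky = should_filter(2, 60, 62, 66, 67, alpha=10, beta=4)
real_edge = should_filter(2, 20, 22, 120, 121, alpha=10, beta=4)
```

In this sketch the p samples lie on one side of the edge and the q samples on the other, with p0 and q0 adjacent to the boundary.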

• Data Dependencies of H.264/AVC Decoder

The H.264/AVC decoder contains strong data dependencies, which make parallel programming difficult. In entropy decoding, the bitstream must be decoded in order. As shown in Figure 1-8, intra prediction of a macroblock requires the upper and left macroblocks to be decoded first. A 4x4 luma block requires the upper, left, and upper-right 4x4 blocks to be decoded in advance.

Figure 1-8 Dependencies in Intra Prediction Mode
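
The effect of these dependencies on parallelism can be illustrated with a small scheduling sketch. It assumes every block takes one time step and uses only the neighbor dependencies listed above; the function name `wavefront_schedule` and the grid size are hypothetical, and the dependency offsets must point to blocks that come earlier in raster order.

```python
def wavefront_schedule(width, height, deps):
    """Earliest time step at which each block of a width x height grid can start.

    deps is a list of (dx, dy) offsets to the blocks that must finish first.
    """
    step = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            ready = 0
            for dx, dy in deps:
                nx, ny = x + dx, y + dy
                if 0 <= nx < width and 0 <= ny < height:
                    ready = max(ready, step[ny][nx] + 1)
            step[y][x] = ready
    return step


LEFT_UP = [(-1, 0), (0, -1)]                  # left and upper neighbors
LEFT_UP_UPRIGHT = LEFT_UP + [(1, -1)]         # plus the upper-right neighbor

two_deps = wavefront_schedule(4, 3, LEFT_UP)
three_deps = wavefront_schedule(4, 3, LEFT_UP_UPRIGHT)
steps_two = max(max(row) for row in two_deps) + 1
steps_three = max(max(row) for row in three_deps) + 1
```

Blocks that share a step number are independent of one another and could be processed in parallel. With left and upper dependencies the step of block (x, y) is x + y; adding the upper-right dependency delays each row by one more step, giving x + 2y.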

In inter prediction mode, the data within the search range of the reference frame are needed for interpolation, as shown in Figure 1-9.

Figure 1-9 Dependencies in Inter Prediction Mode

In the deblocking filter, the four neighboring pixel rows of the upper macroblock and the four neighboring pixel columns of the left macroblock are needed, as shown in Figure 1-10.

Figure 1-10 Dependencies in Deblocking Filter
