Thesis Organization - A Lossless Embedded Compression Method Integrated with an H.264/AVC HDTV Decoder

CHAPTER 1 INTRODUCTION

1.3 Thesis Organization

This thesis is organized as follows. Chapter 2 reviews prior works. Chapter 3 proposes the lossless embedded compression algorithm integrated with the H.264/AVC decoder. Chapter 4 presents the architecture of the proposed algorithm integrated with the H.264/AVC HDTV decoder. Chapter 5 shows performance evaluation and implementation results, together with a comparison with related works. Finally, Chapter 6 summarizes the contributions of this thesis and discusses issues that merit further research.

Chapter 2 Reviews of Prior Works

In this chapter, the literature is discussed in three parts. The first part reviews existing lossy embedded compression algorithms. The second part introduces existing lossless embedded compression algorithms. Finally, the third part introduces the multi-mode embedded compression algorithm.

2.1 Lossy Embedded Compression Algorithm

A popular technique for lossy embedded compression is a transform-based approach in which a frame is decomposed into small blocks that are transformed into the frequency domain by a simple transform such as the DCT, the Hadamard transform, or one of its variations [16]. The frequency-domain coefficients are then compressed by quantization followed by variable-length coding such as Golomb-Rice coding. This approach, in general, requires a large amount of computation to limit the quality degradation caused by compression. Another approach is downsampling-based compression [17], which requires a relatively small amount of computation, but the quality may be degraded by the loss of edge patterns in the course of downsampling for compression and upsampling for decompression. A spatial-domain compression based on DPCM is proposed in [18], and an adaptive vector quantization scheme is presented in [19].

Lossy embedded compression with a fixed compression ratio [10]-[20] can guarantee the size of the compressed data of each block. Thus it is able to reduce not only external memory access but also the required external memory size. However, lossy embedded compression is not suitable for applications that require high video quality, because it leads to quality degradation through error propagation (i.e., the drift effect). It is therefore more suitable for video conferencing or video on mobile hand-held systems, where quality is much less important, or for Scalable Video Coding (SVC) systems with a motion-compensated temporal filtering scheme.

2.2 Lossless Embedded Compression Algorithm

Lossless embedded compression [9] guarantees no quality loss of video data, and hence no drift effect exists in the video system. However, the compressed size of each block is not known in advance. Therefore lossless embedded compression cannot reduce the memory size, but it can still reduce the access power of the external memory and the bandwidth requirement of the system bus.

2.3 Multi-Mode Embedded Compression Algorithm

A recent work proposed a multi-mode compression method that adopts the Set Partitioning in Hierarchical Trees (SPIHT) algorithm [21] to support both lossy and lossless compression. Because the SPIHT algorithm readily provides lossy/lossless compression, a fixed compression ratio, and rate and quality control, it has been adopted for the purpose of frame recompression.

Although the SPIHT algorithm has many good properties that make it suitable for an embedded compression codec engine, it has two main disadvantages in VLSI design. The first is the large buffer between the DWT and SPIHT. The DWT is a word-level arithmetic process, while the SPIHT engine encodes and decodes coefficients bitplane by bitplane. Moreover, unlike EBCOT in JPEG 2000, SPIHT requires image-level access: when coding one bitplane, the entire current bitplane must be available. This mismatch of the data flow between the DWT and the SPIHT coding engine makes it necessary to buffer the DWT coefficients of the entire image. It also makes the latency between the first input and the first output at least as long as the DWT of the entire image.

The second disadvantage of the SPIHT algorithm is the large buffer inside the SPIHT engine. In a straightforward implementation of SPIHT, 6L² × log2(L) bits (where L is the image width) are needed for the LIS, LIP, and LSP buffers. This buffer is even larger than the image itself. A buffer of more than a quarter of the image size is needed inside the SPIHT engine, which is also too large for a low-cost design.
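As a rough illustration of the scale involved, the sketch below evaluates the 6L² × log2(L) expression for a square image. The width L = 512 and the 8 bits per pixel are assumptions chosen only for illustration; they are not values taken from the cited work, and the function names are hypothetical.

```python
import math


def spiht_list_buffer_bits(width: int) -> int:
    """Bits for the LIS, LIP and LSP buffers: 6 * L^2 * log2(L)."""
    return int(6 * width * width * math.log2(width))


def image_bits(width: int, bits_per_pixel: int = 8) -> int:
    """Bits of a square width x width image (8 bits per pixel is assumed)."""
    return width * width * bits_per_pixel


L = 512  # hypothetical image width
ratio = spiht_list_buffer_bits(L) / image_bits(L)
print(spiht_list_buffer_bits(L), image_bits(L), ratio)
```

Under these assumptions the list buffers are several times larger than the image, which is consistent with the statement that the buffer exceeds the image itself.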

Therefore, although the researchers presented an efficient architecture, it still cannot reach high-definition real-time compression and decompression due to the extensive processing cycles.

2.4 Summary

The discussion above reviews and classifies the wide variety of existing algorithms. Most of them are stand-alone methods: they are performed independently of a video compression standard and therefore do not take advantage of the information obtained during the processing of that standard.

A real-time H.264/AVC HDTV video decoder requires high video quality; hence no quality degradation due to embedded compression is acceptable. Therefore a lossless embedded compression codec engine is adopted for our proposed method integrated with the H.264/AVC HDTV decoder. We also make use of helpful information from H.264/AVC to achieve high compression efficiency, low complexity, and a high-throughput architecture that reduces the bandwidth requirement of the system bus.

Chapter 3 Lossless Embedded Compression Algorithm Integrated with H.264/AVC Decoder

This chapter presents the proposed lossless embedded compression algorithm, designed to meet the requirements of integration with the H.264/AVC decoder. The methods reviewed in Chapter 2 are performed independently of the video coding standard and therefore do not take advantage of the information obtained during the processing of the coding standard.

A recent work proposes a compression algorithm that makes use of the information from the H.264 intra prediction results [20]. Our proposed method is based on this idea and further improves the compression efficiency to suit our lossless compression requirement.

The flowchart of the proposed algorithm is shown in FIG. 2. In the H.264/AVC decoder, the intra prediction result is the best mode among 9 possible prediction modes for every 4x4 block. Since the selected mode gives information about the characteristics of the 4x4 block, the proposed algorithm uses this information to select three scan modes for the 4x4 block and performs DPCM along each of these scan orders. The code length predictor selects the scan mode with the shortest code length among the three, and its prediction errors are then compressed by Golomb-Rice coding. If the code length predictor finds that the total code length of the 4x4 block exceeds the 128-bit limit, the 4x4 pixels are transferred directly to the system bus. As a result, the proposed algorithm achieves an average compression ratio of more than 2 without quality degradation.

FIG. 2. The flowchart of the proposed lossless embedded compression algorithm

3.1 Characterize the Functionalities of Proposed Algorithm

In this section, we discuss the functionality of each block of the diagram shown in FIG. 2. They are characterized as follows:

3.1.1 Scan Modes Decision

For DPCM to compress efficiently, the prediction errors between successive data must be small so that the data can be represented by a small number of bits. The magnitude of the prediction errors depends on the image content of a 4x4 block as well as on the scan order. For example, if a certain 4x4 block has vertical stripes, a scan order along the vertical direction is more efficient than one along the horizontal direction. Therefore, it is important to select the scan direction that suits the given image data.

For the 4x4 intra prediction, there are nine different intra prediction modes that can be chosen, with conceptual prediction directions as illustrated in FIG. 3 (Mode 2, not shown in the figure, is the "DC" averaging mode). The work in [20] presented eight different scan modes, shown in FIG. 4, in which the arrowed lines show the scan order. These eight modes are similar to the intra 4x4 prediction modes. Note that H.264/AVC 4x4 intra prediction has nine different modes, and scan_mode_2 (the DC mode) is excluded from FIG. 4 because the DC mode does not give much information for scan order selection. The eight modes cover various image types for the DPCM scan. For example, scan_mode_0 is suitable for an image with vertical stripes, while scan_mode_1 is suitable for horizontal stripes. An image with diagonal stripes may be suitable for one of the other modes.

FIG. 3. Nine prediction modes for the intra 4 x 4 prediction in the H.264 standard

FIG. 4. Eight kinds of scan modes

According to the aforementioned method, scan_mode_0, for instance, should produce the shortest code length when the intra prediction result is Mode 0 (vertical). The experimental results are shown in TABLE 2. For each intra prediction mode of a 4x4 block, the table gives the probability that each scan mode produces the shortest code length. Both scan_mode_0 and scan_mode_1, in general, produce efficient compression results, especially when the intra prediction mode is Mode 5, Mode 6, Mode 7, or Mode 8. Moreover, scan_mode_5, scan_mode_6, scan_mode_7, and scan_mode_8 contribute little to the lossless compression algorithm.

Therefore, to reduce hardware complexity and power consumption while enhancing compression efficiency, the 4x4 intra prediction result is used to select three scan modes from the four modes scan_mode_0, scan_mode_1, scan_mode_3, and scan_mode_4, as shown in TABLE 3. Both scan_mode_0 and scan_mode_1 are always selected as the first and second modes. If the intra prediction result is Mode 1, 2, 3, 7, or 8, scan_mode_3 is selected as the third mode; otherwise (Mode 0, 4, 5, or 6), scan_mode_4 is selected as the third mode. According to TABLE 2, this selection contains the scan mode with the shortest code length with a probability of more than 70% for every intra prediction result except Mode 2, for which it is about 67%.

TABLE 2:

The probability of the shortest code length for each intra prediction mode (unit: %)

         scan 0  scan 1  scan 3  scan 4  scan 5  scan 6  scan 7  scan 8
Mode 0    65.41    7.42    4.73    5.58    7.24    2.44    5.22    1.96
Mode 1    17.09   51.48    6.69    5.18    4.41    6.45    2.92    5.78
Mode 2    36.17   18.04   12.46    9.03    8.13    5.44    6.10    4.63
Mode 3    25.97   18.15   30.94    4.39    4.71    3.03    6.57    6.24
Mode 4    32.65   13.28    4.32   28.40   10.58    5.91    2.82    2.05
Mode 5    54.68    4.85    2.23   19.79   13.66    2.07    1.75    0.98
Mode 6    18.48   35.73    5.06   19.22    5.95    9.43    2.43    3.70
Mode 7    46.86    9.35   19.84    3.09    4.82    2.14   10.59    3.30
Mode 8    25.46   30.01   18.74    3.90    4.36    4.08    5.34    8.11

TABLE 3:

Scan mode decision

Intra Prediction Result   0      1      2      3      4      5      6      7      8
Scan Mode Decision        0,1,4  0,1,3  0,1,3  0,1,3  0,1,4  0,1,4  0,1,4  0,1,3  0,1,3
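The scan mode decision of TABLE 3 amounts to a small lookup table, and its effect can be checked against the probabilities in TABLE 2. The following is a minimal sketch under that reading; the dictionary and function names are hypothetical, and the numbers are copied from the two tables.

```python
# Columns of TABLE 2, in order.
SCAN_MODES = (0, 1, 3, 4, 5, 6, 7, 8)

# TABLE 2: probability (%) that each scan mode gives the shortest code length.
TABLE_2 = {
    0: (65.41, 7.42, 4.73, 5.58, 7.24, 2.44, 5.22, 1.96),
    1: (17.09, 51.48, 6.69, 5.18, 4.41, 6.45, 2.92, 5.78),
    2: (36.17, 18.04, 12.46, 9.03, 8.13, 5.44, 6.10, 4.63),
    3: (25.97, 18.15, 30.94, 4.39, 4.71, 3.03, 6.57, 6.24),
    4: (32.65, 13.28, 4.32, 28.40, 10.58, 5.91, 2.82, 2.05),
    5: (54.68, 4.85, 2.23, 19.79, 13.66, 2.07, 1.75, 0.98),
    6: (18.48, 35.73, 5.06, 19.22, 5.95, 9.43, 2.43, 3.70),
    7: (46.86, 9.35, 19.84, 3.09, 4.82, 2.14, 10.59, 3.30),
    8: (25.46, 30.01, 18.74, 3.90, 4.36, 4.08, 5.34, 8.11),
}

# TABLE 3: the three candidate scan modes for each intra prediction result.
SCAN_DECISION = {
    0: (0, 1, 4), 1: (0, 1, 3), 2: (0, 1, 3),
    3: (0, 1, 3), 4: (0, 1, 4), 5: (0, 1, 4),
    6: (0, 1, 4), 7: (0, 1, 3), 8: (0, 1, 3),
}


def candidate_scan_modes(intra_mode: int) -> tuple:
    """Return the three scan modes tried for a given intra 4x4 mode."""
    return SCAN_DECISION[intra_mode]


def coverage(intra_mode: int) -> float:
    """Probability (%) that the best scan mode is among the three candidates."""
    row = dict(zip(SCAN_MODES, TABLE_2[intra_mode]))
    return sum(row[s] for s in candidate_scan_modes(intra_mode))


for mode in range(9):
    print(mode, candidate_scan_modes(mode), round(coverage(mode), 2))
```

Summing the three selected columns per row shows how much of the probability mass the reduced candidate set retains for each intra prediction result.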

3.1.2 Pixel-wise DPCM

A simple and well-known method for spatial prediction is to predict the present pixel value from the previous values and then encode the prediction errors with an entropy coder. This method is called DPCM. In general, the best predictors are those formed from the neighboring pixels. To achieve coding efficiency for lossless compression, the selected scan orders are passed to the next step, which performs three DPCM operations, one along each selected scan order.

The flowchart of the conventional DPCM is shown in FIG. 5. Its disadvantage in hardware implementation is that each prediction error takes one clock cycle; hence we propose a pixel-by-pixel (pixel-wise) DPCM that is more flexible in hardware design, as illustrated in FIG. 6. The prediction errors of the proposed pixel-wise DPCM are the same as those of the conventional DPCM. In a high-frequency design, the conventional DPCM architecture might become the critical path of the circuit. The proposed pixel-wise DPCM avoids this by increasing the parallelism of the processing elements, which suits high-frequency designs, without a major increase in computational complexity relative to the conventional DPCM process.

FIG. 5. The flowchart of the conventional DPCM for spatial prediction

FIG. 6. The flowchart of the proposed pixel-wise DPCM

FIG. 7. An example of pixel-wise DPCM along scan mode 1

The following is an example of the pixel-wise DPCM. Consider the 4x4 block shown in FIG. 7 and assume that the scan mode is 1. The first_pixel and the prediction errors diff1, diff2, ..., diff15 of the pixels P0, P1, ..., P15 are calculated by pixel-wise DPCM, each prediction error being the difference between successive pixels along the scan order.

The pixel-wise DPCM always requires each decoded pixel to be available before the next pixel value can be predicted and reconstructed. For example, to reconstruct pixel P3, pixel P2 must be reconstructed first, which in turn means that pixels P0 and P1 have already been reconstructed. The 4x4 block can therefore be reconstructed from the first_pixel and the prediction errors when the scan mode is 1.

This reconstruction can also be expressed in matrix form, which provides a high-throughput architecture in hardware design.
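The DPCM computation and its reconstruction can be sketched as follows. This is a minimal illustration that assumes the pixels P0..P15 are already listed in scan order, so that each pixel is predicted from its predecessor; the function names and the sample block are hypothetical. The last function writes the reconstruction as a product with a lower-triangular matrix of ones, which is one plausible reading of the matrix form mentioned in the text: every output row is independent of the others and could be evaluated in parallel.

```python
from itertools import accumulate


def dpcm_encode(pixels):
    """Return (first_pixel, diffs) for pixels given in scan order."""
    first_pixel = pixels[0]
    diffs = [cur - prev for prev, cur in zip(pixels, pixels[1:])]
    return first_pixel, diffs


def dpcm_decode_sequential(first_pixel, diffs):
    """Rebuild pixels one by one; each step needs the previous pixel."""
    pixels = [first_pixel]
    for d in diffs:
        pixels.append(pixels[-1] + d)
    return pixels


def dpcm_decode_matrix(first_pixel, diffs):
    """Rebuild pixels as L @ v, with L lower-triangular ones and
    v = [first_pixel, diff1, ..., diff15]; rows are independent."""
    vec = [first_pixel] + list(diffs)
    return [sum(vec[j] for j in range(i + 1)) for i in range(len(vec))]


# Hypothetical 4x4 block, already flattened along the scan order.
block = [100, 102, 101, 105, 105, 104, 108, 110,
         109, 109, 111, 115, 114, 113, 112, 112]

first_pixel, diffs = dpcm_encode(block)
print(first_pixel, diffs)
print(dpcm_decode_sequential(first_pixel, diffs) == block)
print(dpcm_decode_matrix(first_pixel, diffs) == block)
print(list(accumulate([first_pixel] + diffs)) == block)  # prefix sum, same result
```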

The first_pixel is the entry pixel of a 4x4 block along the scan order. When the scan mode is 1, it is P0, as shown in FIG. 7. The first_pixel should itself be compressed to achieve coding efficiency for lossless compression. In our proposed lossless compression algorithm, the pixel-wise DPCM therefore involves a neighboring pixel that is close to the first pixel and produces the prediction error between the first pixel and this neighboring pixel, as shown in FIG. 8.

FIG. 8. The allocation of first pixels and neighboring pixels in a macroblock

H.264/AVC has a block-based coding structure. Each picture is segmented into macroblocks, each of which consists of an array of 16x16 luma samples and two associated arrays of 8x8 chroma samples. Each array is further decomposed into 4x4 blocks, giving the 24 blocks numbered 0-23 in FIG. 8. In the H.264/AVC decoder, block 0, which contains 16 pixels, is reconstructed first for a macroblock. Next, the other blocks 1-23 are reconstructed in the order illustrated in FIG. 8.

To simplify the hardware design, the first pixels and the neighboring pixels are located at fixed positions. The first_pixel is placed at the upper-left corner of a 4x4 block, and the neighboring pixel is chosen very close to the first pixel, as shown in FIG. 8. According to the experimental results, involving the neighboring pixel in the pixel-wise DPCM improves the compression ratio by approximately 0.1 with no major increase in hardware complexity or power consumption.

3.1.3 Golomb-Rice Coding and Segment Packing

Before discussing code length prediction in Section 3.1.4, we first discuss Golomb-Rice coding [22]-[23]. Although arithmetic coding is the best-known lossless coding method, it requires expensive iterative coding and decoding procedures and consumes considerable computing power.

The DPCM outputs, the prediction errors diff_n, lie in [-255, 255]. If they were compressed by Huffman coding, the Huffman code could result in long codewords and computationally intensive coding and decoding. Therefore we adopt Golomb-Rice coding to translate the prediction errors into codewords for a low-complexity design.

A Golomb code with parameter m encodes a non-negative integer value by encoding value mod m in binary and value div m in unary. When m = 2^k, the encoding procedure has a very simple realization and has been referred to as Rice coding in the literature; hence we refer to these codes as Golomb-Rice codes in the following. Golomb-Rice coding accepts only a non-negative number as its input, whereas a DPCM result can be negative. Therefore, a DPCM result is converted into a non-negative number as follows:

$$\mathit{value} = \begin{cases} 2\,\mathit{diff}, & \mathit{diff} \ge 0 \\ -2\,\mathit{diff} - 1, & \mathit{diff} < 0 \end{cases} \tag{1}$$

where diff represents a DPCM result and value represents the input to the Golomb-Rice coding.

The conversion back from value to diff is simple because the LSB of value indicates whether diff is negative. The mapping interleaves negative and positive prediction residuals in the sequence 0, -1, 1, -2, 2, ... If the residuals follow a Laplace distribution centered at zero, then the distribution of value will be close to (but not exactly) geometric, and value can then be encoded using an appropriate Golomb-Rice code.
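The mapping between signed prediction errors and non-negative inputs can be sketched in a few lines. The function names are hypothetical; the mapping itself follows the interleaving 0, -1, 1, -2, 2, ... stated in the text, with the LSB of value marking a negative diff.

```python
def to_unsigned(diff: int) -> int:
    """Map a signed DPCM result to a non-negative Golomb-Rice input."""
    return 2 * diff if diff >= 0 else -2 * diff - 1


def to_signed(value: int) -> int:
    """Inverse mapping: an odd value (LSB = 1) means diff was negative."""
    return -((value + 1) >> 1) if value & 1 else value >> 1


# Values 0..4 map back to the interleaved sequence 0, -1, 1, -2, 2.
print([to_signed(v) for v in range(5)])

# The mapping is a bijection on the full DPCM range [-255, 255].
print(all(to_signed(to_unsigned(d)) == d for d in range(-255, 256)))
```

With diff in [-255, 255], the mapped value lies in [0, 510].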

The key factor behind the effective use of Golomb-Rice codes is the estimation of the coding parameter k to be used for a given sample or block of samples. If k is too small, the code length becomes too large for a large value. On the other hand, if k is too large, the code length is too large for a small value. Weinberger's algorithm [24] exhaustively tries codes with each parameter on a block of samples and selects the one that results in the shortest code length. The coding parameter k is computed as follows:

$$k = \min\{\, k' \mid 2^{k'} \cdot N_{\mathit{value}} \ge A_{\mathit{value}} \,\} \tag{3}$$

The parameter k is estimated by maintaining, for each context, the count N_value of the number of times the context has been encountered so far and A_value, the accumulated sum of the magnitudes of the prediction errors within this context. This strategy is an approximation to optimal parameter selection for Golomb-Rice coding.

Based on the aforementioned formulation, N_value and A_value in our proposed algorithm are written with A_value as a summation (∑), and the coding parameter k is computed from them as in (6). In a hardware implementation, this computation can also be expressed as a set of priority conditions.

If the coding parameter k is more than six, the total code length after Golomb-Rice coding will exceed the 128-bit limit. Therefore k = 6 is the upper bound imposed by the code length limit.

With a precise estimate of the coding parameter k, the total number of bits generated by Golomb-Rice coding for all prediction errors can be reduced.
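A minimal sketch of the parameter estimation in (3), capped at k = 6, is given below. It assumes that N_value is the number of prediction errors in a 4x4 block (15) and that A_value is the sum of their mapped non-negative values; these two choices are assumptions made for illustration, since the thesis's own expressions for N_value and A_value are not reproduced here. The function name is hypothetical. The loop tests the conditions in increasing order of k, which mirrors a priority-encoder style of evaluation.

```python
K_MAX = 6  # upper bound on k stated in the text


def estimate_k(n_count: int, a_sum: int, k_max: int = K_MAX) -> int:
    """Smallest k with 2^k * n_count >= a_sum, clamped to k_max."""
    k = 0
    while (n_count << k) < a_sum and k < k_max:
        k += 1
    return k


# Hypothetical block: 15 mapped prediction errors.
values = [2, 1, 0, 4, 3, 2, 0, 0, 1, 2, 0, 0, 2, 1, 0]
print(estimate_k(len(values), sum(values)))
```

For the sample values above the sum is 18, so the first condition (15 ≥ 18) fails and the second (30 ≥ 18) holds.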

The parameters and Golomb-Rice codes, or the 16 pixels of a block, are packed into a segment of at most 128 bits. FIG. 9 shows (a) the compressed segment format and (b) the uncompressed segment format. To differentiate between compressed and uncompressed segments, one bit is used as a tag and stored in the leftmost position. In a compressed segment, the scan mode (one of four) is coded with 2 bits, followed by the 3-bit k. The first pixel requires 8 bits and is stored next to k, and the remaining bits store the 15 Golomb-Rice codes for the remaining pixels.

(a) tag (1 bit) | scan_mode (2 bits) | k (3 bits) | first_pixel (8 bits) | prediction errors after Golomb-Rice coding; total less than 128 bits

(b) tag (1 bit) | 16 pixels of a block; total 128 bits

FIG. 9. (a) A compressed segment format and (b) an uncompressed segment format
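The compressed segment layout of FIG. 9 (a) can be sketched as a simple bit-string concatenation. Two details below are assumptions for illustration only: the 2-bit codes assigned to scan modes 0, 1, 3, and 4 are not given in the text, and the tag value 1 for a compressed segment follows the example in Section 3.2. The function name is hypothetical.

```python
# Assumed 2-bit codes for the four scan modes (not specified in the text).
SCAN_MODE_CODE = {0: 0b00, 1: 0b01, 3: 0b10, 4: 0b11}


def pack_compressed_segment(scan_mode, k, first_pixel, codewords):
    """Return a compressed segment as a bit string, leftmost field first."""
    if not 0 <= k <= 6:
        raise ValueError("k must be in 0..6")
    if not 0 <= first_pixel <= 255:
        raise ValueError("first_pixel must be an unsigned 8-bit value")
    header = (
        "1"                                          # tag: compressed segment
        + format(SCAN_MODE_CODE[scan_mode], "02b")   # scan_mode, 2 bits
        + format(k, "03b")                           # k, 3 bits
        + format(first_pixel, "08b")                 # first_pixel, 8 bits
    )
    return header + "".join(codewords)


# Three hypothetical Golomb-Rice codewords (k = 1) instead of the full 15.
segment = pack_compressed_segment(1, 1, 128, ["10", "010", "11"])
print(segment, len(segment))
```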

3.1.4 Code Length Prediction

Section 3.1.3 discussed the Golomb-Rice coding algorithm. An example of Golomb-Rice codewords with k = 3 is shown in TABLE 4. Each value divided by 2^k produces a quotient Q and a remainder R. The quotient Q is transformed into a unary code and the remainder R into a fixed-length code.

TABLE 4:

An example of Golomb-Rice codeword with k = 3

value Q R codeword value Q R codeword

Hence the Golomb-Rice code is a variable-length code with a regular construction:

[Prefix][1][Suffix]

The Prefix consists of Q leading zeros and the Suffix is the k-bit remainder. Therefore the code length of one Golomb-Rice code and of all 15 prediction errors of a block are as follows:

$$\mathit{Length}_{\mathit{Golomb}} = Q + 1 + k$$

$$\mathit{Length}_{\mathit{prediction\_errors}} = \sum_{n=1}^{15} \left( Q_n + 1 + k \right) \tag{9}$$
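The [Prefix][1][Suffix] construction can be sketched as follows. The function names are hypothetical; the codeword consists of Q zeros, a single 1, and the k-bit remainder, so its length is Q + 1 + k.

```python
def golomb_rice_encode(value: int, k: int) -> str:
    """Return the codeword as a bit string: Q zeros, a '1', then k-bit R."""
    q = value >> k                 # quotient of value / 2^k
    r = value & ((1 << k) - 1)     # remainder of value / 2^k
    suffix = format(r, "b").zfill(k) if k > 0 else ""
    return "0" * q + "1" + suffix


def golomb_rice_length(value: int, k: int) -> int:
    """Code length without building the codeword: Q + 1 + k."""
    return (value >> k) + 1 + k


# Codewords for k = 3, printed as: value, Q, R, codeword.
for value in range(0, 16):
    print(value, value >> 3, value & 7, golomb_rice_encode(value, 3))
```

Because the length depends only on Q and k, the code length predictor can evaluate it without generating any codewords.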

The parameters for each 4x4 block are shown in TABLE 5. The fixed code length of all parameters is:

$$\mathit{Length}_{\mathit{parameters}} = 13 \tag{10}$$

TABLE 5:

The fixed code length of parameters for each 4x4 block

Parameters    Type                                                  No. of bits
scan_mode     scan_mode_0, scan_mode_1, scan_mode_3, scan_mode_4    2
k             0~6                                                   3
first_pixel   unsigned                                              8

The pixel-wise DPCM produces three sets of prediction errors, one for each of the three given scan modes. Code length prediction selects the set with the shortest code length, which is then compressed by Golomb-Rice coding. If the total code length of a 4x4 block exceeds the 128-bit limit, the 4x4 pixels are transferred directly to the system bus. According to (9) and (10), the total code length limitation condition can be formulated as follows:

$$\mathit{Length}_{\mathit{total}} = \mathit{Length}_{\mathit{parameters}} + \mathit{Length}_{\mathit{prediction\_errors}} \ge 128 \tag{11}$$
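The code length prediction step can be sketched as below. This is a minimal illustration rather than the thesis's implementation: it takes the three sets of prediction errors as given (the scan orders of FIG. 4 are not reproduced), and it assumes that k is estimated with N_value = 15 and A_value equal to the sum of the mapped values. All names and the sample data are hypothetical.

```python
LENGTH_PARAMETERS = 13   # scan_mode (2) + k (3) + first_pixel (8)
SEGMENT_LIMIT = 128      # bits
K_MAX = 6


def to_unsigned(diff):
    """Map a signed prediction error to a non-negative value."""
    return 2 * diff if diff >= 0 else -2 * diff - 1


def estimate_k(values):
    """Smallest k with 2^k * N >= A, clamped to K_MAX (assumed N and A)."""
    n_count, a_sum = len(values), sum(values)
    k = 0
    while (n_count << k) < a_sum and k < K_MAX:
        k += 1
    return k


def predicted_total_length(diffs):
    """Return (total bits, k) for one set of 15 prediction errors."""
    values = [to_unsigned(d) for d in diffs]
    k = estimate_k(values)
    length_errors = sum((v >> k) + 1 + k for v in values)
    return LENGTH_PARAMETERS + length_errors, k


def select_scan_mode(candidates):
    """candidates maps scan_mode -> diffs. Returns (scan_mode, k, total),
    or None when the block should be sent uncompressed."""
    best = None
    for scan_mode, diffs in candidates.items():
        total, k = predicted_total_length(diffs)
        if best is None or total < best[2]:
            best = (scan_mode, k, total)
    if best[2] >= SEGMENT_LIMIT:
        return None
    return best


candidates = {
    0: [3] * 15,
    1: [1, -1, 0, 2, -2, 1, 0, 0, -1, 1, 0, 0, 1, -1, 0],
    4: [40] * 15,
}
print(select_scan_mode(candidates))
```

Only lengths are computed at this stage; the Golomb-Rice codewords need to be generated only for the selected scan mode.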

3.2 An Example of the Proposed Algorithm

Consider the 4x4 block shown in FIG. 10 and assume that the scan mode is 1. The compression result is given in TABLE 6, in which a tag equal to 1 indicates that the total code length is less than 128 bits. The value chosen for k is 1. Note that k-value more than 6 requires 134 bits
