Introduction - Source Probability Models and Rate-Distortion Optimization Methods for Scalable Wavelet Video Coding

Over the past few years, multimedia delivery has become an important class of wireless and wired Internet applications, for example, mobile video and digital TV broadcasting. To overcome the constraints on transmission bandwidth and receiver capability, the scalable coding technique was developed and adopted by recent international video standards. There are two major approaches to scalable video coding: the DCT-based and the wavelet-based coding schemes. These two coding schemes share many similar coding concepts, especially in removing temporal redundancy. The Scalable Video Coding (SVC) extension of H.264/AVC is a representative scheme of the DCT-based approach and was accepted as an ITU/MPEG standard in 2007 [1]. On the other hand, the wavelet-based coding scheme is a relatively new structure and showed its potential and advantages [2] during the MPEG competition process for standardization.

The discrete wavelet transform (DWT) has been successfully applied to still image compression. By exploiting the inter-subband or intra-subband correlation, the DWT-transformed image signal can be efficiently compressed by a context-based entropy coder, such as EZW [3], SPIHT [4], and EBCOT [5]. Unlike DCT-based JPEG image coding, the multiresolution property of the wavelet transform provides a natural way to produce scalable bitstreams. It enables the spatial and SNR scalability features in the well-known JPEG2000 image coding standard [6]. In addition to the spatial decomposition, the DWT can also be applied along the temporal axis to decompose video frames into temporal subband signals. Therefore, it provides temporal scalability for video. Over the past fifteen years, temporal wavelet decomposition has been refined by adopting the motion compensated temporal filtering (MCTF) technique. These schemes were proposed and improved by Ohm [7], Hsiang and Woods [8], Secker and Taubman [9], and Xu et al. [10].

MCTF can efficiently decompose video frames along the motion trajectories. After MCTF and spatial 2-D DWT, the original video frames are transformed into spatio-temporal subband signals and compressed by a context-based entropy coder [9], [11]. This interframe wavelet video coding scheme can achieve temporal, spatial, and SNR scalability simultaneously. Depending on the processing order in the spatio-temporal domain, scalable wavelet coding methods can be classified into "t+2D" and "2D+t" structures [12]. In this study, we focus on the t+2D structure.
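As a rough illustration of temporal subband decomposition, the sketch below applies one level of an orthonormal Haar filter along the time axis. It is a simplification under stated assumptions: motion is taken to be zero, so this is plain temporal filtering rather than MCTF, and the function names `haar_temporal_split` and `haar_temporal_merge` are hypothetical helpers, not part of any codec described here.

```python
import math


def haar_temporal_split(frames):
    """One level of temporal Haar analysis on an even-length list of frames.

    Each frame is a flat list of pixel values. Zero motion is assumed,
    so frames are paired pixel by pixel (no motion compensation).
    """
    if len(frames) % 2 != 0:
        raise ValueError("need an even number of frames")
    s = 1.0 / math.sqrt(2.0)
    low, high = [], []
    for a, b in zip(frames[0::2], frames[1::2]):
        low.append([(x + y) * s for x, y in zip(a, b)])   # temporal low-pass band
        high.append([(y - x) * s for x, y in zip(a, b)])  # temporal high-pass band
    return low, high


def haar_temporal_merge(low, high):
    """Invert haar_temporal_split and return the original frame sequence."""
    s = 1.0 / math.sqrt(2.0)
    frames = []
    for l, h in zip(low, high):
        frames.append([(x - y) * s for x, y in zip(l, h)])  # even-indexed frame
        frames.append([(x + y) * s for x, y in zip(l, h)])  # odd-indexed frame
    return frames
```

Decoding only the low-pass band yields a sequence at half the frame rate, which is the sense in which temporal decomposition provides temporal scalability. An MCTF implementation would pair pixels along estimated motion trajectories instead of at identical positions.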

The rate-distortion analysis of a scalable interframe wavelet video coder is very different from that of a DCT-based coder owing to two issues: inter-scale coding and the open-loop coding structure. DCT-based video coders, such as MPEG-2 or H.264, use the hybrid coding technique; all the temporal and spatial prediction operations are basically block-based. Thus, it is quite straightforward to perform the rate-distortion analysis along the coding operation flow. On the other hand, in interframe wavelet coders, the temporal MCTF is performed block-wise, but the spatial entropy coding is performed on the subbands. This inconsistent data partition drastically increases the difficulty of rate-distortion analysis. Wang and Schaar proposed a solution in [13] to analyze the rate-distortion behavior across different coding scales for wavelet video coders. The second issue is that the DCT-based video coder has a closed-loop coding structure. The prediction errors within the loop can be controlled by adjusting coding parameters [14]; thus, the optimal rate-constrained motion compensation can be adaptively adjusted [15], [16]. In contrast, interframe wavelet coding has an open-loop prediction structure, and the quantization process is performed after all the encoding operations are completed. This open-loop scheme provides more flexibility in bitstream extraction and more robustness to transmission errors, but it has no feedback path to provide useful information for adjusting prediction parameters in the encoding process. Therefore, it is difficult to achieve the rate-distortion optimization target, especially when bits must be allocated between the motion and the texture data at multiple operation points at the same time. Generating an adequate amount of motion information and deciding the best prediction modes for MCTF thus become a challenging problem in scalable interframe wavelet video coding.

Our objective is to develop a rate-distortion optimization method to improve the coding performance of scalable wavelet video coding. To build an efficient rate-distortion model, we propose an accurate source model. Moreover, we suggest a piecewise linear method to estimate the shape parameter of the model. In addition, we derive an analytical model that describes the trade-off between the motion compensation bits and the residual texture coefficient bits. We then allocate bits to each category properly at different scalability dimensions. We first examine the rate-distortion effect due to the increase or decrease of motion information bits. Then we derive a quantitative expression to measure the motion prediction efficiency. Most significantly, we give a theoretical explanation of this metric from the entropy viewpoint. Based on this finding, a new cost function is proposed. By minimizing the proposed cost function, the best prediction mode is decided and the corresponding motion vectors are chosen for the MCTF operation. Compared with the mode decision procedure in the conventional scalable wavelet video coder, the proposed method shows a PSNR improvement for the combined SNR and temporal scalability cases.
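For orientation, mode decision by cost minimization can be sketched with the conventional Lagrangian cost J = D + λR. This is not the cost function proposed in this thesis, whose form is not given in this chapter; it is the familiar closed-loop criterion, shown only to make the motion-bits versus distortion trade-off concrete. The function `choose_mode`, the mode names, and all numbers below are hypothetical.

```python
def choose_mode(candidates, lam):
    """Return the candidate with the smallest Lagrangian cost J = D + lam * R.

    candidates: iterable of (name, distortion, motion_bits) tuples.
    lam: Lagrange multiplier trading distortion against motion rate.
    """
    return min(candidates, key=lambda c: c[1] + lam * c[2])


# Hypothetical figures for one block: finer partitions lower the
# prediction distortion but cost more motion bits.
modes = [
    ("16x16", 900.0, 12),
    ("8x8", 700.0, 40),
    ("4x4", 650.0, 110),
]

best = choose_mode(modes, lam=1.0)
```

With these made-up numbers, a small multiplier favors the finest partition and a large one the coarsest. In an open-loop coder the distortion at the eventual extraction point is not known while encoding, which is why a single fixed trade-off of this kind is hard to apply across multiple operation points.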

The proposed methods are also published in [38] and [39].

This thesis is organized as follows. Chapter 2 gives a brief review of interframe wavelet video and the rate-distortion mechanisms in video coding. In Chapter 3, the ρ-GGD source model is proposed to approximate the probability distribution of wavelet coefficients. In Chapter 4, we suggest the motion information gain (MIG) metric to measure the motion prediction efficiency. According to our source model, the MIG metric is further discussed from the entropy viewpoint. Extending the work in Chapter 3, the ρ-GGD source model is improved by an enhanced estimation method for the ρ value. The one-sided ρ-GGD is proposed for the texture residual signal in Chapter 5. In Chapter 6, the two concepts, MIG in Chapter 4 and one-sided ρ-GGD in Chapter 5, are integrated into a complete working algorithm. The major contributions of this thesis are listed as follows.

Contributions of this Study

(1) An accurate and efficient source model, ρ-GGD, is proposed to approximate the probability distribution of the wavelet coefficients.

(2) A quantitative metric, MIG, is proposed to measure the motion prediction efficiency of MCTF.

(3) Based on the MIG metric, a new rate-distortion cost function is proposed for mode decision. The parameters of the MIG cost function are empirically selected.

(4) To further improve the ρ-GGD model, the one-sided ρ-GGD model and a more reliable estimation method for ρ are proposed to approximate the probability distribution of the residual texture signal.

(5) Based on MIG and one-sided ρ-GGD, an integrated MIG mode decision algorithm is developed. The parameters of the cost function are first theoretically derived and then fine-tuned by experimental data.
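For reference, and assuming GGD here denotes the generalized Gaussian distribution commonly used to model wavelet coefficients, its standard density is given below. The ρ-GGD and one-sided ρ-GGD variants are defined in later chapters and are not reproduced here; the symbols α (scale), β (shape), and μ (location) are conventional notation introduced only for this illustration.

```latex
f(x;\mu,\alpha,\beta)
  = \frac{\beta}{2\alpha\,\Gamma(1/\beta)}
    \exp\!\left[-\left(\frac{|x-\mu|}{\alpha}\right)^{\beta}\right],
\qquad \alpha > 0,\ \beta > 0 .
```

Setting β = 2 gives a Gaussian density and β = 1 a Laplacian density, so the shape parameter controls how sharply the distribution peaks.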

Chapter 2 Scalable Wavelet Video
