
Department of Electronics Engineering & Institute of Electronics

Doctoral Dissertation

可調視訊編碼之高等細緻可調性研究

Scalable Video Coding

Advanced Fine Granularity Scalability

Student: Wen-Hsiao Peng (彭文孝)


可調視訊編碼之高等細緻可調性研究

Scalable Video Coding – Advanced Fine Granularity Scalability

研究生:彭文孝 Student: Wen-Hsiao Peng

指導教授:蔣迪豪、李鎮宜 Advisors: Dr. Tihao Chiang & Dr. Chen-Yi Lee

國立交通大學 National Chiao Tung University

電子研究所 Institute of Electronics

博士論文 Doctoral Dissertation

A Thesis

Submitted to the Institute of Electronics
College of Electrical and Computer Engineering
National Chiao Tung University
in Partial Fulfillment of the Requirements
for the Degree of Doctor of Philosophy
in Electronics Engineering

December 2005

Hsinchu, Taiwan, Republic of China


Scalable Video Coding – Advanced Fine Granularity Scalability

Student: Wen-Hsiao Peng

Advisors: Prof. Tihao Chiang and Prof. Chen-Yi Lee

Institute of Electronics, National Chiao Tung University

ABSTRACT (IN CHINESE)

For video streaming/broadcasting in heterogeneous environments, MPEG-4 defines a fine granularity scalable video coding scheme. By truncating the bit-stream, MPEG-4 scalable video coding degrades the video quality gracefully. Although the current coding algorithm provides good scalability at fine granularity, it suffers from poor coding efficiency and poor subjective quality.

This dissertation provides an integrated approach, covering the predictor, the bit-plane coder, and the transmission order, to improve the current MPEG-4 fine granularity scalable video coding. Specifically, we propose an enhanced mode-adaptive fine granularity scalable video coding scheme and a context-adaptive bit-plane coder to achieve higher coding efficiency. Further, based on the context-adaptive bit-plane coding framework, we develop a stochastic bit reshuffling technique to improve the subjective quality.

For higher coding efficiency, our enhanced mode-adaptive scheme constructs a set of better predictors from both the base layer and the enhancement layer. To minimize possible drifting errors, we employ a redundant prediction loop at the encoder to simulate the drifting behavior at the decoder. Then, guided by a theoretical framework, we switch among different prediction modes so that the coding efficiency is maximized with minimized drifting errors. Compared with MPEG-4 fine granularity scalable video coding, our enhanced mode-adaptive scheme achieves a PSNR improvement of more than 1~1.5dB.

In addition to constructing better predictors, we also improve the coding efficiency by coding the enhancement layer with a context-adaptive, bit-by-bit approach. To fully exploit the existing correlations, the context models are designed using both the energy distribution within a transform block and the spatial correlations among adjacent transform blocks. Meanwhile, the correlations across bit-planes are exploited to reduce side information. Compared with the bit-plane coding in MPEG-4, our context-adaptive bit-plane coder further achieves a PSNR improvement of 0.5~1dB.

Operating at the bit level, the stochastic bit reshuffling technique improves the subjective quality by refining the base layer according to the video content. Specifically, we reshuffle the coefficient bits of the enhancement layer so that regions with more energy are assigned higher refinement priority. In particular, to avoid transmitting the actual coding order, the refinement priority is decided by a model derived from the maximum-likelihood principle. Compared with fixed scanning, our stochastic bit reshuffling provides better subjective quality while maintaining similar or even higher coding efficiency.

In conclusion, this dissertation demonstrates that the compression performance and subjective quality of the current MPEG-4 fine granularity scalable video coding can be significantly improved, and that the performance gap between fine granularity scalable coding and non-scalable coding can be narrowed. Moreover, the proposed techniques can be applied to future scalable video coding standards.


Scalable Video Coding –

Advanced Fine Granularity Scalability

Student: Wen-Hsiao Peng

Advisors: Dr. Tihao Chiang

& Dr. Chen-Yi Lee

Institute of Electronics

National Chiao Tung University

ABSTRACT

For video streaming/broadcasting in a heterogeneous environment, MPEG-4 defines the fine granularity scalability (FGS), which offers graceful degradation of visual quality through bit-stream truncation. While offering good scalability at fine granularity, the current approach suffers from poor coding efficiency and poor subjective quality.

This dissertation provides an integrated solution, spanning the predictor, the bit-plane coder, and the coding order, to improve MPEG-4 FGS. Specifically, we propose an enhanced mode-adaptive FGS (EMFGS) algorithm and a context-adaptive bit-plane coder (CABIC) to deliver higher coding efficiency. Further, based on the CABIC framework, we develop a stochastic bit reshuffling (SBR) technique to achieve better subjective quality.

For higher coding efficiency, our EMFGS constructs better enhancement-layer predictors from both the base layer and the enhancement layer. To minimize possible drifting errors, the EMFGS encoder employs a dummy prediction loop to simulate the drifting behavior in the decoder. Then, under the guidance of a theoretic framework, the prediction is switched among different predictors such that the coding efficiency is maximized with minimized drifting errors. Compared with MPEG-4 FGS, our EMFGS achieves a PSNR improvement of more than 1~1.5dB.

In addition to constructing better predictors, our CABIC also improves the coding efficiency by coding the enhancement layer in a context-adaptive, bit-by-bit manner. To fully utilize the existing correlations, the context models are designed based on both the energy distribution in a transform block and the spatial correlations in the adjacent blocks. Moreover, the context across bit-planes is exploited to save side information. Compared with the bit-plane coding in MPEG-4 FGS, our CABIC scheme further achieves a PSNR gain of 0.5~1.0dB.

Through bit-wise operation, the SBR improves the subjective quality by refining the base layer in a content-aware manner. Specifically, the coefficient bits at the enhancement layer are reshuffled such that the regions containing more energy are assigned higher priority for refinement. In particular, to avoid transmitting the exact coding order, a model-based approach derived from the maximum-likelihood principle is used to decide the coding priority. Compared with approaches using a deterministic coding order, our SBR provides better visual quality while maintaining similar or even higher coding efficiency.

In conclusion, this work shows that MPEG-4 FGS can be significantly improved in both coding efficiency and subjective quality. The performance gap between the non-scalable codec and the scalable one can be reduced. Moreover, the proposed schemes can be applied to the upcoming standard for scalable video coding.


Contents

Abstract in Chinese . . . i

Abstract . . . iii

Contents . . . v

List of Tables . . . ix

List of Figures . . . xi

List of Notations xvi

1 Introduction 1

1.1 Overview of Dissertation . . . 1

1.1.1 MPEG-4 Fine Granularity Scalability . . . 2

1.1.2 Enhanced Mode-Adaptive Fine Granularity Scalability . . . 4

1.1.3 Context-Adaptive Bit-Plane Coding . . . 6

1.1.4 Applications in Scalable Video Coding Standard . . . 7


2 MPEG-4 Fine Granularity Scalability 10

2.1 Introduction . . . 10

2.2 Structure of MPEG-4 Fine Granularity Scalability . . . 11

2.2.1 Encoder . . . 11

2.2.2 Decoder . . . 11

2.3 Embedded Bit-Plane Coding . . . 12

2.4 Subjective Quality Enhancement . . . 15

2.4.1 Frequency Weighting . . . 15

2.4.2 Selective Enhancement . . . 16

2.5 Error Resilience . . . 17

2.6 Summary . . . 18

3 Enhanced Mode-Adaptive Fine Granularity Scalability 19

3.1 Introduction . . . 19

3.2 Problem Formulation . . . 20

3.2.1 Predictor for Enhancement Layer . . . 21

3.2.2 Predictor Mismatch . . . 22

3.2.3 Drifting and Accumulation Errors . . . 23

3.2.4 Constraining Predictor Mismatch . . . 25

3.3 Mode-Adaptive Prediction . . . 26

3.3.1 Prediction Modes for Minimizing Residue . . . 26

3.3.2 Prediction Modes for Reducing Drifting Errors . . . 27

3.3.3 Selection of Prediction Mode . . . 31

3.4 Analysis of Prediction Mode . . . 33

3.4.1 Prediction Modes and Motion Characteristic . . . 33

3.4.2 Prediction Modes and Base Layer Quality . . . 34

3.4.3 Prediction Modes and Enhancement Layer Quality . . . 38

3.4.4 Selection of Bit-Plane . . . 38

3.4.5 Overhead for Mode-Adaptive Prediction . . . 39

3.5 Structure of Enhanced Mode-Adaptive Fine Granularity Scalability . . . 39

3.5.1 Encoder . . . 39

3.5.2 Decoder . . . 41


3.6.1 Residue of Enhancement Layer . . . 46

3.6.2 Mechanisms for Drifting Errors . . . 47

3.6.3 Decoder Complexity . . . 47

3.6.4 Rate-Distortion Performance . . . 48

3.7 Summary . . . 49

4 Context-Adaptive Bit-Plane Coding 55

4.1 Introduction . . . 55

4.2 Context-Adaptive Bit-Plane Coding . . . 57

4.2.1 Context-Adaptive Binary Arithmetic Coder . . . 58

4.2.2 Bit Classification and Bit-Plane Partition . . . 59

4.2.3 Design of Context Model . . . 61

4.2.4 Context Dilution Problem . . . 64

4.2.5 Coding Flow Using Raster and Zigzag Scanning . . . 66

4.3 Stochastic Bit Reshuffling . . . 67

4.4 Parameter Estimation . . . 68

4.4.1 Discrete Laplacian Parameter . . . 68

4.4.2 Estimation of ∆D . . . 72

4.4.3 Estimation of ∆R . . . 74

4.5 Dynamic Priority Management . . . 75

4.5.1 Constraints of Reshuffling Order . . . 75

4.5.2 Dynamic Priority Management . . . 76

4.6 Dynamic Memory Organization . . . 80

4.6.1 Memory Management for The List of Significant Bit . . . 81

4.6.2 Memory Management for The List of Refinement Bit . . . 83

4.7 Comparison of Bit-Plane Coding Schemes . . . 84

4.7.1 Rate-Distortion Performance . . . 84

4.7.2 Subjective Quality . . . 85

4.8 Error Resilience . . . 92

4.8.1 Flexible Slice Structure . . . 92

4.8.2 Assessment of Flexible Slice Structure . . . 93


5 Applications in Scalable Video Coding Standard 98

5.1 Introduction . . . 98

5.2 Framework of Scalable Video Coding Standard . . . 99

5.2.1 Temporal Scalability . . . 100

5.2.2 Spatial Scalability and Inter-Layer Prediction . . . 103

5.2.3 Fine Granular SNR Scalability . . . 104

5.2.4 Combined Scalability . . . 106

5.3 Application of Mode-Adaptive Prediction . . . 107

5.4 Application of Stochastic Bit Reshuffling . . . 108

5.4.1 Improvement of Rate-Distortion Performance . . . 108

5.4.2 Functionality of Region-of-Interest . . . 114

5.5 Summary . . . 123

6 Conclusion 124

6.1 Improvement of Coding Efficiency . . . 124

6.1.1 Enhanced Mode-Adaptive Fine Granularity Scalability . . . 125

6.1.2 Context-Adaptive Bit-Plane Coding . . . 125

6.2 Improvement of Subjective Quality . . . 126

6.3 Suggestions for Future Work . . . 127

6.3.1 EMFGS with Stack Structure . . . 127

6.3.2 Improvement of Laplacian Model . . . 131

6.3.3 Error Resilience . . . 131

6.3.4 Applications in Scalable Video Coding Standard . . . 131

Bibliography 133

Appendix 136

A Conditional Probability and Variance 137


List of Tables

2.1 Pseudo code of the bit-plane coding in MPEG-4 FGS . . . 14

2.2 List of (Run, EOP) symbols and sign bits for the example in Figure 2.4 . . . 15

3.1 Formulas of the proposed macroblock predictors . . . 28

3.2 Formula of the generalized Type BE predictor . . . 28

3.3 Formulas of the dummy predictors and dummy reference frames . . . 32

3.4 Best predictor in different scenarios . . . 38

3.5 Configurations of switch M1 for mode-adaptive prediction . . . 40

3.6 Summary of extra complexity in the EMFGS decoder . . . 45

3.7 Summary of extra complexity in the PFGS decoder . . . 45

3.8 Summary of extra complexity in the RFGS decoder . . . 46

3.9 Typical alpha values used in RFGS algorithm . . . 47

3.10 Testing conditions for comparing FGS algorithms . . . 49

4.1 Probability for the EOSP of a bit-plane being in Part I . . . 61

4.2 Context model of the significant bit . . . 62

4.3 Pseudo code of the proposed CABIC . . . 66

4.4 Pseudo code for coding the significant bits in a bit-plane . . . 67


4.6 Average execution time for the bit-plane encoding of an enhancement-layer frame on a P4 2.0GHz machine . . . 83

4.7 Testing conditions for comparing bit-plane coding schemes . . . 84

5.1 Inter-layer prediction modes for an inter macroblock . . . 104

5.2 Luminance priority table . . . 112

5.3 Assignment of Enable MB Coding flags . . . 117


List of Figures

1.1 Application framework for scalable video coding. . . 2

1.2 Prediction structure of MPEG-4 FGS. . . 3

1.3 Example of fine granular SNR scalability. . . 3

1.4 Comparison between simulcast and MPEG-4 FGS in terms of channel bandwidth and quality variation. . . 4

1.5 Prediction structure of enhanced mode-adaptive FGS. . . 5

1.6 Example of partial refinement due to the truncation of the enhancement layer. . 7

2.1 System block diagram of MPEG-4 FGS encoder. . . 12

2.2 System block diagram of MPEG-4 FGS decoder. . . 13

2.3 Example of bit-plane coding at the enhancement layer. . . 13

2.4 Example of bit-plane coding in a transform block. . . 15

2.5 Example of bit-plane coding with frequency weighting. . . 16

2.6 Example of bit-plane coding with selective enhancement. . . 17

3.1 Comparison of rate-distortion performance for non-scalable codec, MPEG-4 FGS, and advanced FGS algorithms. . . 21

3.2 An end-to-end transmission model for the analysis of drifting error in the enhanced mode-adaptive FGS algorithm. . . 24


3.3 A qualitative measure between prediction residue and mismatch error. . . 27

3.4 Predictor mismatch error versus frame index (i.e., time). . . 29

3.5 Analysis of drifting errors using Coastguard CIF sequence. The enhancement layer used for prediction = 512kbits/s and the decoded video is with the enhancement layer truncated at 64kbits/s. Note that all inter macroblocks are forced to be predicted in Type BE mode. Y-axis = PSNR with generalized Type BE mode - PSNR of MPEG-4 FGS. . . 29

3.6 Visual comparison of the 141st frame in Coastguard CIF sequence with the enhancement layer truncated at 64kbits/s. The enhancement layer used for prediction = 512kbits/s. Note that all inter macroblocks are forced to be predicted in Type BE mode. (a) MPEG-4 FGS (Alpha of Type BE = 0). (b) Alpha of Type BE = 1. (c) Alpha of Type BE = 0.9. (d) Alpha of Type BE = 0.75. . . 30

3.7 Distribution of prediction modes in the CIF sequences at 30Hz and the QCIF sequences at 15Hz. . . 34

3.8 Distribution of prediction modes in the CIF sequences at 10Hz and the QCIF sequences at 5Hz. . . 35

3.9 Variation of prediction mode distribution when the frame rate is lowered. For the CIF sequences, the frame rate is decreased from 30Hz to 10Hz. For the QCIF sequences, the frame rate is decreased from 15Hz to 5Hz. . . 35

3.10 Prediction mode distribution of Akiyo sequence in CIF resolution. (a) Base-layer Qp = 31 and frame rate = 30Hz. (b) Base-layer Qp = 31 and frame rate = 10Hz. (c) Base-layer Qp = 15 and frame rate = 30Hz. (d) Base-layer Qp = 15 and frame rate = 10Hz. . . 36

3.11 Prediction mode distribution of Coastguard sequence in CIF resolution. (a) Base-layer Qp = 31 and frame rate = 30Hz. (b) Base-layer Qp = 31 and frame rate = 10Hz. (c) Base-layer Qp = 15 and frame rate = 30Hz. (d) Base-layer Qp = 15 and frame rate = 10Hz. . . 37

3.12 System block diagram of the proposed EMFGS encoder. . . 40

3.13 System block diagram of the proposed EMFGS decoder. . . 42

3.14 System block diagram of PFGS decoder. [33] . . . 43


3.16 Luminance PSNR comparison of PFGS, RFGS, and MPEG-4 FGS using the sequences in CIF resolution and at 10 Hz. (a) Coastguard. (b) Foreman. (c) Table tennis. . . 50

3.17 Luminance PSNR comparison of PFGS, RFGS, and MPEG-4 FGS using the sequences in QCIF resolution and at 10 Hz. (a) Coastguard. (b) Foreman. (c) Table tennis. . . 51

3.18 Luminance PSNR comparison of PFGS, RFGS, and MPEG-4 FGS using the sequences in CIF resolution and at 30 Hz. (a) Coastguard. (b) Foreman. (c) Table tennis. . . 52

4.1 Terminologies of MSB bit-planes and MSB bit. . . 58

4.2 Coding flow of context-adaptive binary arithmetic coding. . . 59

4.3 Example of bit classification and bit-plane partition in a transform block. . . 60

4.4 Example of the context model for the significant bit. The transform block uses the 4x4 integer transform. . . 63

4.5 Luminance PSNR comparison of the proposed CABIC scheme with different transforms at the enhancement layer. (a) Foreman sequence in CIF resolution and at 10 Hz. (b) Mobile sequence in CIF resolution and at 10Hz. . . 65

4.6 Probability distributions of the 4x4 integer transform coefficients. The legend ZZn denotes the zigzag index of a coefficient and KLD stands for the Kullback-Leibler distance. (a) Actual probability distributions. (b) Estimated probability models. . . 71

4.7 Examples of ∆D estimation for the significant bit and the refinement bit. . . . 72

4.8 Example of dynamic bit reshuffling in a bit-plane. . . 77

4.9 Comparison of different coding orders on the number of non-zero significant bits among the same number of coded significant bits. . . 78

4.10 Example of dynamic memory organization for the list of significant bit. . . 81

4.11 Luminance PSNR comparison of different bit-plane coding schemes using the sequences in CIF resolution. (a) Foreman. (b) Mobile. (c) News. . . 86

4.12 Luminance PSNR comparison of different bit-plane coding schemes using the sequences in QCIF resolution. (a) Foreman. (b) Mobile. (c) News. . . 87


4.13 Subjective quality comparison of the 94th frame in Foreman CIF sequence with bit rate at 255 kbits/s. (a) Baseline (VLC-based bit-plane coding in MPEG-4 FGS). (b) Baseline with frequency weighting. (c) The proposed CABIC with frame raster and coefficient zigzag scanning. (d) The proposed CABIC with SBR. . . 88

4.14 Subjective quality comparison of the 117th frame in Mobile CIF sequence with bit rate at 420 kbits/s. (a) Baseline (VLC-based bit-plane coding in MPEG-4 FGS). (b) Baseline with frequency weighting. (c) The proposed CABIC with frame raster and coefficient zigzag scanning. (d) The proposed CABIC with SBR. . . 89

4.15 Comparison of the 94th enhancement-layer frame in Foreman CIF sequence with bit rate at 255 kbits/s. (a) Baseline (VLC-based bit-plane coding in MPEG-4 FGS). (b) Baseline with frequency weighting. (c) The proposed CABIC with frame raster and coefficient zigzag scanning. (d) The proposed CABIC with SBR. . . 90

4.16 Comparison of the 39th enhancement-layer frame in Mobile CIF sequence with bit rate at 420 kbits/s. (a) Baseline (VLC-based bit-plane coding in MPEG-4 FGS). (b) Baseline with frequency weighting. (c) The proposed CABIC with frame raster and coefficient zigzag scanning. (d) The proposed CABIC with SBR. . . 91

4.17 Comparison of different slice structures. (a) Fixed partition structure. (b) Flexible slice structure. . . 92

4.18 Flexible slice structure with fine granularity. . . 92

4.19 Error propagation in fixed partition structure and flexible slice structure. . . 93

4.20 Luminance PSNR comparison with periodic byte error. . . 94

4.21 Subjective quality comparison of the 62nd frame in Foreman CIF sequence. (a) Flexible slice structure. (b) Fixed partition structure. . . 94

4.22 Luminance PSNR comparison with periodic byte error. . . 95

4.23 Subjective quality comparison of the 62nd frame in Foreman CIF sequence. (a) Flexible slice structure. (b) Fixed partition structure. . . 95

5.1 Encoder block diagram of the scalable video coding standard. [25] . . . 99

5.2 Lifting scheme for the (5, 3) wavelet transform. (a) Decomposition. (b) Reconstruction. [25] . . . 100

5.3 MCTF structure for the (5, 3) wavelet. . . 102


5.5 Structure of inter-layer prediction. . . 103

5.6 Extension of prediction/partition information for the inter-layer prediction. . . 104

5.7 Illustration of cyclical block coding. . . 105

5.8 Example of combining spatial, temporal, and SNR scalability. . . 106

5.9 Proposed prediction structure for the anchor frames in the SVC standard. . . 108

5.10 Symbolic representation of cyclical block coding. . . 109

5.11 Symbolic representation of prioritized block coding. . . 109

5.12 Example for calculating the energy density in a transform block. . . 111

5.13 Coding flow of the prioritized block coding. . . 112

5.14 Example of the threshold determination process. . . 113

5.15 Prioritized block coding for the functionality of region-of-interest. . . 115

5.16 Prioritized block coding with the alternative assignment for Enable MB Coding flags. . . 119

5.17 Partition of region-of-interest for graceful selective enhancement. . . 120

5.18 Regional PSNR comparison. (a) Comparison of foreground region. (b) Com-parison of background region. . . 121

5.19 Subjective quality comparison between cyclical block coding and the prioritized block coding with layer remapping. . . 122

6.1 PSNR comparison of the scalable algorithms, including the proposed schemes and MPEG-4 FGS, and the non-scalable H.264. (a) Foreman sequence. (b) Mobile sequence. . . 126

6.2 Visual comparison of the proposed algorithms (including EMFGS, CABIC, and SBR) and MPEG-4 FGS at 255kbits/s. (a) The proposed schemes. (b) The MPEG-4 FGS. . . 127

6.3 Comparison of rate-distortion performance for non-scalable codec, the EMFGS with 1 enhancement-layer encoder, and the EMFGS with 2 enhancement-layer encoders. . . 128

6.4 Stack structure of EMFGS encoder. . . 129


List of Notations

MC_t⟨x⟩ : Operation of motion compensation at time instance t

IQ.Q⟨x⟩ : Operations of quantization and inverse quantization

Trun.⟨x⟩ = x̂ : Operation of truncation

‖x − y‖ : Sum of absolute differences between x and y

δ[n] = (n == 0) ? 1 : 0 : Discrete impulse function

μ[n] = (n ≥ 0) ? 1 : 0 : Discrete step function

Ẽ[·] : Estimation operator

Var[·] : Variance function

E[·] : Mean function

P[·] : Probability distribution function

Hb(·) : Binary entropy function

SignificantContextP(·) : Context probability model of a significant bit

IB(t) : The reconstructed base-layer frame

Ĩo(t) : The decoded frame at the decoder

IE(t) : The reconstructed enhancement-layer frame at the encoder

ĨE(t) : The reconstructed enhancement-layer frame at the decoder

PE(t) : The enhancement-layer predictor at the encoder

P̃E(t) : The enhancement-layer predictor at the decoder

(t) : The enhancement layer produced by the encoder

e(t) : The enhancement layer received by the decoder

b(t) : The enhancement layer used for prediction at the encoder

be(t) : The enhancement layer used for prediction at the decoder

d(t) : Transmission error

α : The fading factor for the generalized Type BE predictor

σn : Laplacian parameter for the n-th transform coefficient

∆D : Reduction of distortion contributed by a coefficient bit

∆R : Coding cost of a coefficient bit


© Copyright 2005 by Wen-Hsiao Peng

All Rights Reserved


CHAPTER 1

Introduction

1.1 Overview of Dissertation

Scalable video coding (SVC) has attracted wide attention with the rapid growth of multimedia applications over the Internet and wireless channels. In such applications, the video may be transmitted over error-prone channels with fluctuating bandwidth. Moreover, the clients, consisting of different devices, may have different processing power and spatial resolutions. To produce a single bit-stream that serves all these purposes, the idea of SVC is proposed. The following summarizes the major applications of SVC, and Figure 1.1 depicts a framework for them.

• Internet video streaming/broadcasting,

• Mobile streaming/broadcasting,

• Mobile interactive applications,

• Surveillance,

• Video archiving, and so on.

To serve video streaming in a heterogeneous environment, simulcast, which directly compresses the video into multiple bit-streams with distinct bit rates, is one of the most intuitive approaches. According to the available channel bandwidth, the transmission



Figure 1.1: Application framework for scalable video coding.

can be switched among the bit-streams that are coded at different bit rates. Particularly, to avoid drifting errors, a switching frame can be periodically coded as a transition frame from one bit-stream to another [12]. Since simulcast encodes the video into a limited number of bit-streams, it cannot provide graceful variation of quality while the channel bandwidth fluctuates. Furthermore, from the compression perspective, simulcast is not efficient because the existing correlations among the bit-streams are not fully utilized.

1.1.1 MPEG-4 Fine Granularity Scalability

To offer graceful variation of quality, the MPEG-4 streaming video profile [15] defines the fine granularity scalability (FGS), which provides a DCT-based quality (SNR) scalability using a layered approach. Specifically, the video is compressed into a base layer and an enhancement layer. The base layer offers a minimum guaranteed visual quality. Then the enhancement layer




Figure 1.2: Prediction structure of MPEG-4 FGS.


Figure 1.3: Example of fine granular SNR scalability.

refines the quality over that offered by the base layer. Currently, the base layer is coded by a non-scalable codec using conventional closed-loop prediction. On the other hand, the enhancement layer is coded by an embedded bit-plane coding with an open-loop prediction. Figure 1.2 shows the prediction scheme of MPEG-4 FGS [15].

Having the open-loop prediction and embedded bit-plane coding, the enhancement layer can be arbitrarily truncated and dropped to adapt to the channel bandwidth and processing power. At the decoder side, the video quality depends on how much of the enhancement layer is received and decoded. An example of such quality (SNR) scalability is presented in Figure 1.3. As shown, the base layer provides a rough representation; as more of the enhancement layer is received, the decoded quality is gradually improved.
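The truncation property above can be sketched in a few lines. This is a toy illustration, not the MPEG-4 bit-stream syntax: `to_bitplanes` and `reconstruct` are hypothetical helpers that only show why dropping trailing bit-planes degrades the reconstruction gracefully rather than catastrophically.

```python
def to_bitplanes(coeffs, n_planes):
    """Split absolute coefficient values into bit-planes, MSB first."""
    planes = []
    for p in range(n_planes - 1, -1, -1):
        planes.append([(abs(c) >> p) & 1 for c in coeffs])
    return planes

def reconstruct(planes, signs, n_planes):
    """Rebuild coefficients from however many planes were received."""
    mags = [0] * len(signs)
    for i, plane in enumerate(planes):
        shift = n_planes - 1 - i
        for k, bit in enumerate(plane):
            mags[k] |= bit << shift
    return [s * m for s, m in zip(signs, mags)]

coeffs = [13, -6, 3, 0]
signs = [1 if c >= 0 else -1 for c in coeffs]
planes = to_bitplanes(coeffs, 4)

full = reconstruct(planes, signs, 4)        # all planes: exact values
coarse = reconstruct(planes[:2], signs, 4)  # stream truncated after 2 planes
```

Truncating after two planes still yields a usable approximation ([12, -4, 0, 0] for the values above); each extra plane halves the quantization step.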



Figure 1.4: Comparison between simulcast and MPEG-4 FGS in terms of channel bandwidth and quality variation.

Figure 1.4 compares simulcast and MPEG-4 FGS in terms of quality variation versus channel bandwidth. With a limited number of bit-streams, the quality variation provided by simulcast is stepwise. Particularly, the number of quality levels is determined by the number of pre-encoded bit-streams. On the other hand, MPEG-4 FGS [15] provides a virtually unlimited number of quality levels through the truncation of the enhancement-layer bit-stream. As the channel bandwidth fluctuates, MPEG-4 FGS [15] can thus offer smooth variation of visual quality.

While offering good scalability at fine granularity, the compression efficiency of MPEG-4 FGS [15] is often much lower than that of a non-scalable codec. On average, at the same bit rate, a PSNR loss of 2~3dB or more is observed, as presented in Figure 1.4. The PSNR loss comes from the fact that the enhancement layer is simply predicted from the base layer. As shown in Figure 1.3, the base layer is mostly encoded at a very low bit rate, and the base-layer frames often have poor visual quality. Since a predictor of poor quality cannot effectively remove the redundancy, the coding efficiency is inferior.

1.1.2 Enhanced Mode-Adaptive Fine Granularity Scalability

To improve the coding efficiency, we try to find a better predictor by using the enhancement layer. In addition to the macroblock from the base-layer frame (Type B), we construct two additional macroblock predictors: one from the previously reconstructed enhancement-layer frame (Type E) and one from the average of Type B and Type E (Type BE). These predictors are adaptively used to minimize the prediction residue. For example, because the base-layer frames are compressed



Figure 1.5: Prediction structure of enhanced mode-adaptive FGS.

at worse quality, the enhancement-layer frames with motion compensation generally provide better quality, and thus Type E can be used to improve the quality of the predictor. On the other hand, Type B is useful for regions where motion estimation cannot efficiently reduce the inter-frame correlation, e.g., fast-motion regions, occlusion regions, etc. Additionally, Type BE can improve the coding efficiency by taking the best of Type B and Type E. Figure 1.5 depicts the prediction structure of our enhanced mode-adaptive FGS algorithm (EMFGS). Compared with the prediction structure of MPEG-4 FGS [15] in Figure 1.2, the EMFGS provides better coding efficiency by using a closed-loop prediction at the enhancement layer.
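The three predictor types can be illustrated with a toy sketch. This is not the thesis's actual macroblock pipeline: `base_mc` and `enh_mc` are assumed 1-D stand-ins for the motion-compensated base-layer and enhancement-layer references, and the selection here uses a plain SAD criterion rather than the mode-selection rules developed in Chapter 3.

```python
def sad(a, b):
    """Sum of absolute differences between two sample lists."""
    return sum(abs(x - y) for x, y in zip(a, b))

def best_predictor(target, base_mc, enh_mc):
    candidates = {
        "B": base_mc,                                          # base layer only
        "E": enh_mc,                                           # enhancement layer only
        "BE": [(b + e) / 2 for b, e in zip(base_mc, enh_mc)],  # average of both
    }
    # Pick the mode that minimizes the prediction residue.
    mode = min(candidates, key=lambda m: sad(target, candidates[m]))
    return mode, candidates[mode]

target  = [10, 12, 14, 16]
base_mc = [6, 8, 10, 12]   # coarse base-layer reference
enh_mc  = [9, 12, 13, 17]  # higher-quality enhancement-layer reference
mode, pred = best_predictor(target, base_mc, enh_mc)
```

With these illustrative numbers the higher-quality Type E reference wins; for a fast-motion block where `enh_mc` tracks the target poorly, Type B or Type BE would be chosen instead.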

Although the coding efficiency can be improved by using the enhancement layer for prediction, drifting errors could occur at lower bit rates. As shown in Figure 1.5, a closed-loop prediction is introduced at the enhancement layer. During transmission, the enhancement layer is not guaranteed to be received as expected. Therefore, the predictor mismatch between the encoder and the decoder produces drifting errors.

To minimize drifting errors, we devise an adaptive mode-selection algorithm in the encoder, which first estimates the possible drifting errors at the decoder and then chooses the best macroblock mode. Particularly, we show that the Type BE predictor can be generalized to reduce drifting errors, and that the Type B predictor can completely stop them. To stop or reduce drifting errors, we adaptively use the Type B and Type BE predictors to offer reset and fading mechanisms.
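The reset and fading mechanisms can be sketched with a leaky-prediction toy model. The exact predictor formulas are those of Table 3.2; here we only assume the common leaky form P = (1 − α)·B + α·E, under which any mismatch between the encoder's and decoder's enhancement references is scaled by the fading factor α at every frame, so the drift decays geometrically instead of accumulating.

```python
def leaky_drift(initial_mismatch, alpha, n_frames):
    """Track predictor mismatch over successive frames under leaky prediction.

    Only the enhancement-layer part of the predictor (weight alpha) carries
    the error forward; the base-layer part (weight 1 - alpha) is drift-free.
    """
    mismatch = initial_mismatch
    history = [mismatch]
    for _ in range(n_frames):
        mismatch *= alpha
        history.append(mismatch)
    return history

drift_be   = leaky_drift(8.0, 1.0, 4)   # alpha = 1: the error never decays
drift_fade = leaky_drift(8.0, 0.75, 4)  # 0 < alpha < 1: fading mechanism
drift_b    = leaky_drift(8.0, 0.0, 4)   # alpha = 0 (Type B): immediate reset
```

This mirrors the qualitative behavior in Figure 3.6: α = 1 lets the mismatch persist, α = 0.75 fades it out over a few frames, and Type B resets it at once.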

As compared to other advanced FGS schemes [10][32][33], our EMFGS algorithm shows a PSNR improvement of 0.3~0.5dB with a less complex structure. Compared with MPEG-4 FGS [15], a gain of more than 1~1.5dB can be achieved.


1.1.3 Context-Adaptive Bit-Plane Coding

In addition to constructing better predictors, we also propose a context-adaptive bit-plane coding (CABIC) to improve the coding efficiency of the enhancement layer.

Currently, in MPEG-4 FGS [15], the bit-plane coding at the enhancement layer is performed from the most significant bit-plane to the least significant one. Each bit-plane is coded in a frame-raster and coefficient-zigzag scanning manner. Further, in each transform block, the coefficient bits are represented by (Run, EOP) symbols and coded with Huffman tables.
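The (Run, EOP) representation can be sketched as follows. This is a simplified reading of the MPEG-4 FGS symbols, with `run_eop_symbols` a hypothetical helper: each symbol records the zero-run before a 1-bit, and EOP marks the last 1 in the block's plane (an all-zero plane is signalled separately).

```python
def run_eop_symbols(bitplane):
    """Map one zigzag-ordered bit-plane of a block to (Run, EOP) symbols."""
    ones = [i for i, b in enumerate(bitplane) if b]
    symbols, prev = [], -1
    for j, pos in enumerate(ones):
        run = pos - prev - 1                 # zeros before this 1-bit
        eop = 1 if j == len(ones) - 1 else 0  # last 1 in this plane?
        symbols.append((run, eop))
        prev = pos
    if not ones:
        symbols.append(("ALL-ZERO",))        # plane with no 1-bits
    return symbols

plane = [0, 0, 1, 0, 1, 0, 0, 0]  # zigzag-ordered bits of one plane
symbols = run_eop_symbols(plane)   # [(2, 0), (1, 1)]
```

Note how the trailing zeros after the last 1 are never coded explicitly; they are implied by EOP = 1, which is where much of the scheme's compactness comes from.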

While offering a good embedded property, the current approach suffers from poor coding efficiency and subjective quality. The poor coding efficiency stems from three factors. Firstly, information with different weighting is jointly grouped into (Run, EOP) symbols and coded without differentiation. Secondly, the existing correlations across bit-planes and among spatially adjacent blocks are not fully exploited. Lastly, the Huffman tables have limited ability to adapt to the statistics of different sequences.

In addition, the frame-raster and coefficient-zigzag scanning causes poor subjective quality. Since the enhancement layer could be truncated during transmission, the frame-raster scanning may refine only the upper part of a frame with one extra bit-plane. Therefore, the lower part of a decoded frame normally has worse quality. Such uneven refinement degrades the subjective quality. An example of partial refinement is illustrated in Figure 1.6.

To improve the coding efficiency, our CABIC incorporates a context-adaptive binary arithmetic codec. The bit-planes are coded in a context-adaptive, bit-by-bit manner. To distinguish coefficient bits of different importance, we classify the coefficient bits into different types. For each type of bit, the context model is designed from a different source of correlation. Furthermore, to fully utilize the existing correlations, both the energy distribution in a block and the spatial correlations in the adjacent blocks are considered in our context models. Also, we exploit the context across bit-planes to save side information and use estimated Laplacian models to maximize the efficiency of the binary arithmetic codec.
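A minimal sketch of context-adaptive coding follows, under simplifying assumptions: a 1-D neighbourhood stands in for the spatial context, and ideal arithmetic-code lengths (−log2 p) replace a real binary arithmetic codec. `ContextCoder` is hypothetical and is not the CABIC context set of Chapter 4; it only shows how per-context adaptive probabilities shorten the code when neighbouring significance is predictive.

```python
import math
from collections import defaultdict

class ContextCoder:
    def __init__(self):
        # counts[context] = [zeros_seen, ones_seen]; a real arithmetic
        # coder would drive its interval subdivision from these estimates.
        self.counts = defaultdict(lambda: [1, 1])  # Laplace-smoothed

    def context(self, significance, i):
        # Toy context: how many of the two previous positions are already
        # significant (a 1-D stand-in for spatial neighbours).
        return sum(significance[max(0, i - 2):i])

    def code(self, bits):
        cost = 0.0
        significance = [0] * len(bits)
        for i, b in enumerate(bits):
            ctx = self.context(significance, i)
            zeros, ones = self.counts[ctx]
            p = (ones if b else zeros) / (zeros + ones)
            cost += -math.log2(p)     # ideal arithmetic-code length
            self.counts[ctx][b] += 1  # adapt the model for this context
            significance[i] |= b
        return cost

cost = ContextCoder().code([0, 0, 1, 1, 0, 1])  # bits of one toy plane
```

The returned `cost` is the idealized number of bits; clustering of significant bits makes the context counts skewed and the cost drop below one bit per symbol on average.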

Furthermore, to improve the subjective quality, we develop a stochastic bit reshuffling (SBR) scheme, which refines the base layer in a content-aware manner. Instead of using a deterministic coding order, our SBR employs a dynamic order to refine the regions of higher energy with higher priority. To achieve this, each coefficient bit is assigned a distortion reduction ∆D and a coding cost ∆R. With such information, the coefficient bits at the enhancement layer



Figure 1.6: Example of partial refinement due to the truncation of the enhancement layer.

are reordered such that the associated ∆D/∆R values are in descending order. Particularly, to avoid transmitting the exact coding order, both the encoder and the decoder model the transform coefficients with discrete Laplacian distributions and incorporate them into the context probability models for content-aware parameter estimation. In our scheme, the overhead is minimized since the coding order is implicitly known to both sides. Moreover, the bit reshuffling is conducted in a content-aware manner because our parameter estimation considers the energy distribution in the spatial domain by referring to the context probability models.
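The SBR priority rule can be sketched as follows. The helper `laplacian_p_significant` and all the numbers are illustrative assumptions, not the estimators derived in Chapter 4; the point is only that a Laplacian model shared by encoder and decoder lets both sides rank candidate bits by expected ∆D/∆R without any order being transmitted.

```python
import math

def laplacian_p_significant(sigma, plane):
    """Toy probability that a coefficient with Laplacian scale sigma first
    becomes significant at this bit-plane (magnitude in [2^plane, 2^(plane+1)))."""
    t = 2 ** plane
    return math.exp(-t / sigma) - math.exp(-2 * t / sigma)

def reshuffle(bits):
    """bits: list of (region, sigma, plane). Sort by expected dD/dR."""
    def priority(b):
        _, sigma, plane = b
        dD = (2 ** plane) ** 2 * laplacian_p_significant(sigma, plane)
        dR = 1.0  # assume roughly one bit of cost each in this toy model
        return dD / dR
    return sorted(bits, key=priority, reverse=True)

bits = [("smooth", 1.0, 2), ("busy", 8.0, 2), ("busy", 8.0, 1)]
order = reshuffle(bits)  # high-energy ("busy") regions refined first
```

Because both sides evaluate the same model on already-decoded data, the decoder recovers the same `order` deterministically, which is why SBR needs no explicit order overhead.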

As compared to the bit-plane coding in MPEG-4 FGS [15], our CABIC improves the PSNR by 0.5∼1.0dB at medium and high bit rates. While maintaining similar or even higher coding efficiency, our SBR significantly improves the subjective quality.

In summary, as compared to MPEG-4 FGS [15], our EMFGS together with CABIC can provide 2∼3dB PSNR improvement. On top of that, our SBR can significantly improve the subjective quality.

1.1.4 Applications in Scalable Video Coding Standard

Although the proposed EMFGS and SBR are mainly developed for MPEG-4 FGS [15], in this thesis, we also show that these techniques can be applied in the upcoming MPEG standard for scalable video coding (SVC) [23].

Specifically, our EMFGS can be used to improve the coding efficiency of anchor frames. In SVC [23], the anchor frames and their enhancement layer are coded in a way similar to that used by MPEG-4 FGS [15]. As will be shown, using the base layer as the only predictor


of the enhancement layer leads to poor coding efficiency. Thus, the techniques employed in the EMFGS can be applied for the coding of anchor frames. Moreover, we will show that such an application is more important in the low-delay applications, in which the coding efficiency of anchor frames is more critical to the overall performance.

In addition, we also demonstrate that the idea of SBR can be extended for coding the FGS layers. Currently, the FGS layers are coded by a cyclical block coding [26]. Each block is equally coded with one symbol in a coding cycle. Through the concept of SBR, a prioritized block coding is proposed to have the symbols with better rate-distortion performance be coded with higher priority. Also, by using explicit syntax for the priority information, the prioritized block coding can also serve the purpose of region-of-interest functionality.
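The difference between the two scan disciplines can be sketched as follows, with hypothetical per-block rate-distortion slopes.

```python
# Cyclical block coding visits every block once per coding cycle, while the
# prioritized variant (the SBR idea) codes the blocks with the best
# rate-distortion slope first. Block names and slopes are hypothetical.
slopes = {"block0": 2.0, "block1": 8.0, "block2": 4.0}

cyclical_order = list(slopes)                                   # fixed visiting order
prioritized_order = sorted(slopes, key=slopes.get, reverse=True)

print(cyclical_order)     # ['block0', 'block1', 'block2']
print(prioritized_order)  # ['block1', 'block2', 'block0']
```

If the bit-stream is truncated mid-cycle, the prioritized order ensures that the symbols already received are the ones with the largest distortion reduction per bit.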

1.2 Organization and Contribution

In this thesis, we develop an enhanced mode-adaptive FGS (EMFGS) algorithm to deliver higher coding efficiency with limited drifting errors. Moreover, we propose a context-adaptive bit-plane coding (CABIC) with a stochastic bit reshuffling (SBR) scheme to further improve the coding efficiency and the subjective quality. Although the proposed schemes are mainly developed for MPEG-4 FGS [15], we also show that these techniques can be extended and tailored for the upcoming SVC standard [23]. For details of each part, the rest of this thesis is organized as follows:

• Chapter 2 details the algorithm of MPEG-4 FGS [15].

• Chapter 3 shows the problem of MPEG-4 FGS [15] and presents the EMFGS algorithm, which is used for improving the predictor. Specifically, our contributions in the work of EMFGS include the following:

– We adaptively construct macroblock predictors from the previous enhancement-layer frame, the current base-layer frame, and the combination of both. We offer two new Type E and Type BE predictors in addition to the original Type B predictor.

– Two of the three modes can improve coding efficiency, and two of the three modes can reduce drifting errors. In particular, we adaptively use the Type E and Type BE predictors to increase the prediction efficiency, and we adaptively enable, at the macroblock level, a reset mechanism with the Type B predictor and a fading mechanism with the Type BE predictor.


– We create a dummy reference frame in the encoder to "accurately" model the drifting errors for each macroblock; thus, our best predictor selection is made according to the "improvement gain" and the "drifting loss".

– As compared to other advanced FGS schemes [10][32][33], our algorithm shows 0.3∼0.5dB PSNR improvement with a less complex structure. Compared to MPEG-4 FGS [15], more than 1∼1.5dB PSNR improvement can be gained.

• Chapter 4 illustrates the problem of current bit-plane coding scheme and shows the pro-posed CABIC and SBR algorithms, which are used for the improvement of coding effi-ciency and subjective quality. Our contributions in the works of CABIC and SBR include the following:

– We construct context models based on both the energy distribution in a block and the spatial correlations in the adjacent blocks. Moreover, we employ contexts across bit-planes to save side information.

– We use estimated Laplacian distributions to model transform coefficients and exploit maximum likelihood estimators to minimize overhead.

– While maintaining similar or even higher coding efficiency, our SBR dynamically reorders the coefficient bits so that the regions containing more energy are coded with higher priority.

– For the implementation of SBR, we develop a priority management scheme based on a dynamic memory organization.

– As compared to the bit-plane coding in MPEG-4 FGS [15], our CABIC further improves the PSNR by 0.5∼1.0dB at medium/high bit rates. With similar or even higher coding efficiency, our SBR significantly improves the subjective quality.

• Chapter 5 describes the applications of our proposed techniques in the SVC standard [23].

Specifically, we have shown that

– The EMFGS can be used to improve the coding efficiency of anchor frames.
– The SBR can be used to improve the coding efficiency of FGS layers.

– The SBR can be extended to provide a graceful region-of-interest functionality with arbitrary shape.
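As a side note on the Laplacian modeling listed among the Chapter 4 contributions: for a Laplacian density f(x) = exp(−|x|/b)/(2b), the maximum-likelihood estimate of the scale b from observed samples is their mean absolute value. A minimal sketch with hypothetical coefficients:

```python
# Maximum-likelihood estimation of the Laplacian scale parameter b:
# for f(x) = exp(-|x|/b) / (2b), maximizing the log-likelihood over n
# samples gives b_hat = (1/n) * sum(|x_i|). Coefficients are hypothetical.
coefficients = [3, -1, 0, 4, -2, 2]
b_hat = sum(abs(c) for c in coefficients) / len(coefficients)
print(b_hat)  # 2.0
```

Because the estimate depends only on already-decoded magnitudes, both encoder and decoder can compute it identically without any side information.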


CHAPTER 2

MPEG-4 Fine Granularity Scalability

2.1 Introduction

To serve video applications in a heterogeneous environment, the MPEG-4 streaming video profile [15] defines the fine granularity scalability (FGS), which provides a DCT-based quality (SNR) scalability using a layered approach. Specifically, the video is compressed into a base layer and an enhancement layer. The base layer offers a minimum guaranteed visual quality. Then the enhancement layer refines the quality over that offered by the base layer. Currently, the base layer is coded by a non-scalable codec using a closed-loop prediction. On the other hand, the enhancement layer is coded by an embedded bit-plane coding using an open-loop prediction and variable length codes (VLC).

Having the open-loop prediction and embedded bit-plane coding, the enhancement layer can be arbitrarily truncated and dropped for the adaptation of channel bandwidth and processing power. At decoder side, the video quality depends on how much enhancement layer is received and decoded. While more enhancement layers are received, the decoded quality is gradually improved.


The rest of this chapter is organized as follows: Section 2.2 presents the encoder and decoder architectures of MPEG-4 FGS [15]. Section 2.3 elaborates the details of bit-plane coding. In addition, Section 2.4 illustrates the tools for subjective quality improvement. Section 2.5 introduces the techniques for error resilience. Lastly, Section 2.6 summarizes this chapter.

2.2 Structure of MPEG-4 Fine Granularity Scalability

2.2.1 Encoder

MPEG-4 FGS [15] compresses the video into a base layer and an enhancement layer. Thus, the encoder of MPEG-4 FGS [15] consists of two parts: the base-layer encoder and the enhancement-layer encoder. Figure 2.1 shows the FGS encoder structure [15]. Like a non-scalable encoder, the base-layer encoder incorporates a closed-loop prediction to remove temporal redundancy. In particular, in MPEG-4 FGS [15], the base layer is implemented by the advanced simple profile of MPEG-4. However, other non-scalable encoders such as H.264 [31] can also be adopted for the base layer.

On top of the base-layer encoder, the reconstructed base-layer frame is subtracted from the original frame and the residue is coded by the enhancement-layer encoder. To remove spatial redundancy, the residue is first transformed with an 8x8 DCT. Then the DCT coefficients are coded by an embedded bit-plane coding to achieve fine granular SNR scalability.

In particular, a bit-plane shifter is placed prior to the bit-plane coding for prioritizing the coding of DCT coefficients and macroblocks. As it will be described in more detail, the purpose of bit-plane shifting is to improve subjective quality and provide region-of-interest functionality.

2.2.2 Decoder

Figure 2.2 depicts the decoder structure of MPEG-4 FGS [15]. Like the encoder, the decoder of MPEG-4 FGS [15] also includes a base-layer decoder and an enhancement-layer decoder. The base-layer decoder acts as a non-scalable decoder and the enhancement-layer decoder is employed for the reconstruction of refinement residue.

At the enhancement layer, the bit-planes are first gathered, shifted, and inverse transformed. Then the reconstructed base-layer frame is added to the reconstructed residue to produce the final output. In practice, the quality of the reconstructed frame depends on the number of bit-planes


Figure 2.1: System block diagram of MPEG-4 FGS encoder.

that are received and decoded. Due to the open-loop prediction, the enhancement layer can be arbitrarily truncated and dropped without introducing drifting errors. Thus, fine granular SNR scalability is achieved.

2.3 Embedded Bit-Plane Coding

For the embedded coding, the DCT coefficients at the enhancement layer are coded from the MSB bit-plane to the LSB bit-plane. In each bit-plane, the transform blocks in a frame are ordered with raster scanning and the coefficients in a block are coded in zigzag order. An example of bit-plane coding is shown in Figure 2.3, where we assume the enhancement-layer frame has only 4 transform blocks and each block simply contains 3 transform coefficients. For the exact operation, Table 2.1 lists the pseudo code of the bit-plane coding.

Basically, the coding starts with the examination of a MSB REACHED symbol. For each bit-plane, a MSB REACHED symbol is coded prior to the coding of a transform block to signal whether the MSB bit-plane is reached. To avoid coding the redundant bits before the MSB bit-plane, the coding of a transform block is enabled only when its MSB bit-plane is found. Specifically, a MSB REACHED symbol of value 1 signals the occurrence of the MSB bit-plane and enables the coding of a transform block.


Figure 2.2: System block diagram of MPEG-4 FGS decoder.

Figure 2.3: Example of bit-plane coding at the enhancement layer (four transform blocks visited in raster order; the coefficients within each block coded in zigzag order).


Table 2.1: Pseudo code of the bit-plane coding in MPEG-4 FGS

Set N = Number of transform blocks in an enhancement-layer frame

FOR Bit-plane = MSB bit-plane to LSB bit-plane
    FOR Block = 1 to N (Raster Scanning)
        IF (MSB REACHED[Block] == True)
        {
            WHILE (EOP != True)
            {
                ENCODE (Run, EOP) symbol in zigzag order
                IF (Sign is not yet coded) ENCODE sign bit
            }
        }
        ELSE
        {
            ENCODE a MSB REACHED bit
            IF (MSB REACHED[Block] == True)
            {
                WHILE (EOP != True)
                {
                    ENCODE (Run, EOP) symbol in zigzag order
                    IF (Sign is not yet coded) ENCODE sign bit
                }
            }
        }

In Figure 2.3, the MSB REACHED symbols of block 2 have the value 0 for the first two bit-planes, i.e., bit-planes 7 and 6. However, a symbol of value 1 is coded for bit-plane 5, where the MSB bit-plane of block 2 is reached.

As the coding of a transform block is enabled, the coefficient bits in a bit-plane are first represented by Run and End-of-Bit-plane (EOP) symbols. The Run specifies the distance between two non-zero coefficient bits and the EOP signals the end of coding in a bit-plane. Specifically, an EOP symbol of value 1 indicates the end of a bit-plane coding. To reduce the bit rate, the Run and EOP symbols are jointly coded with Huffman tables. Each combination of (Run, EOP) is assigned a codeword of variable length according to its probability of occurrence. Since the statistics vary across bit-planes, different bit-planes are coded with separate tables. In addition, a sign bit is coded after a (Run, EOP) pair if the sign of the non-zero coefficient is not coded yet. Figure 2.4 gives an example of the bit-plane coding in a transform block. In particular, we use a shaded rectangle to group the coefficient bits that are coded by a (Run,


Figure 2.4: Example of bit-plane coding in a transform block.

Table 2.2: List of (Run, EOP) symbols and sign bits for the example in Figure 2.4

Bit-Plane   (Run, EOP) Symbols and Sign Bits
BP3         (0, 0), Sign(DC0), (2, 1), Sign(AC3)
BP2         (0, 0), (0, 0), Sign(AC1), (1, 0), (1, 1), Sign(AC5)
BP1         (0, 0), (1, 0), Sign(AC2), (1, 0), Sign(AC4), (0, 1)
BP0         (0, 0), (1, 0), (0, 0), (1, 0), (0, 0), Sign(AC6), (0, 1), Sign(AC7)

EOP) symbol. Accordingly, from left to right, Table 2.2 lists the symbols to be coded in order.
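The symbol generation of Table 2.2 can be reproduced with the short sketch below. The coefficient magnitudes (15, 4, 3, 13, 2, 7, 1, 1 for DC0 through AC7) are read back from the bit patterns of the Figure 2.4 example; the function is a simplified model of the (Run, EOP) formation, not the normative codec.

```python
def bitplane_symbols(magnitudes, num_planes=4):
    """List the (Run, EOP) symbols and sign bits per bit-plane, following
    the MPEG-4 FGS representation. Zigzag order is assumed to be the index
    order of `magnitudes` (DC0, AC1, AC2, ...)."""
    names = ["DC0"] + [f"AC{i}" for i in range(1, len(magnitudes))]
    sign_coded = [False] * len(magnitudes)
    planes = {}
    for bp in range(num_planes - 1, -1, -1):
        bits = [(m >> bp) & 1 for m in magnitudes]
        nonzero = [i for i, b in enumerate(bits) if b]
        symbols = []
        for k, i in enumerate(nonzero):
            run = i - nonzero[k - 1] - 1 if k > 0 else i   # zeros skipped
            eop = 1 if k == len(nonzero) - 1 else 0        # last 1-bit in plane
            symbols.append((run, eop))
            if not sign_coded[i]:                          # sign sent with first 1-bit
                symbols.append(f"Sign({names[i]})")
                sign_coded[i] = True
        planes[f"BP{bp}"] = symbols
    return planes

# Magnitudes reconstructed from the example in Figure 2.4 / Table 2.2
planes = bitplane_symbols([15, 4, 3, 13, 2, 7, 1, 1])
print(planes["BP3"])  # [(0, 0), 'Sign(DC0)', (2, 1), 'Sign(AC3)']
```

Running the sketch over all four bit-planes reproduces every row of Table 2.2 exactly.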

2.4 Subjective Quality Enhancement

To achieve better subjective quality, MPEG-4 FGS [15] provides the tools of frequency weighting and selective enhancement. In common, both schemes use bit-plane shifting to prioritize the coding of DCT coefficients. However, they differ in the level at which the shifting technique is applied. The following details the functionality and algorithm of each tool.

2.4.1 Frequency Weighting

Frequency weighting is a tool for reducing the flickering effect. At the base layer, the high frequency coefficients are generally quantized with a larger step size. Thus, at the enhancement layer, the residues of high frequency coefficients may have larger magnitudes. During the bit-plane coding, the coefficients of larger magnitude are coded with higher priority. Therefore, the high frequency coefficients may be refined prior to the low frequency ones, which causes a flickering effect.


Figure 2.5: Example of bit-plane coding with frequency weighting.

To reduce the flickering effect and thus improve the subjective quality, MPEG-4 FGS [15] incorporates a frequency weighting technique. The basic idea is to shift up the coefficients of lower frequency so that they can be coded with higher priority. Figure 2.5 illustrates an example of bit-plane coding with frequency weighting, where the shifting factors for the DC, AC1, and AC2 coefficients are 3, 1, and 0, respectively. As shown, with frequency weighting, the coefficients of lower frequency are more likely to be coded with higher priority. In addition, the coding of each block becomes more uniform due to the weighting.

To specify the shifting factor for each transform coefficient, a frequency weighting matrix is coded at the frame level as side information. In particular, the co-located coefficients share the same shifting factor. Hence, an additional 64 shifting factors (for the 8x8 DCT) are required for the frequency weighting.

Besides requiring side information, the bit-plane shifting introduces redundant bits for coding, as depicted in Figure 2.5. The improvement of subjective quality comes at the cost of poor coding efficiency. On average, compared to the configuration without frequency weighting, enabling frequency weighting causes a PSNR loss of 2∼3dB.
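The shifting operation itself is simple to sketch. The shift factors (3, 1, 0) follow the Figure 2.5 example; the residue magnitudes are hypothetical.

```python
# Bit-plane shifting for frequency weighting: lower-frequency coefficients
# are left-shifted so their bits land in higher (earlier-coded) bit-planes.
# Shift factors follow the Figure 2.5 example; magnitudes are hypothetical.
magnitudes = {"DC0": 12, "AC1": 9, "AC2": 5}
shifts     = {"DC0": 3,  "AC1": 1, "AC2": 0}

shifted = {k: v << shifts[k] for k, v in magnitudes.items()}
planes_before = max(v.bit_length() for v in magnitudes.values())
planes_after  = max(v.bit_length() for v in shifted.values())

print(shifted)                      # {'DC0': 96, 'AC1': 18, 'AC2': 5}
print(planes_before, planes_after)  # 4 7
```

The jump from 4 to 7 bit-planes illustrates the redundant planes that the shifting introduces: the DC coefficient is now guaranteed to be refined first, but every added plane costs extra (Run, EOP) symbols.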

2.4.2 Selective Enhancement

Selective enhancement is a tool for an arbitrarily-shaped region-of-interest functionality. Instead of defining the shifting factor for each transform coefficient, selective enhancement specifies the shifting factor for each macroblock. For the macroblocks located in the regions of interest,


Figure 2.6: Example of bit-plane coding with selective enhancement.

larger shifting factors are used to assign higher coding priority. Figure 2.6 depicts an example of bit-plane coding with selective enhancement. As with frequency weighting, the region-of-interest functionality is achieved at the cost of poor coding efficiency, since more bit-planes are introduced for coding.

2.5 Error Resilience

For most of the applications using MPEG-4 FGS [15], error resilience is desirable because the video may be transmitted over error-prone channels. For error resilience, MPEG-4 FGS [15] uses a re-sync marker technique. Specifically, a re-sync marker, defined as 23 consecutive 0s followed by a 1, is periodically coded at the macroblock level to prevent error propagation. In addition, the re-sync marker is followed by the location of a macroblock and other necessary information that allows the decoding to be restarted. When errors occur, re-synchronization can be achieved by first searching for the re-sync marker. Then the decoding can be resumed after the re-sync marker.
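Re-synchronization by marker search can be sketched as follows; the surrounding bit content is hypothetical.

```python
# The re-sync marker is 23 consecutive '0' bits followed by a '1'. After an
# error, the decoder scans forward for this pattern and resumes decoding
# just after it. The bitstream content here is hypothetical.
MARKER = "0" * 23 + "1"

bitstream = "1011" + MARKER + "0110"   # corrupted data, marker, fresh data
pos = bitstream.find(MARKER)           # first re-sync point
resume_at = pos + len(MARKER)          # decoding restarts after the marker

print(pos, resume_at)  # 4 28
```

Everything between the error and the marker is discarded, which is what bounds the propagation to at most one re-sync interval.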

In addition to the re-sync marker, the syntax of the bit-plane start code is also used for error resilience. In particular, the bit-plane start code is not only for re-synchronization but also for signaling the location of a bit-plane. When several bit-planes of a region are lost, the bit-plane start code can stop error propagation and restart the decoding from a specific bit-plane.

Experimental results show that combining these techniques can produce a good result when errors occur.


2.6 Summary

In this chapter, we have reviewed the algorithm of MPEG-4 FGS [15]. For video applications in a heterogeneous environment, MPEG-4 FGS [15] provides a DCT-based scalable coding using a layered approach. It compresses the video into a base layer and an enhancement layer. The base layer offers a minimum guaranteed visual quality. Then the enhancement layer refines the quality over that offered by the base layer.

Currently, the base layer is coded by a non-scalable codec while the enhancement layer is coded with an embedded bit-plane coding. To produce an embedded bit-stream, the DCT coefficients of the enhancement layer are coded from the MSB bit-plane to the LSB bit-plane. In each bit-plane, the blocks of a frame are scanned in raster order and the coefficients within each block in zigzag order. With the embedded property, the enhancement layer can be arbitrarily truncated for the adaptation of channel bandwidth and processing power.

To deliver better subjective quality, MPEG-4 FGS [15] provides a frequency weighting matrix and a scheme for selective enhancement. The frequency weighting prioritizes the coding of DCT coefficients so that the coefficients of lower frequency can be coded with higher priority. Its purpose is to reduce the flickering effect caused by quantization. By applying the same technique at the macroblock level, the selective enhancement offers region-of-interest functionality by coding the macroblocks in the specified regions first.

Besides these tools, MPEG-4 FGS [15] also incorporates a re-sync marker technique to address error resilience. To stop error propagation, the re-sync marker, represented by a specific codeword, is periodically coded in the bit-stream. Moreover, the re-sync marker is followed by the information required for restarting the decoding. When errors occur, the error propagation is constrained between two re-sync markers and the decoding can be resumed after a re-sync marker.

Although MPEG-4 FGS [15] offers good scalability at fine granularity, the current approach suffers from poor coding efficiency and subjective quality. In the following chapters, we analyze these problems and provide our solutions.


CHAPTER 3

Enhanced Mode-Adaptive Fine Granularity Scalability

3.1 Introduction

While offering good scalability at fine granularity, the compression efficiency of MPEG-4 FGS [15] is often much lower than that of a non-scalable codec. Currently, in MPEG-4 FGS [15], the enhancement layer is predicted from the base layer. In most applications, the base layer is encoded at a very low bit rate and the reconstructed base layer often has poor quality. Because a predictor of poor quality cannot effectively remove the redundancy, the coding efficiency is inferior.

Using the enhancement layer for prediction can improve the coding efficiency [10][18][35]. Particularly, PFGS [35] constructs a macroblock predictor from a previous enhancement-layer frame. In addition to the previous enhancement-layer frame, RFGS [11] further exploits a previous base-layer frame while producing a frame-based predictor. In our previous work [18], we offer three macroblock predictors: (1) Type B: the predictor constructed from the current base-layer frame, (2) Type E: the predictor constructed from the previous enhancement-layer


frame, and (3) Type BE: the predictor constructed from the average of the previous two modes. While differing in constructing the enhancement-layer predictor, all the advanced FGS schemes try to find a better predictor for improving the coding efficiency.

Although the coding efficiency can be improved by using the enhancement-layer frame, drifting errors can occur at low bit rates. This is because the enhancement layer is not guaranteed to be received in an expected manner. The predictor mismatch between encoder and decoder produces drifting errors. PFGS [32][33][36] stops the drifting errors by enabling a predictor that artificially creates mismatch errors during encoding. The predictor is enabled by a mode decision mechanism [32][33][34]. In RFGS [11], a predictive leaky factor between 0 and 1 is applied to decay the drifting errors; the method multiplies the previous enhancement-layer frame by a fractional factor α. In this thesis, we adaptively use the Type B and Type BE predictors to offer two schemes, the reset and fading mechanisms, to stop or reduce drifting errors. During the predictor selection, we estimate the possible drifting errors by introducing a dummy reference frame in the encoder.

While preserving the scalability of MPEG-4 FGS [15], our goal is to offer better coding efficiency at all bit rates. Figure 3.1 characterizes our goal in terms of rate-distortion performance. The rest of this chapter is organized as follows: Section 3.2 formulates the problem. Section 3.3 describes our enhanced mode-adaptive FGS (EMFGS) scheme, including the formulations of the prediction modes and the mode selection algorithm. Section 3.4 analyzes the distributions of prediction modes under different conditions. Section 3.5 depicts our encoder and decoder structures. Section 3.6 further compares our approach with other advanced FGS schemes and demonstrates the rate-distortion performance of our proposed codec. Finally, Section 3.7 summarizes our work.

3.2 Problem Formulation

The problem that we would like to solve is how to construct a better enhancement-layer predictor that improves the coding efficiency while minimizing the degradation from drifting errors.

When our EMFGS, MPEG-4 FGS [15], and other advanced FGS schemes compress the video into a base layer and an enhancement layer, there are three assumptions:


Figure 3.1: Comparison of rate-distortion performance for a non-scalable codec, MPEG-4 FGS, and advanced FGS algorithms.

2. Base layer is of low bit rate and low quality. Thus, the residue at the enhancement layer is large.

3. The enhancement layer is not received in the manner expected by the encoder/server. Thus, there can be a predictor mismatch if we create a prediction loop at the enhancement layer.

Based on these assumptions, this section describes (a) the formulation for minimizing the prediction residue at the enhancement layer, (b) the problem when the decoder receives less enhancement layer than expected, (c) the formulation of the predictor mismatch at the enhancement layer, and (d) the target of constraining the mismatch errors.

3.2.1 Predictor for Enhancement Layer

While MPEG-4 FGS [15] predicts the enhancement layer from the base layer, we can exploit the available reconstructed frames at time t to form a better enhancement-layer predictor. Currently, the enhancement-layer predictor PE(t) is the following function:

P_E(t) = f(I_B(t)).   (3.1)

To construct a better enhancement-layer predictor PE(t), we can optimally exploit all the available reconstructed frames at time t, as illustrated below:


Because Eq. (3.2) offers more selections for constructing the predictor than Eq. (3.1) does, it is easier to minimize the prediction residue at the enhancement layer as Eq. (3.3).

\min ||I_o(t) - P_E(t)||.   (3.3)

Because the residue contains less energy, the reconstructed enhancement-layer frame IE(t) will have better quality.

While the optimal predictor requires multiple frame buffers and motion compensation loops, our predictor is restricted to be constructed from the current base-layer frame and the previous enhancement-layer frame, for lower complexity; that is,

P_E(t) = f(I_B(t), I_E(t-1)).   (3.4)

In this case, we only need two frame buffers and motion prediction loops. To further improve the prediction efficiency, one can introduce more frame buffers and adaptively select reference frames, as in the long-term prediction of [31]. For simplicity of presentation, we will use Eq. (3.4) instead of Eq. (3.2) for the rest of the theoretical framework. One can easily replace Eq. (3.4) with Eq. (3.2) for a more detailed theoretical derivation.
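The three predictor types used later in this chapter (Table 3.1) can be sketched against Eq. (3.4); motion compensation is reduced to the identity and the sample values are hypothetical scalars rather than macroblocks.

```python
# Sketch of the three macroblock predictor modes (Type B, E, BE).
# MC() stands for motion compensation and is the identity here; the
# arguments are hypothetical scalar pixel values, not real macroblocks.
def MC(frame):
    return frame  # placeholder for motion compensation

def predictor(mode, IB_t, IE_prev):
    if mode == "B":                        # from the current base layer
        return IB_t
    if mode == "E":                        # from the previous enhancement layer
        return MC(IE_prev)
    if mode == "BE":                       # average of the two
        return (IB_t + MC(IE_prev)) / 2
    raise ValueError(mode)

print(predictor("BE", 100.0, 104.0))  # 102.0
```

Type B matches the MPEG-4 FGS predictor of Eq. (3.1); Type E realizes the worst-case enhancement-only prediction of Eq. (3.12); Type BE averages the two, which is what makes the fading mechanism possible.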

3.2.2 Predictor Mismatch

Although we can construct a better predictor from the reconstructed enhancement-layer frames, using the reconstructed enhancement-layer frames as the predictor can create a mismatch problem. This is because the decoder may not receive the enhancement layer in an expected manner.

When the decoder receives less enhancement layer, a distorted enhancement-layer predictor is reconstructed at the decoder side. Because the enhancement-layer predictor is built from the reconstructed base-layer frame as well as the reconstructed previous enhancement-layer frame, the enhancement-layer predictor at the decoder side becomes $\tilde{P}_E(t)$ instead of PE(t), as shown below:

\tilde{P}_E(t) = f(I_B(t), \tilde{I}_E(t-1)).   (3.5)

The difference between PE(t) in Eq. (3.4) and $\tilde{P}_E(t)$ in Eq. (3.5) is the predictor mismatch between encoder and decoder.

At the decoder, the reconstructed picture equals the summation of the predictor and the residue:

\tilde{I}_o(t) = \tilde{P}_E(t) + \varepsilon(t).   (3.6)

In this case, even if we receive the correct residue at time t, we cannot reconstruct perfect pictures without errors:

Error = I_o(t) - \tilde{I}_o(t)
      = (\varepsilon(t) + P_E(t)) - (\varepsilon(t) + \tilde{P}_E(t))   (3.7)
      = P_E(t) - \tilde{P}_E(t).

3.2.3 Drifting and Accumulation Errors

For more details, we further illustrate the mismatch problem using the end-to-end transmission model shown in Figure 3.2. As illustrated, the enhancement-layer residue at the encoder is

\varepsilon(t) = I_o(t) - P_E(t) \quad \forall t \geq 0,   (3.8)

and its reconstructed frame for the construction of the future predictor is

I_E(t) = \mathrm{Trun}\langle \varepsilon(t) \rangle + P_E(t) = \hat{\varepsilon}(t) + P_E(t).   (3.9)

Through an erasure channel, the enhancement layer received by the decoder is modeled as the subtraction of an error term d(t) from the original enhancement-layer residue \varepsilon(t):

\tilde{\varepsilon}(t) = \varepsilon(t) - d(t).   (3.10)

Therefore, at the decoder, the reconstructed enhancement-layer frame for the construction of the future predictor is

\tilde{I}_E(t) \triangleq \mathrm{Trun}\langle \tilde{\varepsilon}(t) \rangle + \tilde{P}_E(t) = \hat{\tilde{\varepsilon}}(t) + \tilde{P}_E(t) \quad \forall t \geq 0.   (3.11)

To illustrate the worst case of the mismatch effect, we define the enhancement-layer predictor


Figure 3.2: An end-to-end transmission model for the analysis of drifting errors in the enhanced mode-adaptive FGS algorithm.

as the previously reconstructed enhancement-layer frame:

P_E(t) = f(I_B(t), I_E(t-1)) \triangleq 0 \times I_B(t) + 1 \times MC_t\langle I_E(t-1) \rangle.   (3.12)

Recall that the enhancement layer is not guaranteed to be received in an expected manner. Thus, constructing the predictor purely from the enhancement layer produces the worst case for the mismatch problem. From the definition in Eq. (3.12), the equivalent predictor at the decoder can be written as follows:

\tilde{P}_E(t) \triangleq MC_t\langle \tilde{I}_E(t-1) \rangle.   (3.13)

To represent the predictor at the decoder as a function of the received enhancement layer, we substitute Eq. (3.11) into Eq. (3.13). After the recursive substitution, we have the following expression:

\tilde{P}_E(t) = MC_t\langle \tilde{I}_E(t-1) \rangle
             = MC_t\langle \hat{\tilde{\varepsilon}}(t-1) + \tilde{P}_E(t-1) \rangle   (3.14)
             = MC_t\langle \hat{\tilde{\varepsilon}}(t-1) + MC_{t-1}\langle \hat{\tilde{\varepsilon}}(t-2) + MC_{t-2}\langle \cdots + MC_1\langle \hat{\tilde{\varepsilon}}(0) \rangle \rangle \rangle \rangle.

By further substituting Eq. (3.10) into Eq. (3.14), we can group all the transmission errors together as follows:

\tilde{P}_E(t) = MC_t\langle \hat{\varepsilon}(t-1) + MC_{t-1}\langle \hat{\varepsilon}(t-2) + MC_{t-2}\langle \cdots + MC_1\langle \hat{\varepsilon}(0) + \tilde{P}_E(0) \rangle \rangle \rangle \rangle
             - MC_t\langle \hat{d}(t-1) + MC_{t-1}\langle \hat{d}(t-2) + MC_{t-2}\langle \cdots + MC_1\langle \hat{d}(0) \rangle \rangle \rangle \rangle   (3.15)
             = P_E(t) - \mathrm{MismatchError},

where $\tilde{P}_E(0) = P_E(0) = I_B(0)$ because the enhancement-layer predictor for the first intra-frame is from the base layer. The first term in Eq. (3.15) is the enhancement-layer predictor PE(t) at the encoder and the grouped error terms become the equivalent predictor mismatch error:

\mathrm{MismatchError} = P_E(t) - \tilde{P}_E(t) = \sum_{i=0}^{t-1} \hat{d}(i),   (3.16)

where we omit the motion compensation expressions for notational simplicity. From Eq. (3.16), the transmission error further creates two kinds of errors:

1. Drifting error: for a single transmission error at time j, i.e., $d(j)\delta[i-j]$, the mismatch error in Eq. (3.16) can be expressed as $\hat{d}(j)\mu[t-1-j]$. In other words, the transmission error at time j drifts into the enhancement-layer predictors after time j, i.e., $\{P_E(t) \mid t > j\}$.

2. Accumulation error: the equivalent predictor mismatch error at frame j is the accumulation of the transmission errors before frame j, i.e., $\sum_{i=0}^{j-1} \hat{d}(i)$. It is a consequence of drifting error and temporal prediction.
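The accumulation described by Eq. (3.16) can be checked with a toy scalar simulation in which motion compensation is the identity and Trun⟨·⟩ keeps only coarse bit-planes; all signal values are hypothetical.

```python
# Toy scalar walk-through of Eqs. (3.8)-(3.16): the encoder and decoder run
# the same prediction loop, but the decoder's residues lose d(t) in transit.
# MC is the identity; Trun keeps coarse bit-planes (multiples of 4 here).
def trun(x, q=4):
    return (x // q) * q

frames = [100, 104, 96, 110, 90]   # hypothetical original pixels Io(t)
drops  = [0, 8, 0, 12, 0]          # hypothetical transmission losses d(t)

P_enc = P_dec = 64                 # P_E(0) = base-layer reconstruction (assumed)
accumulated = 0                    # running sum of d_hat(i), Eq. (3.16)
for Io, d in zip(frames, drops):
    # Mismatch before coding frame t equals the accumulated truncated losses.
    assert P_enc - P_dec == accumulated
    eps = Io - P_enc                           # Eq. (3.8)
    I_enc = P_enc + trun(eps)                  # Eq. (3.9)
    I_dec = P_dec + trun(eps - d)              # Eqs. (3.10)-(3.11)
    accumulated += trun(eps) - trun(eps - d)   # d_hat(t)
    P_enc, P_dec = I_enc, I_dec                # next predictors (MC = identity)

print(P_enc - P_dec)  # 20: the two losses (8 and 12) persist and add up
```

The final mismatch of 20 is exactly the sum of the truncated losses at frames 1 and 3, showing that without a reset or fading mechanism, every transmission error survives in all later predictors.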

3.2.4 Constraining Predictor Mismatch

Although a better predictor can bring coding gain at high bit rates, it can introduce drifting errors at low bit rates. When optimizing the performance at different bit rates, a dilemma may occur. As a result, our goal is to find the predictor function f(·) in Eq. (3.2) that minimizes the prediction residue at the enhancement layer, as described in Eq. (3.17),


and constrains the predictor mismatch as described by the following:

||P_E(t) - \tilde{P}_E(t)|| \leq \mathrm{Threshold}.   (3.18)

To find the best trade-off between the prediction residue and the mismatch error, we employ a Lagrange multiplier as in the traditional rate-distortion optimization problem. We observe that the prediction residue, $||I_o(t) - P_E(t)||$, is inversely proportional to the mismatch error, $||P_E(t) - \tilde{P}_E(t)||$, as depicted in Figure 3.3. In other words, as more enhancement layer is used for prediction, a predictor of better quality can reduce the prediction residue. However, if the enhancement layer is not received, a more serious mismatch error may occur. Therefore, according to the Lagrange principle, the optimal predictor function is the one that minimizes the Lagrange cost:

min(λ× kIo(t)− PE(t)k + ||PE(t)− ePE(t)||). (3.19) In practice, we find that the convex property in Figure 3.3 is not guaranteed; that is, the Lagrange solution may lead to a sub-optimal solution. However, even with such imperfection, our heuristic solution still follows the Lagrange principle, i.e., the determination of predictor function should consider both coding gain and drifting loss.

3.3 Mode-Adaptive Prediction

While the previous section formulates the problem, this section describes our proposed scheme, including our new enhancement-layer predictors for better coding efficiency, our mechanisms to reduce drifting error, and our adaptive mode-selection scheme.

3.3.1 Prediction Modes for Minimizing Residue

To improve coding efficiency, we need to minimize the prediction residue, as stated in Eq. (3.17). Our method is to offer a set of better predictors built from the available enhancement-layer frames.

We create three macroblock predictors for the prediction of the enhancement layer. In addition to the predictor of MPEG-4 FGS [15], we have two additional predictors that utilize the previous enhancement-layer frame and the current base-layer frame. Their corresponding mathematical formulations are listed in Table 3.1, and the functionality of each mode is described as follows:
