Optimal Rate Allocation for Scalable Video Coding
(對可伸展式視訊編碼之最佳碼率分配)

Doctoral Dissertation

Graduate Institute of Electronics Engineering, College of Electrical Engineering & Computer Science

National Taiwan University

Peng, Guan-Ju (彭冠舉)

Advisors: Chen, Sao-Jie, Ph.D.; Hwang, Wen-Liang, Ph.D.

January, 2012


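Acknowledgments

I am fortunate to have completed my doctoral studies, and the people and circumstances I have to thank are too many to count. First, I thank my parents and family. Their unconditional support over the years allowed me to complete my studies without worry.

Next, I thank Professor Chen, Sao-Jie and Professor Hwang, Wen-Liang. Beyond their careful guidance on this dissertation, their care and advice in daily life and in personal conduct are treasures beyond measure. I thank the members of my oral defense committee, Professor 貝蘇章, Professor 何建明, Professor 傅楸善, and Professor 簡韶逸, for taking time out of their busy schedules to attend my defense; I am deeply grateful. Finally, I thank all the teachers, classmates, and friends who cared for and helped me. Your support and encouragement over these years have become my sweetest memories, and I wish you lasting peace and joy.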


Chinese Section


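Abstract

We consider each user's preference for different resolutions in scalable video coding. Based on these user preferences, we formulate and solve the bit allocation problems of wavelet-based scalable video coding and H.264/SVC. We first consider bit allocation for the wavelet-based video encoder and propose three methods. The first is an efficient Lagrangian-based method that solves the upper bound of the optimization problem, and the second is a less efficient dynamic programming method that obtains the optimal solution. Both methods require the user preferences to be known in advance. When the user preferences are unknown, we solve the problem with a min-max distortion approach. We find that the worst performance tends to occur when all users subscribe to the same resolution; hence, the min-max distortion approach coincides with the traditional bit allocation method of a wavelet codec. Our experiments, conducted under a variety of user preferences, show that knowing the user preferences significantly improves the coding performance of the scalable video codec.

The bit allocation problem of H.264/SVC is considerably more complex, because the distortion caused by the multi-layer coding structure of H.264/SVC must be understood and analyzed. In this dissertation, we analyze in detail the coding structures used to achieve temporal, spatial, and quality scalability in scalable video coding (SVC) and, based on the analysis, propose two rate-distortion (R-D) optimization algorithms: one is an optimal algorithm for known user preferences, and the other is a min-max distortion approach. Compared with a state-of-the-art bit allocation method, our algorithm significantly improves compression efficiency when the user preferences are known.

In this dissertation, we introduce the concept of user preference to scalable video coding and solve the corresponding bit allocation problems for the two most common scalable video coding methods, namely the MCTF-EZBC wavelet-based encoder and H.264/SVC. By comparing the performance of the methods that know the user preferences with that of the methods that do not, we also verify the importance of user preference in scalable video coding.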


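Table of Contents

Chapter 1 Introduction
Chapter 2 Performance of Scalable Video Coding and User Preferences
Chapter 3 Rate Allocation for Wavelet-Based Scalable Video Coding
Chapter 4 Rate Allocation for H.264/SVC
Chapter 5 Conclusion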


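Chapter 1 Introduction

With the growing popularity of video broadcasting in recent years, scalable video coding for video broadcasting has become increasingly important. Among current scalable video coding methods, two are of particular importance: the MCTF-EZBC wavelet-based video coding method and H.264/SVC. Because the rate allocation strongly affects compression performance, one of the greatest challenges in scalable video coding is how to allocate the rate appropriately so as to satisfy the needs of most users. To this end, we analyze both the MCTF-EZBC wavelet-based video coding method and H.264/SVC in detail and, based on these analyses, propose corresponding rate allocation methods. The experimental results show that, when user information and preferences are taken into account, our rate allocation methods outperform other existing rate allocation methods.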


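Chapter 2 Performance of Scalable Video Coding and User Preferences

In this chapter, we describe and analyze the architecture of video broadcasting and, from it, propose a performance criterion for scalable video coding that takes user preferences into account. Through suitable analysis, we then derive from this criterion the objective function that should be addressed when optimizing the rate allocation. This objective function plays an important role in developing the rate allocation methods for both the wavelet-based video compression method and H.264/SVC.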


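Chapter 3 Rate Allocation for Wavelet-Based Scalable Video Coding

In this chapter, we review and analyze the MCTF-EZBC wavelet-based video compression method and propose three methods to solve the rate allocation problem. The first is an efficient Lagrangian-based method that solves the upper bound of the optimization problem, and the second is a less efficient dynamic programming method that obtains the optimal solution. Both methods require the user preferences to be known in advance. When the user preferences are unknown, we solve the problem with a min-max distortion approach. We find that the worst performance tends to occur when all users subscribe to the same resolution; hence, the min-max distortion approach coincides with the traditional bit allocation method of a wavelet codec. Our experiments, conducted under a variety of user preferences, show that knowing the user preferences significantly improves the coding performance of the scalable video codec.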


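Chapter 4 Rate Allocation for H.264/SVC

In this chapter, we review H.264/SVC and express its compression method using matrices and vectors, in order to facilitate the development of the corresponding rate allocation methods. Compared with the wavelet-based video compression method, the rate allocation problem of H.264/SVC is considerably more complex, because the distortion caused by the multi-layer coding structure of H.264/SVC must be understood and analyzed. In this dissertation, we analyze in detail the coding structures used to achieve temporal, spatial, and quality scalability in scalable video coding (SVC) and, based on the analysis, propose two rate-distortion (R-D) optimization algorithms: one is an optimal algorithm for known user preferences, and the other is a min-max distortion approach. The experimental results show that, compared with a state-of-the-art bit allocation method, our algorithm significantly improves compression efficiency when the user preferences are known.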


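Chapter 5 Conclusion

In this dissertation, we introduce the concept of user preference to scalable video coding and solve the corresponding bit allocation problems for the two most common scalable video coding methods, namely the MCTF-EZBC wavelet-based encoder and H.264/SVC. By comparing the performance of the methods that know the user preferences with that of the methods that do not, we also verify the importance of user preference in scalable video coding.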


English Section


ABSTRACT

The scalable video coding problem is investigated, and based on the preferred resolution, the bit allocation problems of wavelet-based scalable video coding and H.264/SVC are formulated and solved.

For the wavelet-based video encoder, three methods are proposed. The first is an efficient Lagrangian-based method that solves the upper bound of the problem optimally, and the second is a less efficient dynamic programming method that solves the problem optimally. Both methods require knowledge of the user preference on resolution. For the case where the user preference is unknown, we solve the problem by a min-max approach. Our objective is to find the bit allocation solution that maximizes the worst possible performance. We show that the worst performance occurs when all users subscribe to the same spatial, temporal, and quality resolutions. Thus, the min-max solution is exactly the same as the traditional bit allocation method for a non-scalable wavelet codec. We conduct several experiments on the 2D+t MCTF-EZBC wavelet codec with respect to various subscriber preferences. The results demonstrate that knowing the user preferences improves the coding performance of the scalable video codec significantly.

For the rate allocation problem of H.264/SVC, we present a theoretical analysis of the distortion in multiple-layer coding structures. Specifically, we analyze the prediction structure used to achieve temporal, spatial, and quality scalabilities in scalable video coding (SVC), and show that the average peak signal-to-noise ratio (PSNR) of SVC is a weighted combination of the bit rates assigned to all the streams. We propose two rate-distortion (R-D) optimization algorithms: one employs the known user preference, and the other is based on the min-max approach, which assumes the least favorable prior of the user preference. We compare the performance of our algorithms with that of a state-of-the-art scalable bit allocation algorithm and demonstrate that they outperform the compared approach when the user preference is known to both coders.

In this dissertation, we propose the concept of user preference in scalable video coding and solve the corresponding rate allocation problems for the two most prevalent scalable video coding methods, namely the MCTF-EZBC wavelet-based encoder and H.264/SVC. After comparing the coding gains of the methods with complete preference information over those with incomplete preference information, we verify the importance of user preference in scalable video coding.


TABLE OF CONTENTS

ABSTRACT
LIST OF FIGURES
LIST OF TABLES
1 INTRODUCTION
2 SVC’S PERFORMANCE AND SUBSCRIBER PREFERENCES
    2.1 Video Broadcasting System and Performance Metrics of SVC
    2.2 Comparison of Wavelet Based Codec and H.264/SVC
3 RATE ALLOCATION FOR WAVELET BASED SVC
    3.1 MCTF-based 2D+t Wavelet Codec
        3.1.1 Spatial Temporal Subband Weighting
    3.2 Formulation of the Rate-Distortion Function
    3.3 Solving Rate-Allocation with Known Preferences
        3.3.1 Lagrangian-based Solution
        3.3.2 Optimal Solution Based on Dynamic Programming
    3.4 Min-Max Approach for Unknown Preferences
    3.5 Experiment Results
4 RATE ALLOCATION FOR H.264/SVC
    4.1 Rate-Allocation Problem for H.264/SVC
        4.1.1 Rate-Distortion Model
        4.1.2 Layer Dependency and Sequence of Approximations
    4.2 Prediction Residuals and Distortion Propagation
        4.2.1 Prediction Residuals
        4.2.2 Distortion Propagation in the Predictions
    4.3 Distortion of a Layer
        4.3.1 Prediction Error Propagation in a Temporal Level
        4.3.2 Exploring Error Propagation
    4.4 Average Distortion of SVC
    4.5 Solving the Inter-Layer Rate Allocation Problem
        4.5.1 Optimal Bit Allocation with Fixed Weights
        4.5.2 Optimal Bit Allocation Algorithm
        4.5.3 Min-max Approach for Incomplete Preference Information
    4.6 Implementation Issues and Experimental Results
        4.6.1 Coding Structure and Implementation Details
        4.6.2 Variance Approximation
        4.6.3 Performance Comparison
5 CONCLUSION
APPENDICES
    A.1 Preference Settings in the Third Experiment of the Wavelet-Based Codec
    A.2 Variance Approximation
    A.3 The Two-Stream Relation of Temporal Prediction at a Low Bit Rate
    A.4 The Two-Stream Relation of Spatial Prediction at a Low Bit Rate
    A.5 The Two-Stream Relation of Quality Prediction at a Low Bit Rate
    A.6 Supplementary Results of the Rate Allocation Methods for H.264/SVC
REFERENCE


LIST OF FIGURES

Figure 2.1 A Video Broadcasting System
Figure 2.2 Wavelet Based Coding Structure
Figure 2.3 H.264/SVC Coding Structure
Figure 3.1 An Example of Two-level Temporal Wavelet Decomposition
Figure 3.2 An Example of Indexing the Subbands
Figure 3.3 Construction of a DP Graph
Figure 3.4 Optimal Rate-Distortion Path to Quality Resolution r_i
Figure 3.5 Comparison of the Performance under Different Spatial Preferences
Figure 3.6 Comparison of the Performance under Different Temporal Preferences
Figure 3.7 Comparison of the Performance under Different Preferences
Figure 3.8 PSNR Gain for the Spatial Resolution in the Wavelet Codec
Figure 3.9 PSNR Gain for the Temporal Resolution in the Wavelet Codec
Figure 4.1 Data Dependency Structure of H.264/SVC
Figure 4.3 Approximation of the Distortion due to Temporal Prediction
Figure 4.4 Approximation of the Distortion due to Spatial Prediction
Figure 4.5 Approximation of the Distortion due to Quality Prediction
Figure 4.6 PSNR Gain of Proposed over Lagrangian in Temporal Scalability
Figure 4.7 PSNR Gain of Proposed over Lagrangian in Spatial Scalability
Figure 4.8 PSNR Gain of Proposed over Lagrangian in Quality Scalability
Figure 4.9 PSNR Gain of Proposed over MaxMin in Temporal Scalability
Figure 4.10 PSNR Gain of Proposed over MaxMin in Spatial Scalability
Figure 4.11 PSNR Gain of Proposed over MaxMin in Quality Scalability
Figure A.1 Visual Results for (μ0, μ1, μ2) = (0.8, 0.1, 0.1)
Figure A.2 Visual Results for (μ0, μ1, μ2) = (0.6, 0.2, 0.2)
Figure A.3 Visual Results for (μ0, μ1, μ2) = (0.4, 0.3, 0.3)
Figure A.4 Visual Results for (μ0, μ1, μ2) = (0.1, 0.8, 0.1)
Figure A.5 Visual Results for (μ0, μ1, μ2) = (0.2, 0.6, 0.2)
Figure A.6 Visual Results for (μ0, μ1, μ2) = (0.3, 0.4, 0.3)
Figure A.7 Visual Results for (μ0, μ1, μ2) = (0.1, 0.1, 0.8)
Figure A.8 Visual Results for (μ0, μ1, μ2) = (0.2, 0.2, 0.6)
Figure A.9 Visual Results for (μ0, μ1, μ2) = (0.3, 0.3, 0.4)


LIST OF TABLES

Table 4.1 Proposed Optimal Bit Allocation Algorithm for H.264/SVC
Table A.1 Setting 1 for the Third Experiment of the Wavelet Based Codec
Table A.2 Setting 2 for the Third Experiment of the Wavelet Based Codec
Table A.3 Setting 3 for the Third Experiment of the Wavelet Based Codec


CHAPTER 1 INTRODUCTION

Scalable video coding (SVC) facilitates the encoding of a bitstream containing representations with lower spatial resolutions, frame rates, and quality, which are designed to meet the heterogeneous display and computational capabilities of target devices. A client with restricted resources (display resolution, processing power, and bandwidth) can decode only a part of the delivered bitstream. Thus, SVC can be used in a wide range of multicast applications, such as Internet and wireless applications, where scalability is necessary in order to deal with the variable transmission conditions to the end-users. Another benefit of SVC is that it can adapt to a network-aware environment on-the-fly [1,2] when feedback is provided by the network and the end-users.

An important issue in SVC is how to measure the relative importance of a resolution to the overall coding performance. Many researchers have employed a weighting coefficient to represent the relative importance of a resolution. For example, Ramchandran, Ortega, and Vetterli [3] modeled the distortion as a summation of the weighted mean-square-errors (MSEs) on different resolutions, and proposed a bit-allocation algorithm based on the exhaustive search technique. Schwarz and Wiegand [4] adopted a similar approach by weighting the MSEs of the base layer and the enhancement layer, and demonstrated the effect of employing different weights on each layer on the overall coding performance. The above works do not explain the meaning of the weights or how to derive them. Since the peak signal-to-noise ratio (PSNR) is the most commonly used quality measurement of a coding system, in this dissertation, instead of weighting the MSE of a resolution, we weight the PSNR to measure the relative importance of a resolution to the overall coding performance.


A good coding performance metric for SVC should consider the subscriber preference for different resolutions. For example, if we want to produce bitstreams in two scenarios: one where all the subscribers prefer the QCIF display and the other where all the subscribers prefer the CIF display, then the optimal bitstreams for the two scenarios should be different. In the first scenario, the optimal bit allocation can only be obtained by allocating all the bits to the subbands that support the QCIF display. Obviously, this allocation cannot be optimal for the second scenario in which the optimal bit allocation must encode more spatial subbands to support a higher spatial resolution display with the CIF format.

Currently, there exist two promising frameworks for SVC. One is the wavelet-based coding method [5–9] and the other is H.264/SVC [10,11]. In the wavelet-based coding method, the video is first decomposed into multiple subbands, and then these subbands are each coded by the EZBC entropy coder [12,13]. Since the subbands are encoded independently, the rate allocated to one subband does not affect the distortions of the others. This property decreases the complexity of the rate allocation problem, but results in lower coding efficiency than H.264/SVC, which employs several techniques to remove the redundancy among the layers. The rate allocation problem of H.264/SVC is more complicated because the dependency among the layers must be analyzed before the rate allocation problem can be solved.

We analyze and solve the rate allocation problem of the wavelet-based coding method as follows: (1) based on the resolution preference, we formulate the bit allocation problem for the wavelet-based SVC, and show that the weighting coefficients can be derived from the subscriber preferences on different resolutions in a motion-compensated temporal filtering (MCTF)-based 2D+t wavelet video codec [14]; and (2) we propose three bit-allocation algorithms to solve the problem. The first is an efficient Lagrangian-based method that solves the upper bound of the problem optimally, and the second is a less efficient dynamic programming method that solves the problem optimally. Both methods require knowledge of the user preference. For the case where the user preference is unknown, we solve the problem by a min-max approach, whose objective is to optimize the bit allocation solution for the worst possible preference distribution.

The overall performance of our approach is highly dependent on whether the preferences on resolutions are provided to the wavelet codec. If they are provided, then our methods can achieve an overall PSNR that is at least as good as that of the state-of-the-art 3D wavelet codec in [15]. The PSNR gain of our method over that in [15] depends on the subscriber preference patterns. Our experiments on various video sequences with known preferences demonstrate that the overall PSNR gain of our method over that in [15] ranges from 0 to 25 dB when only spatial scalability is applied, from 0 to 5 dB when only temporal scalability is applied, and from 0 to 25 dB when both spatial and temporal scalability are applied. In fact, we show that the codec in [15] is a special case of our system where all the subscribers prefer the highest spatial and temporal resolutions. As a consequence, our codec has 0 dB gain over that in [15] under that particular preference pattern.

In practice, the subscriber preference is inaccessible to many scalable coding applications. To address the problem, we propose an algorithm based on the min-max approach to derive the optimal bit allocation when the subscriber preference distribution is not provided. We show that, under the min-max approach, the least favorable user preference distribution occurs when all users subscribe to the highest spatial and temporal resolutions. In that case, the bit-allocation problem of SVC is exactly the same as that of non-scalable video coding, where the goal is to solve the problem for one particular resolution optimally. Our experimental results show that, for a scalable video codec, there is a significant PSNR gap between scenarios where the user preferences are known and scenarios where the preferences are not known. Finding an approach to reduce the gap is beyond the scope of the present study, so we defer the matter to future work.


On the other hand, H.264/SVC is a state-of-the-art SVC codec that significantly reduces the gap in rate-distortion (R-D) efficiency between single layer coding and scalable coding [10,11]. The performance of SVC depends to a large extent on the settings of several parameters [16]. The quantization parameters (QP), the ratio of the I, P, and B frames, and the target bit rate have the most influence on the performance. In this dissertation, we study the multiple-layer bit rate allocation problem in SVC, also known as the optimal quantization parameter (QP) assignment to each layer in SVC. With the objective of simplifying the analysis without affecting its generality, we use a fixed set of values for several SVC coding parameters. Specifically, we assume that the motion vectors have been acquired already. In addition, we use the hierarchical B-frame structure for temporal scalability and inter-layer residual prediction for spatial and coarse-grain quality scalabilities [11].

The optimal bit allocation of a rate-constrained encoder control system is usually derived by applying the Lagrangian technique [17]. In contrast to single-layer video coding, SVC requires that all users are served simultaneously by a single bitstream. Thus, the data items in an SVC bitstream are highly correlated to each other. This inter-dependency can cause a coding error in one layer to propagate to other layers and thereby complicate the bit allocation process. Another factor that affects bit allocation under SVC is the end-user preference. For example, the bit allocation scheme for a user subscribed to the highest resolution should be different from that for a user subscribed to the lowest resolution, since the latter only uses the base layer information. Hence, the preferences for the resolutions should also be considered by the bit allocation scheme. However, incorporating user preferences into the bit allocation process implies that the preference information should be acquired by the encoder through a feedback mechanism. This is usually considered a disadvantage in a broadcasting environment.


In [4], Schwarz and Wiegand proposed an encoder control mechanism that jointly optimizes the coding parameters of the base layer and enhancement layers under H.264/SVC. Their algorithm also utilizes a weighted combination of the distortions of all the layers to balance the coding efficiency of different layers. Although the above approaches demonstrated the correlation between the coding performance and the values of the weighting factors, analyses of the derivation of the weighting factors were not provided.

Recently, Koziri and Eleftheriadis [18] presented an interesting approach that models the distortion dependency between layers as a stochastic process for joint optimization of scalable coding. However, their analysis is limited to Gaussian sources and spatial dependency.

To solve the rate allocation problem of H.264/SVC, we present a theoretical analysis of the weighting factors. We analyze the effect of a coding error in one layer on the other layers in terms of the residual prediction of temporal, spatial, and quality scalabilities under SVC. Then, we demonstrate that the weighting factor of a layer i is a function of all the layers affected by the coding error in layer i, and of the end-user preference for subscribing to the affected layers. Based on the analysis, we derive the main result, namely, that the average PSNR can be represented as a weighted combination of the bit rates assigned to each layer, where the coefficient is a weighting factor. We also propose an R-D optimization algorithm. Experiments on H.264/SVC JSVM 9.18 [19] demonstrate that our algorithm achieves a significant improvement in the average PSNR over that of the state-of-the-art method in [3,4]. We also show that knowing the user’s preference can significantly improve the coding performance of a scalable video coder.

The remainder of this dissertation is organized as follows. In the next chapter, we consider several issues that are relevant to the performance measurement of SVC. In Chapter 3, we analyze and solve the rate allocation problem of the wavelet-based SVC. In Chapter 4, we analyze the dependency of the rate-distortion curves in the prediction structure of H.264/SVC. According to the results obtained, we also formulate and solve the rate allocation problem of H.264/SVC. Chapter 5 contains some concluding remarks.


CHAPTER 2

SVC’S PERFORMANCE AND SUBSCRIBER PREFERENCES

In this chapter, we introduce the performance metrics of SVC in a video broadcasting system in Section 2.1, and compare two prevalent SVC frameworks, namely the wavelet-based video codec and H.264/SVC, in Section 2.2.

2.1 Video Broadcasting System and Performance Metrics of SVC

A general video broadcasting system consists of a video source, a scalable video coder, broadcasting servers, a network, and subscribers, as shown in Figure 2.1. The scalable coder encodes a source video such that the network’s bandwidth requirement is met and the subscriber demand is satisfied. The satisfaction of the subscriber demand can be quantified to measure the system performance. In [20], the performance of SVC is measured as follows:

$$
Q_{\text{all}} = \sum_{i \in N} Q^{i}, \tag{2.1}
$$

where $N$ denotes the set of subscribers, and $Q^{i}$ denotes the satisfaction of subscriber $i$’s demand by SVC, which is usually measured by the PSNR. However, the PSNR alone does not fully capture a subscriber's satisfaction, because he or she may value a higher frame rate or spatial resolution more than a higher PSNR. Thus, we introduce a preference factor $\psi \in [0,1]$ for each subscriber and combine it with the PSNR to obtain the following performance measurement:

$$
Q_{\text{all}} = \sum_{i \in N} \psi_{i}\,\mathrm{PSNR}_{i}. \tag{2.2}
$$

If we let $S$, $T$, and $R$ denote the sets of spatial, temporal, and quality resolutions respectively, then a resolution in SVC can be represented by $[s,t,r]$, where $s \in S$, $t \in T$, and $r \in R$. Denote subscriber $i$’s preference for the resolution $[s,t,r]$ as $\psi_{i,[s,t,r]}$, and let the PSNR of the resolution be $\mathrm{PSNR}_{[s,t,r]}$. Then, Equation (2.2) can be rewritten as follows:

$$
Q_{\text{all}} = \sum_{s \in S,\, t \in T,\, r \in R} \mathrm{PSNR}_{[s,t,r]} \sum_{i \in N} \psi_{i,[s,t,r]}. \tag{2.3}
$$

Figure 2.1: A Video Broadcasting System.

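The performance measurement can be normalized based on the subscriber preference such that we obtain

$$
\begin{aligned}
Q_{\text{average}} &= \frac{Q_{\text{all}}}{\sum_{s \in S,\, t \in T,\, r \in R} \sum_{i \in N} \psi_{i,[s,t,r]}} \\
&= \sum_{s \in S,\, t \in T,\, r \in R} \mathrm{PSNR}_{[s,t,r]} \left( \frac{\sum_{i \in N} \psi_{i,[s,t,r]}}{\sum_{s \in S,\, t \in T,\, r \in R} \sum_{i \in N} \psi_{i,[s,t,r]}} \right) \\
&= \sum_{s \in S,\, t \in T,\, r \in R} \mathrm{PSNR}_{[s,t,r]}\, \mu_{[s,t,r]},
\end{aligned}
\tag{2.4}
$$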

where the preference factor of the $[s,t,r]$ resolution is

$$
\mu_{[s,t,r]} = \frac{\sum_{q(i)=[s,t,r]} \psi_{i}}{\sum_{s \in S,\, t \in T,\, r \in R} \sum_{q(i)=[s,t,r]} \psi_{i}}, \tag{2.5}
$$

which represents the proportion of preferences for the resolution $[s,t,r]$. Since $\mu_{[s,t,r]}$ considers the preferences of all subscribers for the resolution $[s,t,r]$, it can be regarded as the system's preference for that resolution. Moreover, from the definition of $\mu_{[s,t,r]}$, we have $\mu_{[s,t,r]} \geq 0$ and

$$
\sum_{s \in S,\, t \in T,\, r \in R} \mu_{[s,t,r]} = 1. \tag{2.6}
$$
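As a rough illustration of Equations (2.4) to (2.6), the following Python sketch aggregates per-subscriber preference factors into the system-level preference and evaluates the average performance. The subscriber list, the resolution tuples, and the PSNR values are made-up inputs, the helper names `system_preferences` and `average_quality` are hypothetical, and each subscriber is assumed to request exactly one resolution.

```python
from collections import defaultdict


def system_preferences(subscribers):
    """Aggregate per-subscriber factors psi_i into mu_[s,t,r] (Eq. 2.5).

    `subscribers` is a list of (resolution, psi) pairs, where resolution is
    an (s, t, r) tuple and psi lies in [0, 1].
    """
    totals = defaultdict(float)
    for resolution, psi in subscribers:
        totals[resolution] += psi
    norm = sum(totals.values())
    if norm == 0:
        raise ValueError("at least one preference factor must be positive")
    return {res: value / norm for res, value in totals.items()}


def average_quality(mu, psnr):
    """Preference-weighted average PSNR over the resolutions (Eq. 2.4)."""
    return sum(weight * psnr[res] for res, weight in mu.items())


# Hypothetical example: three subscribers spread over two resolutions.
subscribers = [((0, 0, 0), 1.0), ((1, 1, 0), 0.5), ((1, 1, 0), 0.5)]
psnr = {(0, 0, 0): 36.0, (1, 1, 0): 32.0}  # made-up PSNR values in dB

mu = system_preferences(subscribers)
q_average = average_quality(mu, psnr)
```

In this toy setting both resolutions end up with the same system preference, so the average performance is simply the mean of the two PSNR values.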

The $\mathrm{PSNR}_{[s,t,r]}$ can be calculated as follows:

$$
\mathrm{PSNR}_{[s,t,r]} = 10 \log_{10} \frac{255^{2}}{\bar{D}_{[s,t,r]}}, \tag{2.7}
$$

where $\bar{D}_{[s,t,r]}$ denotes the average mean square error (MSE) of the frames in resolution $[s,t,r]$ (for simplicity, we use $D_{[s,t,r]}$ to denote $\bar{D}_{[s,t,r]}$ in Chapter 4). If we substitute Equation (2.7) into Equation (2.4) and use Equation (2.6), we have

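$$
\begin{aligned}
Q_{\text{average}} &= 10 \log_{10} 255^{2} \sum_{s \in S,\, t \in T,\, r \in R} \mu_{[s,t,r]} - 10 \sum_{s \in S,\, t \in T,\, r \in R} \mu_{[s,t,r]} \log_{10} \bar{D}_{[s,t,r]} \\
&= 10 \log_{10} 255^{2} - 10 \log_{10} \Bigg( \prod_{s \in S,\, t \in T,\, r \in R} \bar{D}_{[s,t,r]}^{\,\mu_{[s,t,r]}} \Bigg).
\end{aligned}
\tag{2.8}
$$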


It follows that maximizing the average performance $Q_{\text{average}}$ is equivalent to minimizing the weighted geometric mean of the distortion

$$
\prod_{s \in S,\, t \in T,\, r \in R} \bar{D}_{[s,t,r]}^{\,\mu_{[s,t,r]}}. \tag{2.9}
$$
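The equivalence can be checked numerically. The sketch below, under assumed (made-up) preference weights and MSE values, evaluates the average performance once as the preference-weighted sum of PSNRs and once from the weighted geometric mean of the distortions in Equation (2.9); the function names and the dictionary keys are hypothetical.

```python
import math


def psnr_from_mse(mse, peak=255.0):
    """PSNR in dB for a given mean square error (Eq. 2.7)."""
    return 10.0 * math.log10(peak ** 2 / mse)


def q_average_direct(mu, mse):
    """Preference-weighted sum of per-resolution PSNRs (Eq. 2.4)."""
    return sum(mu[k] * psnr_from_mse(mse[k]) for k in mu)


def q_average_geometric(mu, mse, peak=255.0):
    """Same quantity computed from the weighted geometric mean (Eq. 2.9)."""
    geo_mean = math.prod(mse[k] ** mu[k] for k in mu)
    return 10.0 * math.log10(peak ** 2) - 10.0 * math.log10(geo_mean)


# Made-up preferences (summing to 1) and distortions for three resolutions.
mu = {"low": 0.2, "mid": 0.3, "high": 0.5}
mse = {"low": 12.0, "mid": 20.0, "high": 35.0}

direct = q_average_direct(mu, mse)
via_product = q_average_geometric(mu, mse)

# Lowering any distortion lowers the product and raises the average PSNR.
improved = q_average_direct(mu, dict(mse, high=30.0))
```

The two evaluations agree up to floating-point rounding, which is why the later chapters can work with the product of distortions instead of the sum of PSNRs.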

Note that, in SVC, each temporal resolution involves a different number of frames. If a scalable coder adopts the dyadic temporal structure, which assumes that the number of frames in temporal resolution $t$ is $2^{t}$, then the overall distortion of the resolution $[s,t,r]$ in a group of pictures (GOP) is

$$
D^{\mathrm{GOP}}_{[s,t,r]} = 2^{t}\,\bar{D}_{[s,t,r]}. \tag{2.10}
$$

2.2 Comparison of Wavelet Based Codec and H.264/SVC

In a wavelet-based video encoder, the video frames are divided into several groups of pictures, and the encoder deals with each GOP sequentially. As shown in Figure 2.2, a GOP is spatially and temporally decomposed into several wavelet subbands by spatial filtering and motion-compensated temporal filtering, and these subbands are then encoded as bitstreams by the EZBC entropy coder. Since the rate of a subband indicates the length of the corresponding coded bitstream, the rate allocation problem in a wavelet encoding process amounts to deciding the lengths of the bitstreams appropriately.

The major advantage of the wavelet-based coding in the rate allocation problem is that the distortion of a subband is not affected by the rates allocated to the other subbands. As described in Appendix A.2, this property greatly reduces the complexity of deriving the solution to the rate allocation problem. In addition, the wavelet based coding is a natural solution to SVC. If the necessary subbands are provided, the same coding structure can be used to obtain any resolution.

As far as we know, the coding efficiency of a wavelet codec currently cannot exceed that of an H.264/SVC codec. The major reason is that the texture represented by the discrete integer transform is sparser than that represented by the wavelet transform. Moreover, the coding efficiency of a wavelet codec is limited by the non-integer computation of the wavelet transforms in the codec implementation.

Figure 2.2: Wavelet Based Coding Structure.

In H.264/SVC, three new features are introduced to improve the coding efficiency [11]: inter-layer intra prediction, inter-layer motion prediction, and inter-layer residual prediction. Inter-layer intra prediction can only be applied to macroblocks whose co-located block in the referred layer is INTRA coded. If inter-layer intra prediction is applied to a macroblock, the predicting signal of the macroblock is generated by up-sampling the reconstructed signal of the co-located block in the base layer. Inter-layer motion prediction is used to remove the redundancy of the side information between the dependent layers. If inter-layer motion prediction is applied to a macroblock, its modes and motion vectors are predicted from those of the co-located block in the base layer. When inter-layer residual prediction is applied to a macroblock, the reconstructed residual of the co-located block in the base layer is used to predict the residual, which is obtained after INTER or INTRA prediction.

The SVC coding structure consists of dependent layers, and a layer usually corresponds to a specific spatial resolution. An example of two spatial layers is given in Figure 2.3. Two consecutive layers may support the same spatial resolution, in which case the coding structure supports coarse-grain scalability. In each spatial layer, the basic concepts of motion-compensated prediction (INTER) and intra prediction (INTRA) are employed as in single layer coding, except that the redundancy between the dependent layers is removed by the newly introduced inter-layer coding features. The coded stream is further refined to support medium-/fine-grain scalability. In the last stage, these streams are combined by a multiplexer before they are used in the application.

Figure 2.3: H.264/SVC Coding Structure [4].

According to the results reported in [16], the coding efficiency of H.264/SVC is greatly improved and approaches that of single layer coding. However, since the inter-layer predictions are used to remove the redundancy between the layers, the distortion of a layer depends on the rates allocated to the referred layers. Consequently, the rate-distortion relations between the dependent layers must be analyzed and clarified before we can actually solve the rate allocation problem. Despite the increased complexity of the rate allocation problem, the improved coding efficiency provided by H.264/SVC is significant. As a consequence, H.264/SVC is possibly the most favorable solution to scalable video coding.


CHAPTER 3

RATE ALLOCATION FOR WAVELET BASED SVC

In this chapter, we solve the rate allocation problem for wavelet-based scalable video coding. In Section 3.1, we review the MCTF-based 2D+t wavelet coding scheme, and in Section 3.2, we formulate the rate-distortion function of a wavelet-based scalable video coder. In Section 3.3, we introduce methods to solve the bit-allocation problem when preferences are known; and in Section 3.4, we present a min-max based approach for the case where the subscriber’s preference is unknown. Section 3.5 reports the experimental results and contains some concluding remarks.

3.1 MCTF-based 2D+t Wavelet Codec

In a 2D+t wavelet coding scheme, video frames are first decomposed spatially into multiple spatial subbands by the undecimated wavelet transform, after which MCTF is applied to each subband separately. In the following, we review the decomposition of a 2D+t scheme.

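Let $F_{00}$ be the input frame, and let $\bar{H}_{k,0}$ and $\bar{H}_{k,1}$ be the analysis matrices to be used in the $k$-th spatial decomposition. Then, the $k$-th spatial wavelet decomposition can be written as

$$
\begin{bmatrix} F_{k0} & F_{k1} \\ F_{k2} & F_{k3} \end{bmatrix}
=
\begin{bmatrix} \bar{H}_{k,0} \\ \bar{H}_{k,1} \end{bmatrix}
F_{(k-1)0}
\begin{bmatrix} \bar{H}_{k,0}^{T} & \bar{H}_{k,1}^{T} \end{bmatrix},
\tag{3.1}
$$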

where $F_{k0}$, $F_{k1}$, $F_{k2}$, and $F_{k3}$ represent the LL, HL, LH, and HH subbands respectively. If necessary, the subband $F_{k0}$ can be further decomposed by the $(k+1)$-th analysis matrices $\bar{H}_{k+1,0}$ and $\bar{H}_{k+1,1}$.
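To make the matrix form of Equation (3.1) concrete, here is a minimal Python sketch of one decomposition level. It assumes orthonormal, decimated Haar analysis matrices purely for brevity; these are not the filters of the codec described here, and the helper names (`haar_analysis`, `decompose`) are hypothetical.

```python
import math


def matmul(a, b):
    """Plain-Python matrix product of two lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(row) for row in zip(*a)]


def haar_analysis(n):
    """Return (H0, H1): low-pass and high-pass (n/2 x n) Haar matrices."""
    c = 1.0 / math.sqrt(2.0)
    h0 = [[0.0] * n for _ in range(n // 2)]
    h1 = [[0.0] * n for _ in range(n // 2)]
    for r in range(n // 2):
        h0[r][2 * r], h0[r][2 * r + 1] = c, c
        h1[r][2 * r], h1[r][2 * r + 1] = c, -c
    return h0, h1


def decompose(frame):
    """One level of Eq. (3.1): returns the four blocks F_k0, F_k1, F_k2, F_k3."""
    h0, h1 = haar_analysis(len(frame))
    h0t, h1t = transpose(h0), transpose(h1)
    low, high = matmul(h0, frame), matmul(h1, frame)
    return (matmul(low, h0t), matmul(low, h1t),
            matmul(high, h0t), matmul(high, h1t))


# Made-up 4x4 input frame with a linear intensity ramp.
frame = [[float(4 * i + j) for j in range(4)] for i in range(4)]
f10, f11, f12, f13 = decompose(frame)


def energy(m):
    return sum(v * v for row in m for v in row)
```

Because the assumed Haar matrices are orthonormal, the total energy of the four blocks equals that of the input frame, and applying `decompose` again to `f10` gives the next decomposition level.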

To represent the temporal filtering scheme as a matrix computation process, we first represent a spatial subband $F$ of size $N \times N$ as an $N^{2} \times 1$ vector $f$; for example, the mapping between $f$ and $F$ can be realized by letting $f[i \cdot N + j] = F[i, j]$. MCTF is performed by using a temporal decomposition lifting structure. The MCTF motion vectors, which are computed before the temporal decomposition step, can be represented by an $N^{2} \times N^{2}$ matrix $P_{m}^{x,y}$, where $x$ and $y$ are, respectively, the predicting frame and the predicted frame in the $m$-th temporal decomposition [21]. Figure 3.1 shows an example of two-level temporal decomposition by applying the lifting structure on eight frames, where the frames outside the boundary of the GOP can be dealt with by pasting blank frames or by changing the bi-directional prediction mode to the uni-directional prediction mode.
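The vectorization and the matrix view of motion compensation can be sketched as follows. The sketch assumes a single global translation with border clamping, which is far simpler than block-based motion estimation; the function names are hypothetical and the frame values are made up.

```python
def vectorize(frame):
    """Map an N x N frame F to a vector f with f[i * N + j] = F[i][j]."""
    n = len(frame)
    return [frame[i][j] for i in range(n) for j in range(n)]


def translation_matrix(n, dy, dx):
    """Build an N^2 x N^2 0/1 matrix for a global shift (an assumed motion model).

    Row i * n + j selects the pixel of the predicting frame that predicts
    pixel (i, j) of the predicted frame; coordinates are clamped at the border.
    """
    size = n * n
    p = [[0.0] * size for _ in range(size)]
    for i in range(n):
        for j in range(n):
            src_i = min(max(i + dy, 0), n - 1)
            src_j = min(max(j + dx, 0), n - 1)
            p[i * n + j][src_i * n + src_j] = 1.0
    return p


def apply(matrix, vector):
    """Matrix-vector product."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


# Made-up 3x3 frame; each pixel is predicted from its right-hand neighbour.
frame = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
f = vectorize(frame)
p = translation_matrix(3, 0, 1)
prediction = apply(p, f)
```

A dense matrix is used only to mirror the notation; a practical implementation would store the motion field sparsely.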

In MCTF, motion compensation is performed by using even frames $f^{2i}$ to predict odd frames $f^{2i+1}$. In the $m$-th temporal decomposition, the coefficients $h_{m}^{2i+1}$ in the higher frequency subband are obtained as follows:

$$
h_{m}^{2i+1} = f_{m-1}^{2i+1} - \Big( \sum_{j} H_{m}[2j]\, P_{m}^{2j,2i+1} f_{m-1}^{2j} \Big), \tag{3.2}
$$

where $H_{m}$ represents the coefficients of the temporal analysis filter. The inverse motion vector matrix $U_{m}^{2i+1,2j}$ can be calculated from the motion vector matrix $P_{m}^{2j,2i+1}$ [22].

The decomposed L-frame, obtained after the updating stage, is computed as follows:

$$
l_{m}^{2i} = f_{m-1}^{2i} + \Big( \sum_{j} H_{m}[2j+1]\, U_{m}^{2j+1,2i} h_{m}^{2j+1} \Big). \tag{3.3}
$$
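The predict and update steps of Equations (3.2) and (3.3) can be sketched in Python under two simplifying assumptions that are not taken from the text: the motion matrices are the identity (no motion), and the filter taps are those of the 5/3 lifting scheme (1/2 for prediction, 1/4 for update). At the GOP boundary the sketch falls back to uni-directional prediction by reusing the only available neighbour. All names are hypothetical.

```python
def lift_once(frames):
    """One temporal decomposition level; returns (L-frames, H-frames)."""
    if len(frames) % 2 != 0:
        raise ValueError("the number of input frames must be even")
    evens, odds = frames[0::2], frames[1::2]
    highs = []
    for i, odd in enumerate(odds):
        left = evens[i]
        # Boundary case: no right neighbour, so predict from the left only.
        right = evens[i + 1] if i + 1 < len(evens) else evens[i]
        highs.append([o - 0.5 * (a + b) for o, a, b in zip(odd, left, right)])
    lows = []
    for i, even in enumerate(evens):
        right = highs[i]
        left = highs[i - 1] if i > 0 else highs[i]
        lows.append([e + 0.25 * (a + b) for e, a, b in zip(even, left, right)])
    return lows, highs


def unlift_once(lows, highs):
    """Invert lift_once by undoing the update step, then the predict step."""
    evens = []
    for i, low in enumerate(lows):
        right = highs[i]
        left = highs[i - 1] if i > 0 else highs[i]
        evens.append([l - 0.25 * (a + b) for l, a, b in zip(low, left, right)])
    frames = []
    for i, high in enumerate(highs):
        left = evens[i]
        right = evens[i + 1] if i + 1 < len(evens) else evens[i]
        odd = [h + 0.5 * (a + b) for h, a, b in zip(high, left, right)]
        frames.extend([evens[i], odd])
    return frames


# Made-up GOP of four vectorized frames with two pixels each.
frames = [[10.0, 20.0], [12.0, 22.0], [14.0, 24.0], [16.0, 26.0]]
lows, highs = lift_once(frames)
restored = unlift_once(lows, highs)
```

The lifting structure is invertible by construction, so the round trip reproduces the input frames whatever prediction weights are chosen.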

Figure 3.1: An Example of Two-level Temporal Wavelet Decomposition.


The L-frames can be taken as the input for the next round of temporal decomposition. Let $S_{m}$ denote the number of input frames for the $m$-th temporal decomposition. To ensure that the notation is consistent between the input frames of the current temporal decomposition and the L-frames of the previous decomposition, we let $f_{m}^{i} = l_{m}^{2i}$ for the first half of the frames, and $f_{m}^{i} = h_{m}^{2(i - S_{m}/2)+1}$ for the remainder. Each subband can be indexed by $(xy, mn)$, where $xy$ represents the $y$-th spatial subband after the $x$-th spatial decomposition, and $mn$ represents the $n$-th temporal subband after the $m$-th temporal decomposition. Figure 3.2 shows an example of the spatial-temporal subband indices after two-level spatial filtering and three-level temporal filtering are applied to a group of pictures (GOP) comprising eight frames. In the example, the two spatial and three temporal decompositions yield three spatial resolutions (in (a)) and four temporal resolutions.
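The re-indexing rule can be written as a small helper. This is only a sketch of the index mapping as read from the text above; the function name `source_of` is hypothetical.

```python
def source_of(i, s_m):
    """Return which frame of the m-th decomposition becomes f_m^i.

    The first half of the indices map to L-frames l_m^{2i}; the rest map to
    H-frames h_m^{2(i - s_m/2) + 1}. `s_m` is the number of input frames.
    """
    if s_m % 2 != 0 or not 0 <= i < s_m:
        raise ValueError("s_m must be even and 0 <= i < s_m")
    half = s_m // 2
    if i < half:
        return ("l", 2 * i)
    return ("h", 2 * (i - half) + 1)


# Mapping for a GOP of eight frames.
mapping = [source_of(i, 8) for i in range(8)]
```

For eight frames, indices 0 to 3 point at the L-frames with even superscripts and indices 4 to 7 at the H-frames with odd superscripts.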

Figure 3.2: An Example of Indexing the Subbands.


3.1.1 Spatial Temporal Subband Weighting

In wavelet coding, we usually execute the bit allocation algorithm based on the distortion in the wavelet domain and measure its performance in the pixel domain. The subband weighting method ensures the equivalence of the quantization errors in the wavelet and pixel domains. In wavelet image coding, the weighting can be derived from the bi-orthogonal wavelets, as shown in [23,24]. Solving the subband weighting problem in wavelet video coding is more complicated because the discrepancy between the variances in the wavelet and pixel domains is caused by the wavelets as well as the MCTF process, which imposes a different connectivity status (single-connected, multiple-connected, or unconnected) on each pixel during motion prediction.

A possible solution to the above problem is to assign a weighting coefficient to the quantization error on each spatial-temporal subband, as proposed in [15]. In scalable video coding, users can subscribe to resolutions with different levels of spatial and temporal scalability; therefore, a subband's weighting coefficients may differ across resolutions. We use $w^{s,t}_{(xy,mn)}$ to denote the weighting coefficient of the subband $(xy,mn)$ for the spatial resolution $s$ and temporal resolution $t$. The coefficient value is computed as follows:

\[
w^{s,t}_{(xy,mn)} = w^{s}_{(xy)}\, w^{t}_{(mn)}, \tag{3.4}
\]

where $w^{s}_{(xy)}$ is the spatial weighting factor for spatial resolution $s$ and $w^{t}_{(mn)}$ is the temporal weighting factor for temporal resolution $t$. The detailed derivation of the values of the spatial and temporal weighting factors can be found in [15].

The weighted distortion of a GOP in the wavelet domain is obtained by

\[
\sum_{(xy,mn) \in S(s,t)} w^{s,t}_{(xy,mn)}\, D^{w}_{(xy,mn)}, \tag{3.5}
\]

where $D^{w}_{(xy,mn)}$ is the variance of a subband distortion in the wavelet domain, and $S(s,t)$ denotes the set of subbands used to reconstruct the GOP of spatial resolution $s$ and temporal resolution $t$.
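As a small illustration of Equations (3.4) and (3.5), the sketch below multiplies a spatial and a temporal factor to obtain each subband weight and then accumulates the weighted distortion. All numeric values and the function name `weighted_gop_distortion` are hypothetical; the actual weighting factors are derived in [15] and are not reproduced here.

```python
from typing import Dict, Iterable, Tuple

Spatial = Tuple[int, int]    # (x, y): y-th subband after the x-th spatial decomposition
Temporal = Tuple[int, int]   # (m, n): n-th subband after the m-th temporal decomposition
Subband = Tuple[Spatial, Temporal]


def weighted_gop_distortion(
    subbands: Iterable[Subband],
    w_spatial: Dict[Spatial, float],
    w_temporal: Dict[Temporal, float],
    d_wavelet: Dict[Subband, float],
) -> float:
    """Sum of w_spatial * w_temporal * D^w over the given subbands (Eqs. 3.4 and 3.5)."""
    total = 0.0
    for xy, mn in subbands:
        weight = w_spatial[xy] * w_temporal[mn]   # Eq. (3.4)
        total += weight * d_wavelet[(xy, mn)]     # one term of Eq. (3.5)
    return total
```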


3.2 Formulation of the Rate-Distortion Function

In this section, we formulate the rate-distortion function of a wavelet-based scalable video coder. We use non-negative integers to index the spatial and temporal resolutions. The lowest resolution is indexed by 0, and a higher resolution is indexed by a larger number. Let $p$ and $q$ denote the numbers of spatial and temporal decompositions, respectively; then, the spatial resolution index $s$ and the temporal resolution index $t$ lie in the ranges $\{0,1,\dots,p\}$ and $\{0,1,\dots,q\}$, respectively. Recall that we use $(xy,mn)$ to denote the spatial-temporal subband that is the $y$-th spatial subband after the $x$-th spatial decomposition and the $n$-th temporal subband after the $m$-th temporal decomposition.

Thus, if we let $W_{s,t}$ denote the set of subbands used to reconstruct a video of spatial resolution $s$ and temporal resolution $t$, then $W_{s,t}$ is comprised of

\[
\begin{aligned}
& \{(xy,mn) \mid x = p-s+1,\dots,p;\ y = 1,2,3;\ m = q-t+1,\dots,q;\ n = 2^{q-m},\dots,2^{q-m+1}-1\} \\
& \cup\ \{(p0,mn) \mid m = q-t+1,\dots,q;\ n = 2^{q-m},\dots,2^{q-m+1}-1\} \\
& \cup\ \{(xy,q0) \mid x = p-s+1,\dots,p;\ y = 1,2,3\}\ \cup\ \{(p0,q0)\}.
\end{aligned} \tag{3.6}
\]

Figure 3.2 shows an example with two spatial and three temporal decompositions ($p = 2$, $q = 3$). In the example, according to Equation (3.6), $W_{0,0}$ is $\{(20,30)\}$; $W_{1,0}$ is $\{(20,30), (21,30), (22,30), (23,30)\}$; $W_{0,1}$ is $\{(20,30), (20,31)\}$; $W_{1,1}$ is $\{(20,30), (21,30), (22,30), (23,30), (20,31), (21,31), (22,31), (23,31)\}$; and so on. The lowest resolution corresponds to $W_{0,0}$, and the highest resolution corresponds to $W_{2,3}$, which contains all the subbands in Figure 3.2(a) and Figure 3.2(b).
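The set construction in Equation (3.6) can be enumerated directly. The following sketch is only an illustration; the function name `subband_set` and the tuple encoding `((x, y), (m, n))` of a subband index $(xy,mn)$ are assumptions made for this example.

```python
from typing import Set, Tuple

Subband = Tuple[Tuple[int, int], Tuple[int, int]]  # ((x, y), (m, n))


def subband_set(p: int, q: int, s: int, t: int) -> Set[Subband]:
    """Enumerate W_{s,t} of Eq. (3.6) for p spatial and q temporal decompositions."""
    # High-frequency spatial subbands needed for spatial resolution s.
    spatial_high = [(x, y) for x in range(p - s + 1, p + 1) for y in (1, 2, 3)]
    # High-frequency temporal subbands needed for temporal resolution t.
    temporal_high = [
        (m, n)
        for m in range(q - t + 1, q + 1)
        for n in range(2 ** (q - m), 2 ** (q - m + 1))
    ]
    result: Set[Subband] = {((p, 0), (q, 0))}                          # lowest subband
    result |= {(xy, mn) for xy in spatial_high for mn in temporal_high}
    result |= {((p, 0), mn) for mn in temporal_high}
    result |= {(xy, (q, 0)) for xy in spatial_high}
    return result
```

For $p = 2$ and $q = 3$, `subband_set(2, 3, 0, 0)` contains only the lowest subband, and `subband_set(2, 3, 2, 3)` contains every subband of the decomposition.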

We also assume that all the subscribers to the same quality resolution $r$ receive the same bitstream; therefore, a subscriber to resolution $[s,t,r]$ can decode the substream corresponding to the subbands that support the spatial resolution $s$ and the temporal resolution $t$. For each quality resolution $r$, let $\beta_r[(xy,mn)]$ represent the number of bits assigned to subband $(xy,mn)$. This assumption simplifies our bit-allocation analysis significantly because we only need to consider the distribution of the bits within the same quality resolution. Obviously, we have

\[
\beta_{r+1}[(xy,mn)] \ge \beta_r[(xy,mn)] \tag{3.7}
\]

for each subband $(xy,mn)$. Let $b_r$ denote the maximum number of bits for all the subbands of quality resolution $r$ in a GOP, and let $W$ be the set of all subbands; then, the bit constraint for the quality resolution $r$ can be written as

\[
b_r = \sum_{z \in W} \beta_r[z], \tag{3.8}
\]

where $z$ ranges over all subbands in $W$. Recall that the average distortion of all the subbands of a frame in the resolution $[s,t,r]$ is represented by $\bar{D}^{w}_{[s,t,r]}$. We introduce a new notation $\Theta(s,t,\beta_r)$ for $\bar{D}^{w}_{[s,t,r]}$ to explicitly represent the average distortion in the wavelet domain as a function of $\beta_r$. According to Equation (3.5), we have

\[
\Theta(s,t,\beta_r) = \frac{1}{2^t} \sum_{z \in W_{s,t}} w^{s,t}_{z}\, \bar{D}^{w}_{z}(\beta_r[z]), \tag{3.9}
\]

where $\bar{D}^{w}_{z}(\beta_r[z])$ indicates the average distortion of subband $z$ encoded with $\beta_r[z]$ bits.
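To make Equation (3.9) concrete, the sketch below evaluates $\Theta$ for a given allocation. The text does not prescribe a closed form for $\bar{D}^{w}_{z}$, so the example assumes a classical exponential model, $\bar{D}^{w}_{z}(b) = \sigma_z^2\, 2^{-2b}$, purely for illustration; the function name `theta` and all input values are hypothetical.

```python
from typing import Dict, Hashable, Iterable


def theta(
    t: int,
    subbands: Iterable[Hashable],
    weights: Dict[Hashable, float],
    variances: Dict[Hashable, float],
    bits: Dict[Hashable, float],
) -> float:
    """Average weighted wavelet-domain distortion of Eq. (3.9).

    Assumption (not from the text): each subband follows the exponential
    model D_z(b) = variance_z * 2^(-2b).
    """
    total = 0.0
    for z in subbands:
        distortion = variances[z] * 2.0 ** (-2.0 * bits[z])
        total += weights[z] * distortion
    return total / 2 ** t   # the 1 / 2^t factor of Eq. (3.9)
```

Any other non-increasing rate-distortion model, including one measured empirically, could be substituted for the exponential curve without changing the structure of the computation.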

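Substituting the subbands that support the resolution $[s,t,r]$, as defined in Equation (3.6), for $W_{s,t}$ in Equation (3.9), we obtain

\[
\begin{aligned}
\Theta(s,t,\beta_r) = \frac{1}{2^t} \Bigg(
& \sum_{m=q-t+1}^{q} \sum_{n=2^{q-m}}^{2^{q-m+1}-1} \sum_{x=p-s+1}^{p} \sum_{y=1}^{3} w^{s,t}_{(xy,mn)}\, \bar{D}^{w}_{(xy,mn)}(\beta_r[(xy,mn)]) \\
& + \sum_{m=q-t+1}^{q} \sum_{n=2^{q-m}}^{2^{q-m+1}-1} w^{s,t}_{(p0,mn)}\, \bar{D}^{w}_{(p0,mn)}(\beta_r[(p0,mn)]) \\
& + \sum_{x=p-s+1}^{p} \sum_{y=1}^{3} w^{s,t}_{(xy,q0)}\, \bar{D}^{w}_{(xy,q0)}(\beta_r[(xy,q0)]) \\
& + w^{s,t}_{(p0,q0)}\, \bar{D}^{w}_{(p0,q0)}(\beta_r[(p0,q0)]) \Bigg).
\end{aligned} \tag{3.10}
\]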

Let $v$ be the number of quality resolutions, indexed by $0,\dots,v-1$, and let $\{\beta_0,\dots,\beta_{v-1}\}$ be the bit allocation profile. We can represent Equation (2.9) explicitly as $D(\beta_0,\dots,\beta_{v-1})$ to indicate the dependence of the average distortion of a GOP on the bit allocation profile. Then, we obtain

\begin{align}
D(\beta_0,\dots,\beta_{v-1}) &= \prod_{r=0}^{v-1} \prod_{s=0}^{p-1} \prod_{t=0}^{q-1} \bar{D}_{[s,t,r]}^{\,\mu_{[s,t,r]}} \tag{3.11} \\
&= \prod_{r=0}^{v-1} \prod_{s=0}^{p-1} \prod_{t=0}^{q-1} \Theta(s,t,\beta_r)^{\mu_{[s,t,r]}} \tag{3.12} \\
&= \prod_{r=0}^{v-1} D_r(\beta_r), \tag{3.13}
\end{align}

where $D_r(\beta_r)$ is the average distortion of quality resolution $r$ when the subscriber's preference factor $\mu_{[s,t,r]}$ is considered in weighting $\Theta(s,t,\beta_r)$.
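The product form in Equations (3.11) to (3.13) is a preference-weighted geometric mean of the per-resolution distortions. A minimal sketch of that computation follows; it works in the log domain to avoid underflow when many factors are multiplied, and it assumes the preference factors sum to one. The function name `gop_distortion` and the sample values are hypothetical.

```python
import math
from typing import Dict, Tuple

Resolution = Tuple[int, int, int]  # (s, t, r)


def gop_distortion(
    theta_values: Dict[Resolution, float],
    mu: Dict[Resolution, float],
) -> float:
    """Weighted geometric mean: product over [s,t,r] of Theta^mu (cf. Eq. 3.12)."""
    if abs(sum(mu.values()) - 1.0) > 1e-9:
        raise ValueError("preference factors are assumed to sum to 1")
    # Resolutions with zero preference contribute a factor of 1 and are skipped.
    log_d = sum(w * math.log(theta_values[k]) for k, w in mu.items() if w > 0.0)
    return math.exp(log_d)
```

For instance, two resolutions with distortions 4 and 16 and equal preferences of 0.5 give a GOP distortion of $\sqrt{4 \cdot 16} = 8$.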

The rate-distortion problem $(P)$ can now be formulated as finding the bit allocation profile $\{\beta_0,\dots,\beta_{v-1}\}$ that satisfies the constraints in Equations (3.7) and (3.8) and minimizes the distortion function specified in Equation (3.13):

\[
\begin{aligned}
\min\ & D(\beta_0,\beta_1,\dots,\beta_{v-1}) \\
\text{subject to}\ & \sum_{z \in W} \beta_i[z] = b_i, \quad \text{for } i = 0,\dots,v-1; \\
\text{and}\ & \beta_{i-1}[z] \le \beta_i[z], \quad \text{for } i = 1,\dots,v-1.
\end{aligned} \tag{3.14}
\]
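The two constraints of problem $(P)$ can be checked mechanically for a candidate profile. The sketch below is an illustration only; it represents each $\beta_i$ as a dictionary from subband to an integer bit count, and the function name `is_feasible` is hypothetical.

```python
from typing import Dict, Hashable, List, Sequence

Allocation = Dict[Hashable, int]  # bits assigned to each subband


def is_feasible(profile: List[Allocation], budgets: Sequence[int]) -> bool:
    """Check the budget constraint (Eq. 3.8) and the monotonicity constraint (Eq. 3.7)."""
    if len(profile) != len(budgets):
        return False
    for i, (beta, budget) in enumerate(zip(profile, budgets)):
        # Budget constraint: the bits of quality resolution i must sum to b_i.
        if sum(beta.values()) != budget:
            return False
        # Monotonicity: no subband may receive fewer bits than at resolution i - 1.
        if i > 0 and any(profile[i - 1][z] > beta[z] for z in beta):
            return False
    return True
```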

3.3 Solving Rate-Allocation with Known Preferences

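The optimal bit-allocation problem $(P)$ can be solved by solving a sequence of bit-allocation sub-problems $(P_r)$, with quality resolution $r = 0,\dots,v-1$. The sub-problem $(P_r)$ is defined as follows:

\[
\begin{aligned}
\min\ & D(\beta_0,\beta_1,\dots,\beta_{r-1},\beta_r,\beta_r,\dots,\beta_r) \\
\text{subject to}\ & \sum_{z \in W} \beta_i[z] = b_i, \quad \text{for } i = 0,\dots,r; \\
\text{and}\ & \beta_{i-1}[z] \le \beta_i[z], \quad \text{for } i = 1,\dots,r,
\end{aligned} \tag{3.15}
\]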

where $W$ is the set of all subbands, and $\{b_i\}$ is a given non-decreasing sequence that corresponds to the bit constraints. The problem $(P_r)$ allocates bits from the quality resolution 1 to $r$; hence, all the subscriptions for a quality resolution greater than $r$ will use the bit allocation result of the quality resolution $r$. Thus, we have $\beta_i = \beta_r$ for $i = r+1,\dots,v-1$ in Equation (3.15). The optimal bit-allocation problem $(P)$ can be solved by solving $(P_0)$, followed by $(P_1)$ based on the solution of $(P_0)$, and so on up to solving $(P_{v-1})$.
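The sequential structure described above can be sketched as a simple driver loop that passes the allocation of one quality resolution to the next as a lower bound. The driver `solve_sequentially` and the toy sub-problem solver `spread_evenly` are hypothetical names; in particular, `spread_evenly` is only a placeholder that distributes the extra bits round-robin and does not minimize any distortion.

```python
from typing import Callable, Dict, Hashable, List, Sequence

Allocation = Dict[Hashable, int]
SubSolver = Callable[[Allocation, int], Allocation]


def solve_sequentially(
    budgets: Sequence[int],
    subbands: Sequence[Hashable],
    solve_subproblem: SubSolver,
) -> List[Allocation]:
    """Solve (P_0), (P_1), ... in order, each on top of the previous result."""
    profile: List[Allocation] = []
    previous: Allocation = {z: 0 for z in subbands}  # nothing allocated before (P_0)
    for budget in budgets:
        beta = solve_subproblem(previous, budget)    # must keep beta[z] >= previous[z]
        profile.append(beta)
        previous = beta
    return profile


def spread_evenly(previous: Allocation, budget: int) -> Allocation:
    """Placeholder solver: hand out the additional bits round-robin (not optimal)."""
    beta = dict(previous)
    keys = sorted(beta)
    for i in range(budget - sum(previous.values())):
        beta[keys[i % len(keys)]] += 1
    return beta
```

Because every sub-problem starts from the previous allocation and only adds bits, the monotonicity constraint holds by construction as long as the budgets are non-decreasing.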

In the following subsections, we propose two methods to solve $(P_r)$. The first finds the upper bound of $(P_r)$ by a Lagrangian-based approach, and the second finds the exact solution by using the less efficient dynamic programming approach.

3.3.1 Lagrangian-based Solution

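The bit-allocation problem is usually analyzed by the Lagrangian multiplier method. By adopting the convention that $\prod_{i=a}^{b} f_i = 1$ for any function $f_i$ with $b < a$, the objective of $(P_r)$ can be rewritten as

\begin{align}
D(\beta_0,\dots,\beta_{r-1},\beta_r,\dots,\beta_r)
&= \prod_{k=0}^{r-1} D_k(\beta_k) \prod_{k=r}^{v-1} D_k(\beta_r) \tag{3.16} \\
&= \prod_{s=0}^{p-1} \prod_{t=0}^{q-1} \Big\{ \prod_{k=0}^{r-1} \Theta(s,t,\beta_k)^{\mu_{[s,t,k]}} \prod_{k=r}^{v-1} \Theta(s,t,\beta_r)^{\mu_{[s,t,k]}} \Big\}. \tag{3.17}
\end{align}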

Because $\sum_{s \in S,\, t \in T,\, r \in R} \mu_{[s,t,r]} = 1$ (see Equation (2.6)), by applying the generalized arithmetic mean-geometric mean inequality to Equation (3.17), we can obtain its upper bound as follows:

\[
D(\beta_0,\dots,\beta_{r-1},\beta_r,\dots,\beta_r) \le \sum_{s=0}^{p-1} \sum_{t=0}^{q-1} \Big\{ \sum_{k=0}^{r-1} \mu_{[s,t,k]}\, \Theta(s,t,\beta_k) + \sum_{k=r}^{v-1} \mu_{[s,t,k]}\, \Theta(s,t,\beta_r) \Big\}. \tag{3.18}
\]

Note that the solution of $(P_r)$ is based on the bit allocation results from resolution 0 to resolution $r-1$. Thus, the first term in Equation (3.18) is a constant $C$. We have

\[
D(\beta_0,\dots,\beta_{r-1},\beta_r,\dots,\beta_r) \le C + \sum_{s=0}^{p-1} \sum_{t=0}^{q-1} \sum_{k=r}^{v-1} \mu_{[s,t,k]}\, \Theta(s,t,\beta_r). \tag{3.19}
\]
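The inequality behind Equation (3.18) states that a weighted geometric mean never exceeds the weighted arithmetic mean taken with the same weights, provided the weights are non-negative and sum to one. A quick numerical check, with hypothetical function names and sample values, is sketched below.

```python
import math
from typing import Sequence


def weighted_geometric_mean(values: Sequence[float], weights: Sequence[float]) -> float:
    """Product of values[i] ** weights[i], computed in the log domain."""
    return math.exp(sum(w * math.log(v) for v, w in zip(values, weights)))


def weighted_arithmetic_mean(values: Sequence[float], weights: Sequence[float]) -> float:
    """Sum of weights[i] * values[i]."""
    return sum(w * v for v, w in zip(values, weights))


# Sample distortions and preference factors (the weights sum to 1).
distortions = [4.0, 16.0, 1.0]
preferences = [0.5, 0.25, 0.25]
gm = weighted_geometric_mean(distortions, preferences)
am = weighted_arithmetic_mean(distortions, preferences)
```

Here the geometric mean is $4^{0.5} \cdot 16^{0.25} \cdot 1^{0.25} = 4$ and the arithmetic mean is $2 + 4 + 0.25 = 6.25$. The two means coincide only when all the values are equal, so the bound is tight when the weighted resolutions have similar distortions.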


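Now we can find the solution for the problem $(P_r^{+})$, which is the upper bound of the problem $(P_r)$. The problem $(P_r^{+})$ is defined as

\[
\begin{aligned}
\min_{\beta_r}\ & \Omega_r(\beta_r) = \sum_{s=0}^{p-1} \sum_{t=0}^{q-1} \sum_{k=r}^{v-1} \mu_{[s,t,k]}\, \Theta(s,t,\beta_r) \\
\text{subject to}\ & \sum_{z \in W} \beta_r[z] = b_r.
\end{aligned} \tag{3.20}
\]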
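Since $\Omega_r$ is a weighted sum of per-subband distortions, a problem of the form $(P_r^{+})$ can be illustrated with a simple greedy marginal-gain allocator. This is a stand-in technique chosen for the example, not the Lagrangian procedure developed in this section: it assigns one bit at a time to the subband with the largest weighted distortion reduction, which reaches the optimum of the integer problem when every per-subband distortion curve is convex and non-increasing. The function name `greedy_allocate`, the aggregated coefficients `coeff`, and the callable `distortion` are all hypothetical.

```python
import heapq
from typing import Callable, Dict, Hashable

Allocation = Dict[Hashable, int]


def greedy_allocate(
    coeff: Dict[Hashable, float],
    distortion: Callable[[Hashable, int], float],
    floor: Allocation,
    budget: int,
) -> Allocation:
    """Minimize sum_z coeff[z] * distortion(z, beta[z]) with sum(beta) == budget.

    `floor` is the allocation of the previous quality resolution; starting from
    it enforces beta[z] >= floor[z] (the monotonicity constraint).
    """
    beta = dict(floor)
    remaining = budget - sum(beta.values())
    if remaining < 0:
        raise ValueError("budget is smaller than the bits already allocated")

    def gain(z: Hashable) -> float:
        # Weighted distortion reduction obtained by giving subband z one more bit.
        b = beta[z]
        return coeff[z] * (distortion(z, b) - distortion(z, b + 1))

    # Max-heap on the marginal gain, implemented with negated keys.
    heap = [(-gain(z), z) for z in beta]
    heapq.heapify(heap)
    for _ in range(remaining):
        _, z = heapq.heappop(heap)
        beta[z] += 1
        heapq.heappush(heap, (-gain(z), z))
    return beta
```

Calling the function once per quality resolution, with the previous result passed as `floor`, yields a profile that satisfies both the budget and the monotonicity constraints.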

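After substituting Equation (3.10) into Equation (3.20) for $\Theta(s,t,\beta_r)$, we have

\[
\begin{aligned}
\Omega_r(\beta_r) = \sum_{s=0}^{p-1} \sum_{t=0}^{q-1} \frac{1}{2^t} \Big( \sum_{k=r}^{v-1} \mu_{[s,t,k]} \Big) \Bigg\{
& \sum_{m=q-t+1}^{q} \sum_{n=2^{q-m}}^{2^{q-m+1}-1} \sum_{x=p-s+1}^{p} \sum_{y=1}^{3} w^{s,t}_{(xy,mn)}\, \bar{D}^{w}_{(xy,mn)}(\beta_r[(xy,mn)]) \\
& + \sum_{m=q-t+1}^{q} \sum_{n=2^{q-m}}^{2^{q-m+1}-1} w^{s,t}_{(p0,mn)}\, \bar{D}^{w}_{(p0,mn)}(\beta_r[(p0,mn)]) \\
& + \sum_{x=p-s+1}^{p} \sum_{y=1}^{3} w^{s,t}_{(xy,q0)}\, \bar{D}^{w}_{(xy,q0)}(\beta_r[(xy,q0)]) \\
& + w^{s,t}_{(p0,q0)}\, \bar{D}^{w}_{(p0,q0)}(\beta_r[(p0,q0)]) \Bigg\},
\end{aligned} \tag{3.21}
\]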

where $\frac{1}{2^t} \sum_{k=r}^{v-1} \mu_{[s,t,k]}$ is the weighting on resolution $[s,t,r]$ in terms of the preference factors. We use

\[
g[s,t,r] = \frac{1}{2^t} \sum_{k=r}^{v-1} \mu_{[s,t,k]} \tag{3.22}
\]

to denote the preferred weighting for the resolution $[s,t,r]$. Then, we rearrange the summation order of the four terms in Equation (3.21) such that the first term becomes

\[
\sum_{m=2}^{q} \sum_{n=2^{q-m}}^{2^{q-m+1}-1} \sum_{x=2}^{p} \sum_{y=1}^{3} \Big\{ \sum_{s=p-x+1}^{p-1} \sum_{t=q-m+1}^{q-1} g[s,t,r]\, w^{s,t}_{(xy,mn)} \Big\}\, \bar{D}^{w}_{(xy,mn)}(\beta_r[(xy,mn)]), \tag{3.23}
\]

the second term becomes

\[
\sum_{m=2}^{q} \sum_{n=2^{q-m}}^{2^{q-m+1}-1} \Big\{ \sum_{s=0}^{p-1} \sum_{t=q-m+1}^{q-1} g[s,t,r]\, w^{s,t}_{(p0,mn)} \Big\}\, \bar{D}^{w}_{(p0,mn)}(\beta_r[(p0,mn)]); \tag{3.24}
\]

and the third and the fourth terms become

\[
\sum_{x=2}^{p} \sum_{y=1}^{3} \Big\{ \sum_{s=p-x+1}^{p-1} \sum_{t=0}^{q-1} g[s,t,r]\, w^{s,t}_{(xy,q0)} \Big\}\, \bar{D}^{w}_{(xy,q0)}(\beta_r[(xy,q0)]), \tag{3.25}
\]
