National Chiao Tung University
Department of Electronics Engineering and Institute of Electronics
Doctoral Dissertation
Fast Encoding Algorithm Design for
H.264/MPEG-4 AVC Scalable Video Coding
Standard
Student: Hung-Chih Lin
Advisor: Hsueh-Ming Hang
Student: Hung-Chih Lin
Advisor: Dr. Hsueh-Ming Hang
A Dissertation
Submitted to the Department of Electronics Engineering and Institute of Electronics,
College of Electrical and Computer Engineering, National Chiao Tung University,
in Partial Fulfillment of the Requirements for the Degree of
Doctor of Philosophy in Electronic Engineering
June 2010
Hsinchu, Taiwan, Republic of China
Abstract (in Chinese)
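To enable video to be transmitted robustly over heterogeneous networks, the H.264/MPEG-4 AVC video coding standard (H.264/AVC) has been extended with a scalable video coding standard (H.264/SVC). H.264/SVC mainly provides three types of scalability: temporal, spatial, and quality scalability. With the video compressed only once, H.264/SVC allows a partial bit-stream to be extracted according to different storage and transmission requirements and decoded into video with a lower frame rate or lower resolution. H.264/SVC adopts the hierarchical temporal prediction coding structure to achieve temporal scalability; this structure contains two uni-directional predictions and one bi-directional prediction (BI prediction). In addition, a layered coding approach is adopted to realize spatial and quality scalability, with hierarchical temporal prediction as the basic structure of each coding layer. To maximize coding efficiency, H.264/SVC evaluates all possible combinations of coding parameters, including the coding tools of H.264/AVC and the newly introduced inter-layer prediction mechanism. However, selecting coding parameters in this way incurs a very large amount of computation. Experimental data indicate that the mode decision process and the motion estimation it requires account for the vast majority of the encoding computation. Therefore, it is necessary to develop fast algorithms that reduce the computational complexity of H.264/SVC.
First, for the dyadic hierarchical temporal prediction structure, we propose an efficient fast algorithm for selecting the temporal prediction type. Based on the best temporal prediction type selected for the 16x16 partition, and exploiting its strong inheritance property, unnecessary bi-directional prediction computations in the large partition modes (16x8, 8x16, and 8x8) can be effectively avoided. In addition, we carefully identify the relationship between the distortions and motion rates of the uni-directional and bi-directional predictions, and use it to set a group of adaptive thresholds that exclude unnecessary bi-directional prediction operations. For the small partition modes (8x4, 4x8, and 4x4), our analysis shows that not only can the best temporal prediction type be inferred from the 8x8 partition, but the coding gain of the bi-directional prediction is also very limited. These analyses can therefore be used to effectively eliminate ineffective bi-directional prediction computations in the hierarchical temporal prediction structure.
Next, in the intra-only scalable coding structure, the distortions and rates of the Intra4x4 and Intra8x8 predictions have a log-linear relationship between layers. Using this property together with the best intra prediction mode selected at the base layer, the number of intra prediction tests at the enhancement layer can be greatly reduced. In addition, for smoother image regions, we retain the evaluation of the Intra16x16 prediction.
Finally, in the inter-prediction scalable coding structure, considering the combination of temporal and quality scalability, we propose an intra/inter mode and motion vector selection algorithm. We observe the coding performance of different partition modes and the conditional probability distributions of partition modes between layers. For intra modes, based on the information of the reference/base layer, the enhancement layer can save at least half of the tests. For inter modes, the partition-mode look-up table consulted by the enhancement layer is adjusted according to the difference in quantization parameters between layers. In addition, to reduce the computation of the motion search, the reference frame positions of the base layer can be selectively reused, and the motion vector selected at the base layer can also be used as the initial search point at the enhancement layer.
In summary, this dissertation excludes rare combinations of coding modes by analyzing and observing the high correlation between layers. Experimental data show that, compared with the H.264/SVC standard reference software, our proposed fast algorithms save 65% to 85% of the encoding time while incurring only a slight loss in coding performance.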
Fast Encoding Algorithm Design for
H.264/MPEG-4 AVC Scalable Video Coding
Standard
Student: Hung-Chih Lin Advisor: Dr. Hsueh-Ming Hang
Department of Electronics Engineering and Institute of Electronics
National Chiao Tung University
Abstract
To enable robust video transmission over heterogeneous networks, the H.264/MPEG-4 AVC standard (H.264/AVC) has been extended with a scalable video coding scheme (H.264/SVC). H.264/SVC offers three main modalities of scalability: temporal, spatial, and quality. It compresses the video signal once yet allows the encoded bit-stream to be partially decoded at a lower frame rate or spatial resolution, depending on the storage and transmission requirements. To achieve temporal scalability, H.264/SVC uses the hierarchical temporal prediction structure, in which there are two uni-directional predictions and one bi-directional (BI) prediction. In addition, spatial and quality scalability are realized by a layered coding approach, where the hierarchical temporal prediction forms the basic coding structure of each coding layer. To provide high coding efficiency, H.264/SVC exhaustively evaluates all possible combinations of encoding parameters, including the conventional coding tools of H.264/AVC and the novel inter-layer prediction mechanism. However, this procedure for selecting the optimal coding parameters incurs a huge computational cost. Experimental data indicate that the mode decision process and its associated motion estimations dominate the overall encoding time. Hence, it is necessary to develop fast encoding algorithms that reduce the encoding computations in H.264/SVC.
First, we propose a fast algorithm that efficiently selects the temporal prediction type for the dyadic hierarchical-B prediction structure in H.264/SVC temporal scalable video coding. Referring to the best temporal prediction type of the 16x16 partition, we utilize the strong correlation of prediction type inheritance to eliminate unnecessary BI prediction computations in the finer partitions (16x8, 8x16, and 8x8). In addition, we carefully examine the relationship of motion-rate costs and distortions between the BI and the two uni-directional temporal prediction types and, as a result, construct a set of adaptive thresholds to remove unnecessary BI calculations. Moreover, our analysis points out that the coding efficiency of the BI prediction is limited in small partitions. For block partitions smaller than 8x8, one of the two uni-directional temporal predictions is skipped based upon the information of the 8x8 partition. Hence, these analyses can be used to efficiently reduce the extensive computational burden of performing the BI prediction.
Second, we make use of the log-linear rate-distortion relationship of inter-dependent layers to predict the better performer among the Intra4x4 and Intra8x8 prediction types at the enhancement layers for intra-only scalable video coding. Based upon the base-layer chosen prediction type, we can further reduce the number of candidate modes. In addition, to ensure the best trade-off between complexity and coding efficiency, the Intra16x16 prediction is retained and enabled only for coding high-resolution videos with smooth image contents.
Finally, we provide a layer-adaptive intra/inter mode decision algorithm and a motion search scheme for the hierarchical B-frames in H.264/SVC with combined coarse-grain quality scalability (CGS) and temporal scalability. We examine the rate-distortion performance contributed by different coding modes at the enhancement layers and the mode conditional probabilities at different temporal layers. For the intra prediction on inter frames, the number of Intra4x4/Intra8x8 prediction modes can be reduced by 50% or more, based on the reference/base-layer intra prediction directions. For the enhancement-layer inter prediction, look-up tables of inter prediction candidate modes are designed to exploit the dependence of the macroblock coding mode on the quantization parameters of the current and reference/base layers. In addition, to avoid checking all motion estimation reference frames, the base-layer reference frame index is selectively reused. Moreover, depending on the enhancement-layer macroblock partition, the base-layer motion vector can be used as the initial search point for the enhancement-layer motion search.
In conclusion, our proposed algorithms efficiently eliminate the unlikely combinations of coding options. The experiments show that, compared to the reference software of H.264/SVC, our approaches reduce the encoding time by 65% to 85% with similar coded quality.
Acknowledgments
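First, I would like to thank my advisor, Dr. Hsueh-Ming Hang, for his extensive guidance over these six years, during which he continually led me toward the forefront of the field in my academic research. Over the six years, our weekly discussions gradually nurtured my potential to move forward and helped me develop the ability to solve problems independently and to express my ideas clearly. Beyond that, Prof. Hang's teaching by word and by example has been an excellent model for his students.
I would also like to thank Prof. 彭文孝 for his guidance and suggestions on scalable video coding and on paper writing. His expertise and attitude toward research taught me a great deal about video compression, and I am grateful that he offered feasible research directions and proper research goals whenever I was in doubt.
From entering the master's program, through moving directly into the doctoral program, to now receiving my degree, looking back on these years feels like stepping into a time tunnel of memories. In the early days of graduate school, in order to enter the field of video compression, I began my research career under the guidance of Prof. Hang and the senior members of the laboratory. After joining the big family of the Communication Electronics and Signal Processing Laboratory (Commlab), I met many outstanding senior and junior colleagues and classmates. We did research, discussed coursework, and chatted together; these memories, sweet and bitter alike, are a precious and indelible part of my life. Commlab provided an excellent research environment, with ample resources for my research and experiments. I also thank all members of the laboratory (張峰城, 洪崑健, 蔡家揚, 蔡彰哲, 李志鴻, 洪朝雄, 蔡崇諺, 呂家賢, 黃育彰, 陳旻弘, 劉建志, 陳勇竹, 陳治傑, 鄭凱庭, 陳錫祺, 林耀屳, 陳威年, 葉尚諭, 張順成, 江清德, 陳豐進, 王志偉, 曾劭學, 陸凱暐, 吳崇豪, 吳思賢, 陳呈毓, 周正偉, 李兆軒, 柯俊言, 翁郁婷, and others) for creating an environment full of vitality, warmth, and harmony, of which I have always been proud as a member of the laboratory. Thank you for your care and help over these years; your company made my research life richer and more colorful, and I wish all of you the very best on the road ahead.
In addition, I would like to thank the members of my oral defense committee: Prof. 楊家輝 of the Department of Electrical Engineering, National Cheng Kung University; Prof. 陳美娟 of the Department of Electrical Engineering, National Dong Hwa University; Prof. 林嘉文 of the Department of Electrical Engineering, National Tsing Hua University; Prof. 王聖智 of the Department of Electronics Engineering, National Chiao Tung University; and Prof. 彭文孝 of the Department of Computer Science, National Chiao Tung University. Thank you for taking time out of your busy schedules to guide me; your valuable suggestions have made this dissertation more complete.
I thank all those who have continually helped me in my research and allowed me to regard this degree with greater humility. I hope this doctoral degree will become a driving force that keeps pushing me to improve.
Finally, I would like to thank my family for their care, assistance, and tolerance over these years. Without their encouragement and support, I would not have achieved what I have today. I therefore dedicate this dissertation to all those who love me and all those I love.
Hung-Chih Lin
National Chiao Tung University, Hsinchu, Taiwan
June 2010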
Table of Contents
Abstract (in Chinese)
Abstract
Acknowledgments
Table of Contents
List of Figures
List of Tables
List of Symbols
Chapter 1 Introduction
Section 1.1 Motivation
Section 1.2 Research Contributions
Section 1.3 Dissertation Organization
Chapter 2 Introduction to H.264/MPEG-4 SVC Coding System
Section 2.1 H.264/MPEG-4 AVC Architecture
2.1.1 Architecture Overview
2.1.2 Basic Coding Tools
2.1.2.1 Intra Prediction
2.1.2.2 Inter Prediction, Motion Estimation and Motion Compensation
2.1.2.2.1 Variable Block-Size Motion Compensation
2.1.2.2.2 Hierarchical-B Prediction Structure with Bi-directional Motion Compensation
2.1.2.3 Transform, Scaling, and Quantization
2.1.2.4 In-loop Deblocking Filter
2.1.2.5 Entropy Coding
Section 2.2 Additional Coding Tools in H.264/SVC
2.2.1 Overview of Layered Coding Structure
2.2.2 Inter-Layer Prediction Tools
2.2.2.1 Inter-Layer Motion Prediction
2.2.2.2 Inter-Layer Residual Prediction
2.2.2.3 Inter-Layer Intra Texture Prediction
Section 2.3 Rate-Constrained Coder Control in H.264/AVC-Based Video Standards
2.3.1 Optimization Using Lagrangian Schemes
2.3.2 Lagrangian Optimization in Hybrid Video Coding
2.3.2.1 Rate-Constrained Motion Estimation – Selection Process in Temporal Prediction Type
2.3.2.2 Rate-Constrained Mode Decision Process
Section 2.4 Problem Statement
2.4.1 Complexity Analysis in H.264/SVC Coder
2.4.2 Our Goal
Chapter 3 Fast Temporal Prediction Selection in H.264/AVC Temporal Scalable Video Coding
Section 3.1 Literature Review
Section 3.2 Observations and Analysis on Temporal Prediction at Temporal Enhancement Layers
3.2.1 Inheritance of Temporal Prediction Types
3.2.1.1 Prediction Type Distributions
3.2.1.2 Elimination of BI for Large Partitions
3.2.1.3 Consistency of FW and BW in Small Partitions
3.2.2 Rate-Distortion Contribution by BI
3.2.3 Rate-Distortion Relationships between Uni-directional Predictions and Bi-directional Prediction
3.2.3.1 Motion Vector Difference
3.2.3.2 Motion-Rate Cost
3.2.3.3 Distortion Relationship
Section 3.3 Proposed Schemes – Temporal Prediction Inheritance with Adaptive Thresholds for Bi-directional Prediction Selection
3.3.1 Adaptive Thresholds
3.3.2 Algorithm Overview
3.3.2.1 Early Termination on BI for Large Partitions
3.3.2.2 Adaptive Prediction Type Selection for Small Partitions
Section 3.4 Experimental Results and Discussions
3.4.1 Test Conditions
3.4.2 Performance Measures
3.4.3 Performance Comparison with JSVM
3.4.4 Performance Comparison with State-of-the-art Fast Algorithms
Chapter 4 Fast Mode Decision Algorithm for Intra-only Scalable Video Coding with Combined Coarse Granular Scalability (CGS) and Spatial Scalability
Section 4.1 Literature Review
Section 4.2 Statistical Analysis of Intra Predictions
4.2.1 Mode Correlation of Base and Enhancement Layers
4.2.2 Rate-distortion Profile of Intra Prediction
Section 4.3 Proposed Macroblock-Adaptive Rate-Distortion Estimation Algorithm [82]
4.3.1 Algorithm Overview
4.3.2 Macroblock-Adaptive Rate-Distortion Estimation
4.3.3 Layer-Adaptive Intra Mode Selection
Section 4.4 Experiments
Chapter 5 Fast Mode Selection and Motion Search for Scalable Video Coding with Combined Coarse Granular Scalability (CGS) and Temporal Scalability
Section 5.1 Literature Review
Section 5.2 Correlations between Base and Enhancement Layers
5.2.1 Distributions of Intra Prediction Mode in CGS
5.2.2 Distributions of Inter Prediction Mode in CGS
5.2.3 Temporal Reference Frames between Coding Layers
5.2.4 Inter-Layer Residual Prediction in Transform/Pixel Domain
Section 5.3 Proposed Approaches – Layer-Adaptive Mode Decision and Motion Search
5.3.1 Layer-Adaptive Mode Decision for Hierarchical-B Frames
5.3.1.1 Intra Mode Selection
5.3.1.2 Inter Mode Selection
5.3.2 Layer-Adaptive Reference Frame and Motion Reuse
Section 5.4 Experiments and Discussions
5.4.1 Test Conditions
5.4.2 Performance Measures
5.4.3 Simulation Results
5.4.4 Performance Comparison with State-of-the-art Fast Algorithms
Chapter 6 Conclusions and Future Work
Section 6.1 Concluding Remarks
6.1.1 Fast Bi-directional Prediction Selection in H.264/AVC Temporal Scalable Video Coding
6.1.2 Fast Mode Decision Algorithm with Macroblock-Adaptive Rate-Distortion Estimation for Intra-only Scalable Video Coding
6.1.3 Fast Context-adaptive Mode Decision Algorithm for Scalable Video Coding with Combined Coarse-grain Quality Scalability (CGS) and Temporal Scalability
Section 6.2 Future Work
Appendix Distribution of the Approximated Distortion
Bibliography (in order of appearance)
Curriculum Vitae
List of Figures
Fig. 2-1 H.264/AVC encoder structure [4]
Fig. 2-2 Directional prediction modes of intra 4x4 and the reference prediction pixels A to M
Fig. 2-3 The scan order of sixteen 4x4 sub-blocks
Fig. 2-4 Prediction modes of intra 16x16
Fig. 2-5 Variable block-size macroblock partition
Fig. 2-6 An example of hierarchical-B prediction structure with GOP size = 16
Fig. 2-7 Syntax elements and their combinations for the inter-layer prediction in the coarse-grain quality scalability (CGS) [2][7]
Fig. 2-8 Selection process for choosing the best temporal prediction type
Fig. 2-9 Flowchart of mode decision at enhancement layer for hierarchical-B frames in JSVM 9.11 [10]
Fig. 3-1 Distribution of temporal prediction types (FW, BW, and BI) at different temporal enhancement layers
Fig. 3-2 The performance index for individual hierarchical-B frame
Fig. 3-3 PDFs and CDFs of the motion vector difference for 16x16 and 8x8 blocks with two selected values
Fig. 3-4 Distributions of motion-rate costs and
Fig. 3-5 Fast selection algorithm for temporal prediction types
Fig. 3-6 Comparisons in rate-distortion curve with GOP = 8
Fig. 3-7 Comparisons in rate-distortion curve with GOP = 16
Fig. 4-1 Block address mapping for intra direction mode: (a) CGS, (b) spatial scalability with 1-to-1 mapping, and (c) spatial scalability with 1-to-4 mapping
Fig. 4-2 Probability profiles of “similarity” between coding layers: (a) CGS and (b) spatial scalability (FOREMAN)
Fig. 4-3 Rate-distortion profiles between CGS layers for (a) Intra4x4 and (b) Intra8x8 (FOREMAN)
Fig. 4-4 Fast mode decision algorithm with rate-distortion estimation and layer-adaptive mode selection
Fig. 4-5 Inter-layer dependency settings of H.264/SVC encoder [2] for (a) CGS, (b) dyadic spatial scalability, and (c) combined scalability
Fig. 5-1 Distribution of intra prediction types at CGS enhancement layers
Fig. 5-2 One-to-one block address mapping of CGS
Fig. 5-3 Similarity probability profiles of intra direction mode at CGS enhancement layer with poor-quality base layer and high-quality base layer
Fig. 5-4 Conditional probability of inter partition mode at CGS enhancement layers for , between 20 to 40, and GOP size = 16
Fig. 5-5 Four regions representing different degrees of mode correlations between coding layers
Fig. 5-6 Agreement in selecting reference frames between base layer and enhancement layer
Fig. 5-7 Inter-layer dependency structure in our scheme: (a) two-layer case, and (b) four-layer case
Fig. 5-8 Flowchart of the proposed inter mode decision algorithm for CGS enhancement layers
Fig. 5-9 Layer-adaptive mode set selection
Fig. 5-10 Comparison of rate-distortion performance of JSVM 9.11 [10] at enhancement layers
Fig. 5-11 Layer-adaptive selection in reference frame index and initial search point for hierarchical-B frames
Fig. 5-12 Rate-distortion curves of JSVM 9.11 [10] and our approaches
Fig. A-1 The average test-statistic value for individual hierarchical-B frame
List of Tables
Table 2-1 Determination of Boundary-Strength [27]
Table 2-2 Complexity ratio compared to IPPP coding structure (for a GOP)
Table 3-1 Conditional probabilities of , , and
Table 3-2 Conditional probabilities of and
Table 3-3 Average for 16x16, 8x8, and 4x4 blocks in each temporal enhancement layer (in percentage)
Table 3-4 Optimal value for the linear regression model (3.11)
Table 3-5 value in and value for derivation of
Table 3-6 Testing conditions
Table 3-7 Individual time saving contributed by TP and AT
Table 3-8 Performance comparisons with JSVM 9.11 [10] when GOP size is 8
Table 3-9 Performance comparisons with JSVM 9.11 [10] when GOP size is 16
Table 3-10 Overall time saving with various values
Table 4-1 Look-up table for layer-adaptive intra mode selection
Table 4-2 Testing conditions
Table 4-3 Performance comparisons
Table 4-4 Layer complexity ratio of enhancement-layer encoding time to base-layer encoding time
Table 5-1 Turning off the inter-layer residual prediction in transform domain for hierarchical-B frames (JSVM 9.11 [10])
Table 5-2 Turning off the inter-layer residual prediction in pixel domain for hierarchical-B frames (JSVM 8.0 [77])
Table 5-3 Encoding procedures on the hierarchical-B frames at CGS enhancement layers
Table 5-4 Coding type agreement between base layer and enhancement layer in hierarchical-B frames
Table 5-5 Candidate modes of inter prediction for
Table 5-6 Candidate modes of inter prediction for
Table 5-7 Candidate modes of sub-MB of inter prediction for all values
Table 5-8 Testing conditions
Table 5-9 Average time saving of MD and MR/RF
Table 5-10 Performance comparisons with setting of
Table 5-11 Performance comparisons with setting of
Table 5-12 Average complexity ratio of the base layer to one CGS enhancement layer
Table 5-14 Performance comparisons with Li’s methods [80] and [88]
Table 5-15 Performance comparisons with Ren’s method [89]
Table A-1 The average Kolmogorov-Smirnov test-statistic value for temporal enhancement layer
Table A-2 The average test-statistic value for temporal enhancement layer
List of Symbols
(in order of appearance)
BI Bi-directional temporal prediction
CGS Coarse-grain scalability
Quantization parameter
GOP Group of pictures
The -th temporal layer ( : temporal base layer; : temporal enhancement layer where )
FW Forward temporal prediction
BW Backward temporal prediction
Initial quantization parameter
Quantization parameter of temporal layer
ABT Adaptive block-size transforms
KLT Karhunen-Loeve transform
DCT Discrete cosine transform
Quantization step size
Boundary strength used in the loop filter
CAVLC Context-adaptive variable length coding
CABAC Context-adaptive binary arithmetic coding
Motion vector
Temporal prediction type (FW, BW, or BI)
The Lagrangian cost of temporal prediction in rate-constrained motion estimation
Distortion measured as the sum of the absolute differences (SAD)
The number of bits representing motion vector(s)
The Lagrangian multiplier in rate-constrained motion estimation
Block partition mode
The Lagrangian cost of partition mode in rate-constrained mode decision
Distortion measured as the sum of the absolute differences (SAD)
The number of bits resulting from entropy coding
The Lagrangian multiplier in rate-constrained mode decision
The set of uni-directional predictions (FW and BW)
The sub-block (pixel set) specified by block mode
Predictive motion vector
The pixel values of the forwardly reconstructed frame
The pixel values of the backwardly reconstructed frame
The motion vector found by FW
The motion vector found by BW
The motion vectors found by BI
The motion-rate cost defined by
The set of all possible partition modes
MER Motion estimation dedicated to the motion search with residual prediction
MEM Motion estimation dedicated to the motion search with motion prediction
MER+M Motion estimation dedicated to the motion search with both residual and motion predictions
MEO Motion estimation without residual and motion predictions
The probability of both 16x16 partition and partition mode not belonged to
BI, defined by , where
The probability of temporal prediction inheritance for small block partitions,
defined by , where
The relative rate-distortion improvement of the best temporal prediction
The sum of the relative rate-distortion improvement from those blocks selecting The temporal prediction
BI performance index of blocks
The motion vector difference (Euclidean distance) of and , where
PDF Probability distribution function
CDF Cumulative distribution function
The sum of and
An estimation of
The slope value representing the linear relationship of motion-rate costs
An approximation of , in which the prediction signal is the average of the two reference blocks found by FW and BW
The x-direction motion vector found by temporal prediction , where
The y-direction motion vector found by temporal prediction , where
The prediction error in the corresponding location , defined by
An upper bound of which is the average of and
Gamma distribution where is the shape parameter and is the scale parameter
An estimation of by taking the mean value of a Gamma distribution
The inverse CDF of a Gamma distribution
The ratio of to for partition mode
An adaptive threshold (for partition mode ) used to eliminate inefficient BI computation
BDP (dB) The averaged Y-PSNR loss by the Bjontegaard metric
BDR (%) The averaged bit-rate increase by the Bjontegaard metric
The overall time saving in encoding process
The additional computation of our approach as compared to the IPPP coding structure in JSVM reference software
value at the base layer
value at the enhancement layer
Denote the intra prediction type, Intra4x4 or Intra8x8
The distortion of an enhancement-layer macroblock for intra type and the enhancement-layer is
An estimation of
The decay ratio of distortion between CGS intra-coding layers
The rate of an enhancement-layer macroblock for intra type and the enhancement-layer is
An estimation of
The increase ratio of rate between CGS intra-coding layers
PSNR difference between our approach and the JSVM
Bit-rate difference between our approach and the JSVM
The optimal base-layer partition mode
The optimal enhancement-layer partition mode
The optimal finer partition mode when is 8x8
The reference-layer value
The reference frame indices of the best block mode at the base layer
The reference frame indices of the best block mode at the enhancement layer
The reference frame index of forward prediction
The reference frame index of backward prediction
The reference frame indices of the sub-optimal block mode at the base layer
The time reduction at the enhancement layers
The base-layer encoding time with integer transform size of
The Kolmogorov-Smirnov test-statistic value
The -test statistic value
The bivariate Gaussian distribution
The bivariate Laplacian distribution
Chapter 1 Introduction
In the past few decades, the delivery of motion pictures over various channels, including wired and wireless networks, has become a popular application; examples include video on cell phones and digital television broadcasting [1]. To cope with the constraints imposed by heterogeneous network environments and terminal capabilities, a desirable video coding scheme encodes the video source only once, at the highest resolution, and allows the coded bit-stream to be partially decoded for a specific target (bit-rate, frame rate, and resolution). Such a scheme is called scalable video coding, which was recently developed and adopted by the international MPEG video standards. Currently, there are two major approaches to scalable video coding: the DCT-based scheme and the wavelet-based coding method. The coding concepts of the two approaches are rather similar, particularly in removing temporal redundancy. The scalable video coding extension of H.264/MPEG-4 AVC is a conventional block-based coding scheme and was accepted as an ITU-T/MPEG standard in 2007 [2]. On the other hand, the newer coding structure realized by the wavelet technique has potential advantages, as mentioned in [3]. In this dissertation, we focus only on the scalable video coding extension [2] of H.264/MPEG-4 AVC [4] (referred to hereafter as H.264/SVC), especially on encoding complexity analysis and fast encoding parameter selection.
Section 1.1 Motivation
In response to the increasing demand for scalability features in many applications, the Joint Video Team has recently standardized, based upon H.264/MPEG-4 AVC [4] (referred to hereafter as H.264/AVC), a scalable video coding standard [2] that furnishes spatial, temporal, quality (also termed SNR), and combined scalabilities within a fully scalable bit-stream. By employing multilayer coding along with hierarchical temporal prediction [5][6], H.264/SVC [2] encodes a video sequence into an inter-dependent set of scalable layers, allowing a variety of viewing devices to perform discretionary layer extraction and partial decoding according to their playback capability, processing power, and/or network quality. As a scalable extension of H.264/AVC, H.264/SVC [2] inherits all the coding tools of H.264/AVC [4] and additionally incorporates an adaptive inter-layer prediction mechanism [7] to reduce the coding efficiency loss relative to state-of-the-art single-layer coding [8][9]. Superior coding efficiency is achieved with little increase in decoding complexity by means of the so-called single-loop decoding. These key features distinguish the H.264/SVC scheme [2] from the scalable systems in prior video coding standards.
An H.264/SVC encoder [2], the operations of which are non-normative, can be quite flexible in its implementation, as long as its bit-streams conform to the specification. The current Joint Scalable Video Model (JSVM) v.9 [10] employs a bottom-up encoding process that adopts an exhaustive mode search for encoding parameter selection. The exhaustive search strategy, though providing good rate-distortion performance, spends a large amount of computation on evaluating each possible coding option, and it turns out that most of these options bring little benefit in coding efficiency.
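As an illustrative sketch of why the exhaustive search is costly, the following Python fragment evaluates every candidate coding option with a Lagrangian cost J = D + λR and keeps the cheapest one. The `Candidate` class, the distortion and rate figures, and the λ values are hypothetical and chosen only for illustration; they are not measured data and do not reproduce the JSVM implementation.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """One coding option with hypothetical, pre-computed costs."""
    mode: str
    distortion: float  # distortion of the option (illustrative value)
    rate_bits: float   # bits needed to code the option (illustrative value)


def lagrangian_cost(c: Candidate, lam: float) -> float:
    """Rate-distortion cost J = D + lambda * R."""
    return c.distortion + lam * c.rate_bits


def exhaustive_mode_decision(candidates, lam):
    """Evaluate every candidate; return the best one and the evaluation count."""
    best = None
    best_cost = float("inf")
    evaluated = 0
    for c in candidates:
        evaluated += 1  # each option costs one full evaluation
        cost = lagrangian_cost(c, lam)
        if cost < best_cost:
            best, best_cost = c, cost
    return best, evaluated


# Hypothetical inter partition candidates for one macroblock.
candidates = [
    Candidate("16x16", 1200.0, 20.0),
    Candidate("16x8", 1100.0, 34.0),
    Candidate("8x16", 1150.0, 33.0),
    Candidate("8x8", 1000.0, 60.0),
]

best, n = exhaustive_mode_decision(candidates, lam=10.0)
print(best.mode, n)
```

With these illustrative numbers, a large λ favors the coarse partition and a small λ favors the fine one, yet all candidates are evaluated in both cases. A fast algorithm aims to shorten the candidate list before this loop runs.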
Our study reveals that a large percentage of computations come from encoding the temporal/spatial/quality enhancement layers; more specifically, a quality enhancement layer requires approximately three times the computations of its base layer due to the extra motion search for inter-layer motion estimation and residual prediction. Fast encoding parameter selection algorithms are thus desirable and advisable for reducing the enhancement-layer computational complexity without significantly sacrificing the rate-distortion performance.
Section 1.2 Research Contributions
The contributions of this dissertation mainly focus on the development of fast parameter selection algorithms, as described below:
The coding parameters of temporal prediction type in temporal scalability: Our proposed scheme provides up to a 66% reduction in encoding time, or equivalently, about a 3x speed-up.
The hierarchical prediction structure of H.264/SVC requires an additional 250% and 120% of computations compared to the low-delay IPPP coding structure and the hierarchical prediction structure without the bi-directional prediction, respectively. That is, the bi-directional prediction consumes more than 50% of the overall encoding time.
8x8).
The proposed measure index of relative rate-distortion improvement shows that the bi-directional prediction does not offer much compression efficiency in the finer partition modes (smaller than 8x8). That is, the two uni-directional predictions are sufficient.
The statistical goodness-of-fit test reveals that the two sets of prediction error generated by the two uni-directional predictions tend to be jointly Laplacian distributed.
The coding parameters of the intra prediction mode in both CGS and spatial scalabilities: Our proposed algorithm achieves a 63% complexity reduction on average, a nearly 3x speed-up.
The optimal intra mode selected by the enhancement coding layer is usually the one chosen by the base layer or its adjacent modes.
The values of rate and distortion in the Lagrangian cost statistically have the log-linear relationship between the coding layers.
The coding parameters of inter-prediction macroblock partition mode and inter-layer predictions in the combined CGS and temporal scalability: Our proposed method achieves 84% time saving in overall encoding process (almost 6x speed-up) and 20x speed-up in encoding the enhancement layers.
One CGS enhancement layer requires more than 200% computations, as compared to that of the base layer. The additional evaluation is due to the selection of encoding parameters for the inter-layer predictions.
Whether an enhancement-layer macroblock in an inter frame should be intra-coded or inter-coded can be determined by referring to the coding type of the reference layer.
Referring to the inter partition mode at the reference layer is effective in reducing the parameter space of the inter block mode. By observing the conditional mode distributions, various look-up tables are designed for use at the enhancement layer. Moreover, the partitions smaller than 8x8 are inefficient.
The selection of the reference frame list is highly correlated between coding layers. Moreover, the motion starting point can be adaptively selected from the motions determined at the base layer or the original motion vector predictor.
The coding efficiency of the inter-layer residual prediction is limited when the base layer is in poor quality.
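The speed-up factors quoted in this section follow from the time savings if the saving is taken as the fraction of the reference encoding time that is removed. The sketch below makes this arithmetic explicit under that assumption; the function names are hypothetical. It also shows one reading of the 120% figure: if the bi-directional prediction adds 120% on top of a structure without it, it accounts for 120/220 of the total.

```python
def speedup_from_saving(saving_percent: float) -> float:
    """Speed-up T_ref / T_fast, assuming saving = (T_ref - T_fast) / T_ref."""
    if not 0.0 <= saving_percent < 100.0:
        raise ValueError("saving must be in [0, 100)")
    return 100.0 / (100.0 - saving_percent)


def saving_from_speedup(speedup: float) -> float:
    """Time saving in percent that corresponds to a given speed-up factor."""
    if speedup < 1.0:
        raise ValueError("speed-up must be at least 1")
    return 100.0 * (1.0 - 1.0 / speedup)


def share_of_added_work(extra_percent: float) -> float:
    """Share (in percent) of the total taken by work that adds
    extra_percent on top of a 100% baseline."""
    return 100.0 * extra_percent / (100.0 + extra_percent)


for saving in (66.0, 63.0, 84.0):
    print(f"{saving:.0f}% saving -> {speedup_from_saving(saving):.2f}x")
print(f"20x speed-up -> {saving_from_speedup(20.0):.0f}% saving")
print(f"share of added 120% -> {share_of_added_work(120.0):.1f}%")
```

Under this definition, savings of 66%, 63%, and 84% correspond to roughly 2.9x, 2.7x, and 6.25x, a 20x speed-up corresponds to a 95% saving, and an added 120% amounts to about 54.5% of the total, which agrees with the statement that the bi-directional prediction consumes more than half of the encoding time.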
Section 1.3 Dissertation Organization
The rest of this dissertation is organized as follows. Chapter 2 introduces the commonly used coding tools adopted by H.264/AVC [4] and the additional inter-layer prediction mechanisms [7] in H.264/SVC [2].
Chapter 3 statistically analyzes the inheritance of the temporal prediction type in the hierarchical block partition and its strong correlation to the finer partition modes, and theoretically constructs a set of thresholds, all of which efficiently avoid unnecessary evaluations. The rates, distortions, and prediction directions of the intra prediction are highly correlated between the coding layers, as examined in Chapter 4. Chapter 5 investigates the consistency of the coding parameters from the base layer to the enhancement layers. Lastly, Chapter 6 concludes this dissertation by summarizing our proposed algorithms and outlining future work.
Chapter 2
Introduction to H.264/MPEG-4 SVC Coding System
The H.264/SVC [2] encoding system [10] provides three modalities of scalability: spatial, quality, and temporal scalability. The spatial and quality scalabilities are realized by adopting the layered approach, in which the bottom coding layer is processed first, followed by the upper coding layers. By taking advantage of the hybrid coding scheme [11], each coding layer inherits the conventional H.264/AVC [4] coding tools to form its basic coding structure, in which the hierarchical-B prediction structure is employed to achieve the temporal scalability and produce bit-streams with different frame rates. Such a layered coding scheme can greatly improve the coding efficiency by exploiting the hierarchical-B prediction structure to remove the temporal redundancy and by using additional inter-layer coding tools to reduce the redundancy between coding layers. However, the improved coding quality comes at a high cost of increased computations, due mainly to the exhaustive search for the optimal coding parameters.
Therefore, in the following sections, we briefly review the main coding tools in H.264/AVC [4] in Section 2.1. Section 2.2 introduces the added inter-layer coding tools with the new syntax elements. Section 2.3 describes the Lagrangian techniques, a commonly used approach to finding the optimal rate-distortion tradeoff for high-efficiency video coding. Finally,
Section 2.1 H.264/MPEG-4 AVC Architecture
H.264/AVC [4] is the latest video coding standard, which is also known as MPEG-4 Part 10 or MPEG-4 Advanced Video Coding (AVC). This state-of-the-art coding standard was developed by the ITU-T Video Coding Experts Group (VCEG) together with the ISO/IEC Moving Picture Experts Group (MPEG) as the Joint Video Team (JVT). The standardization of H.264/AVC [4] was completed in 2003.
2.1.1 Architecture Overview
The block diagram of the H.264/AVC encoder [4] is shown in Fig. 2-1. Its main components are intra prediction, motion estimation, motion compensation, deblocking filter, transform, quantization, and entropy coding. The motion estimation and motion compensation remove the temporal redundancy; the intra prediction, transform, and quantization remove the spatial redundancy; and the entropy coding eliminates the syntax redundancy. In addition, the deblocking filter reduces the blocking artifacts.
The encoding procedure of H.264/AVC [4] can be briefly described as follows. First, an input frame is split into macroblocks of 16x16 pixels. Intra-prediction or inter-prediction (motion estimation) is employed on each block to generate the residual block. These residual data are then transformed and quantized for further entropy coding. In addition, the residual data are also passed through the reconstruction loop, including the de-quantization, inverse transform, and deblocking filter, in order to reconstruct the reference frame for motion estimation.
[Figure: block diagram from Video Input to Bitstream Output, comprising Intra Prediction, Motion Estimation, Motion Compensation, Intra/Inter Mode Decision, Transform & Quantization, Inverse Quantization & Inverse Transform, Deblocking Filter, Picture Buffering, and Entropy Coding]
Fig. 2-1 H.264/AVC encoder structure [4]
2.1.2 Basic Coding Tools
With its high-efficiency coding tools, H.264/AVC [4] outperforms the earlier MPEG-4 and H.263 standards, providing better compression of video images. Because of its superior performance, H.264/AVC [4] is becoming the worldwide digital video standard for consumer electronics and personal computers. In the following, we briefly review the concepts of the main coding tools in the H.264/AVC standard [4].
2.1.2.1 Intra Prediction
The correlation between neighboring areas within a video frame is remarkably high. By using the intra prediction, the spatial redundancy of the neighboring region can be reduced. In H.264/AVC [4], the basic intra-prediction element of luminance samples is a 4x4, 8x8, or 16x16 block, and the basic intra-prediction element of chrominance samples is an 8x8 block. The intra prediction for a macroblock generates the prediction values from the borders of its adjacent blocks (top-left, top, top-right, and left).
[Fig. 2-2: a 4x4 block with samples a to p, its previously reconstructed neighboring samples A to M, and the directional intra 4x4 prediction modes (modes 0, 1, 3, 4, 5, 6, 7, and 8 are labeled)]
 0  1  4  5
 2  3  6  7
 8  9 12 13
10 11 14 15
Fig. 2-3 The scan order of sixteen 4x4 sub-blocks
When a macroblock selects the intra 4x4 prediction for encoding, it is divided into sixteen 4x4 blocks. For each 4x4 sub-block, it chooses one of the nine modes, including eight directional prediction modes and the DC mode, to obtain high compression efficiency. As illustrated in Fig. 2-2, the pixels labeled A to M are the pixels of the adjacent blocks that have previously been encoded and reconstructed to form a prediction reference. The sixteen 4x4 blocks in a macroblock are processed in the pre-defined scan order as shown in Fig. 2-3. Moreover, the procedure of intra 8x8 prediction is similar to that of intra 4x4 prediction.
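The mode selection for a single 4x4 sub-block can be sketched as follows. This is only an illustrative sketch: it covers three of the nine modes (vertical, horizontal, and DC), it assumes that all neighboring samples are available, and it uses the sum of absolute differences (SAD) as a stand-in for the actual rate-distortion cost. The function names are hypothetical.

```python
# Sketch of intra 4x4 prediction for modes 0 (vertical), 1 (horizontal), 2 (DC).
# Assumption: the top samples A..D and left samples I..L are all available.

def predict_intra4x4(mode, top, left):
    """Return a 4x4 prediction block as a list of rows."""
    if mode == 0:   # vertical: every row repeats the samples above the block
        return [list(top) for _ in range(4)]
    if mode == 1:   # horizontal: every row repeats the sample to its left
        return [[left[r]] * 4 for r in range(4)]
    if mode == 2:   # DC: rounded mean of the eight neighboring samples
        dc = (sum(top) + sum(left) + 4) >> 3
        return [[dc] * 4 for _ in range(4)]
    raise ValueError("only modes 0, 1 and 2 are sketched here")

def sad(block, pred):
    """Sum of absolute differences between a block and its prediction."""
    return sum(abs(b - p) for rb, rp in zip(block, pred) for b, p in zip(rb, rp))

def best_mode(block, top, left):
    """Pick the sketched mode with the smallest SAD (a simplified cost)."""
    costs = {m: sad(block, predict_intra4x4(m, top, left)) for m in (0, 1, 2)}
    return min(costs, key=costs.get), costs

# A block of vertical stripes is predicted perfectly by the vertical mode.
stripes = [[10, 20, 30, 40] for _ in range(4)]
mode, costs = best_mode(stripes, top=[10, 20, 30, 40], left=[10, 10, 10, 10])
```

For the striped block above, the vertical mode reproduces the block exactly, whereas the horizontal and DC modes leave a non-zero prediction error.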
single operation by selecting the intra 16x16 prediction mode. As shown in Fig. 2-4, there are four available prediction modes, which are vertical, horizontal, DC, and plane mode.
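[Figure: the four intra 16x16 prediction modes: 0 (vertical), 1 (horizontal), 2 (DC, Mean(H+V)), and 3 (plane), where H and V denote the neighboring samples above and to the left of the macroblock]
Fig. 2-4 Prediction modes of intra 16x16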
2.1.2.2 Inter Prediction, Motion Estimation and Motion Compensation
In a high frame rate video sequence, the successive frames in a short time interval are very likely to be similar. Therefore, the inter prediction aims to generate the predicted pixels from previously decoded frames. That is, an intelligent way to reduce the temporal redundancy is to transmit only the difference between successive frames. Such a concept has been widely used in most video compression standards.
2.1.2.2.1 Variable Block-Size Motion Compensation
H.264/AVC [4] adopts a variable block-size macroblock partition to offer better compression quality as compared to the previous video standards. As depicted in Fig. 2-5, each macroblock can be divided into one of the four large block partitions, which are 16x16, 8x16, 16x8, and 8x8. Furthermore, if the 8x8 partition is preferred, each 8x8 block can be decomposed into finer partitions of 8x8, 4x8, 8x4, and 4x4 blocks. The large partitions might be chosen in smooth regions, while the small partitions might work better in texture-complicated areas.
[Figure: macroblock (MB) types 16x16, 16x8, 8x16, and 8x8; 8x8 sub-block types 8x8, 8x4, 4x8, and 4x4; motion vector accuracy of 1/4 pixel (6-tap filter)]
Fig. 2-5 Variable block-size macroblock partition
The partitioned sub-blocks inside an inter-coded macroblock are predicted from regions of the same size in the reference frame. The displacement between these two regions is denoted by the motion vector, which needs to be transmitted to the decoder side. In H.264/AVC [4], the motion accuracy can be up to quarter-pixel resolution for the luma components and one-eighth-pixel resolution for the chroma components. The pixel values at the fractional positions are computed by using interpolation filters, including a six-tap filter and a bilinear filter. The theoretical analysis of the fractional-pixel motion compensation can be found in [12][13].
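The fractional-sample interpolation for luma can be sketched in one dimension as follows. The sketch assumes the widely documented six-tap kernel (1, -5, 20, 20, -5, 1)/32 for half-sample positions and a rounded average for quarter-sample positions; the border handling by sample replication and the function names are simplifications introduced here for illustration.

```python
# 1-D sketch of luma sub-sample interpolation.
# Assumptions: 8-bit samples, borders handled by replicating the edge sample.

SIX_TAP = (1, -5, 20, 20, -5, 1)

def clip255(value):
    """Clip to the 8-bit sample range."""
    return max(0, min(255, value))

def half_sample(row, x):
    """Half-sample value located between row[x] and row[x + 1]."""
    acc = 0
    for k, tap in enumerate(SIX_TAP):
        idx = min(max(x - 2 + k, 0), len(row) - 1)  # replicate borders
        acc += tap * row[idx]
    return clip255((acc + 16) >> 5)                 # divide by 32 with rounding

def quarter_sample(row, x):
    """Quarter-sample value between row[x] and the half-sample to its right."""
    return (row[x] + half_sample(row, x) + 1) >> 1  # bilinear average

ramp = [0, 10, 20, 30, 40, 50, 60, 70]
h = half_sample(ramp, 3)      # between the samples 30 and 40
q = quarter_sample(ramp, 3)   # between 30 and the half-sample
```

On a linear ramp the six-tap filter returns a value midway between the two integer samples (up to the integer rounding), which is the behavior expected from an interpolation filter.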
2.1.2.2.2 Hierarchical-B Prediction Structure with Bi-directional Motion Compensation
With the support of variable block-size macroblock partition, the prediction structure also noticeably affects the coding efficiency. Recently, H.264/AVC [4] and its scalable extension adopt a widely used prediction structure, the hierarchical prediction structure, to offer an excellent coding performance if
Currently, H.264/AVC [4] and its scalable extension [2] can perform the dyadic hierarchical prediction [5][6] in the encoding configurations of their reference software. In this case, a set of successive images is partitioned into groups, so-called Groups of Pictures (GOPs), whose size is typically a power of two. Furthermore, a GOP is composed of one temporal base layer and one or more temporal enhancement layers. For example, if the GOP size is $2^N$, then its structure consists of the temporal base layer and $N$ temporal enhancement layers. The anchor frame of each GOP forms the temporal base layer, which is either an I- or a P-frame. The remaining frames, located at the temporal enhancement layers, are coded as hierarchical-B frames. The B-frames have three candidates in temporal prediction: two uni-directional predictions, namely forward prediction (FW) and backward prediction (BW), and a bi-directional prediction (BI). Moreover, the BI mode is restricted to take the weighted sum of one preceding and one succeeding reference frame for prediction. Such a coding mechanism is referred to as the hierarchical-B prediction structure.
Fig. 2-6 demonstrates an example of encoding frames by the hierarchical-B prediction structure with GOP size = 16. If only the anchor frames I0/P0 are transmitted, the reconstructed sequence at the decoder side has one-sixteenth of the temporal resolution of the input video sequence. By additionally transmitting frames B1, the decoder can reconstruct the frame sequence that has one-eighth of the temporal resolution of the input video. Finally, if the remaining frames
[Figure: two consecutive GOPs of 16 frames each; the anchor frames I0 and P0 form the temporal base layer, and the hierarchical B-frames B1 to B4 form the temporal enhancement layers]
Fig. 2-6 An example of hierarchical-B prediction structure with GOP size = 16
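The dyadic layer assignment illustrated in Fig. 2-6 can be sketched as follows. The sketch assumes that frames are indexed in display order, that every frame whose index is a multiple of the GOP size is an anchor frame, and that layer 0 denotes the temporal base layer; the function names are hypothetical.

```python
# Sketch: temporal layer of each frame in a dyadic hierarchical-B structure.

def temporal_layer(frame_index, gop_size):
    """Layer 0 is the temporal base layer; layer k > 0 holds the B_k frames."""
    n = gop_size.bit_length() - 1
    if gop_size < 1 or (1 << n) != gop_size:
        raise ValueError("the GOP size must be a power of two")
    pos = frame_index % gop_size
    if pos == 0:
        return 0                                   # anchor frame (I or P)
    trailing_zeros = (pos & -pos).bit_length() - 1
    return n - trailing_zeros

def decodable_frames(max_layer, gop_size, num_frames):
    """Display indices that remain when only layers 0..max_layer are decoded."""
    return [i for i in range(num_frames)
            if temporal_layer(i, gop_size) <= max_layer]

layers = [temporal_layer(i, 16) for i in range(17)]
base_only = decodable_frames(0, 16, 33)   # 1/16 of the input frame rate
with_b1 = decodable_frames(1, 16, 33)     # 1/8 of the input frame rate
```

With a GOP size of 16, decoding the base layer alone keeps every sixteenth frame, and each additional temporal layer doubles the frame rate.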
Moreover, the coding efficiency of the hierarchical-B prediction structure is highly dependent on the assignment of the quantization parameters (QP) to the temporal layers [8]. Given the GOP size and an initial quantization parameter, the quantization parameter for each temporal layer is determined by
(2.1)
With this assignment strategy, the temporal scalability achieved by the dyadic hierarchical-B prediction provides a high compression quality. In comparison to the commonly adopted IBBP and IPPP coding structures, the Y-PSNR is improved on average by more than 1.0 dB and 2.0 dB, respectively. Moreover, in this structure, one reference frame per reference list is sufficient to obtain a high performance. Empirically, the maximum coding efficiency occurs when the GOP size is between 8 and 32, as reported in [9].
For the theoretical analysis of the B-frame and its related works, [14]–[20] provide the detailed explanations.
2.1.2.3 Transform, Scaling, and Quantization
Similar to previous video coding standards, H.264/AVC [4] makes use of block transform coding of the prediction error to remove the inter-pixel redundancy. Moreover, the concept of adaptive block-size transforms (ABT) is adopted for improving the subjective and objective quality. The ABT can apply transforms to block sizes of 4x4, 4x8, 8x4, and 8x8 pixels. The basic transform coding process is very similar to that of previous standards. At the encoder, this process is composed of a forward transform, zig-zag scanning, scaling, and rounding as the quantization process, followed by the entropy coding. At the decoder, the inverse of the encoding process is performed, except for the rounding.
The transform coding in H.264/AVC [4] uses a separable 2D transform to process the 2D block signal, which is written as
$$Y = T X T^{\mathsf{T}} \tag{2.2}$$
where $X$ denotes an $N \times N$ matrix representing the prediction error of $N$ pixels and $N$ lines, $T$ is the transform matrix, and $Y$ is the transform coefficient matrix. The transform matrix consists of a set of orthogonal basis functions. To minimize the computational complexity of transform coding, H.264/AVC [4] requires the transform to be computed exactly in integer arithmetic, which also avoids inverse-transform mismatch problems. In addition, the designed transform matrix should be close to the statistically optimal Karhunen-Loeve transform (KLT), which the discrete cosine transform (DCT) [21] approximates well [22]. Hence, according to these two constraints, a 4x4 transform [23] and an
8x8 transform [24] are specified by
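$$T_{4\times4} = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 2 & 1 & -1 & -2 \\ 1 & -1 & -1 & 1 \\ 1 & -2 & 2 & -1 \end{bmatrix} \tag{2.3}$$
and
$$T_{8\times8} = \begin{bmatrix} 8 & 8 & 8 & 8 & 8 & 8 & 8 & 8 \\ 12 & 10 & 6 & 3 & -3 & -6 & -10 & -12 \\ 8 & 4 & -4 & -8 & -8 & -4 & 4 & 8 \\ 10 & -3 & -12 & -6 & 6 & 12 & 3 & -10 \\ 8 & -8 & -8 & 8 & 8 & -8 & -8 & 8 \\ 6 & -12 & 3 & 10 & -10 & -3 & 12 & -6 \\ 4 & -8 & 8 & -4 & -4 & 8 & -8 & 4 \\ 3 & -6 & 10 & -12 & 12 & -10 & 6 & -3 \end{bmatrix} \tag{2.4}$$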
Both of these two transforms are division free and can be implemented in a butterfly structure by using additions, bit-shift operations, and a few multiplications.
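The equivalence between the matrix form and a butterfly implementation can be sketched for the 4x4 case as follows. The matrix used here is the commonly documented 4x4 core transform of H.264/AVC and is assumed to be the one meant by Eq. (2.3); the scaling that is normally merged into the quantization stage is omitted, and the function names are hypothetical.

```python
# Sketch: 4x4 forward core transform, matrix form versus butterfly form.
# Assumption: C4 is the commonly documented H.264/AVC 4x4 core matrix.

C4 = [
    [1, 1, 1, 1],
    [2, 1, -1, -2],
    [1, -1, -1, 1],
    [1, -2, 2, -1],
]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(m):
    return [list(col) for col in zip(*m)]

def forward_matrix(x):
    """Y = C4 * X * C4^T computed with plain matrix products."""
    return matmul(matmul(C4, x), transpose(C4))

def butterfly_1d(v):
    """1-D transform using only additions, subtractions, and shifts."""
    s0, s1 = v[0] + v[3], v[1] + v[2]
    d0, d1 = v[0] - v[3], v[1] - v[2]
    return [s0 + s1, (d0 << 1) + d1, s0 - s1, d0 - (d1 << 1)]

def forward_butterfly(x):
    """Apply the 1-D butterfly to every row and then to every column."""
    rows = [butterfly_1d(r) for r in x]
    cols = [butterfly_1d(c) for c in transpose(rows)]
    return transpose(cols)

block = [[5, 11, 8, 10], [9, 8, 4, 12], [1, 10, 11, 4], [19, 6, 15, 7]]
same = forward_butterfly(block) == forward_matrix(block)
flat = forward_matrix([[1] * 4 for _ in range(4)])
```

Both routines return identical coefficients, and a flat block concentrates all of its energy in the DC coefficient, which illustrates the de-correlation property discussed above.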
Furthermore, H.264/AVC [4] employs a hierarchical transform [25] to achieve a better inter-pixel de-correlation; that is, the DC coefficients of the adjacent 4x4 transform blocks are grouped into another 4x4 block and transformed again by a second-level transform.
After the ABT, the quantization process is the step that introduces signal loss to remove the psychovisual redundancy. For a given quantization parameter (QP) value, the encoder performs quantization and scaling, the details of which can be found in [23][26]. The quantization parameter QP, which ranges from 0 to 51, determines the quantization step size $Q_{step}$ used for quantizing the transform coefficients. The quantization parameter and the quantization step size are related by
$$Q_{step}(QP) = Q_{step}(QP \bmod 6) \cdot 2^{\lfloor QP/6 \rfloor}. \tag{2.5}$$
These values are formulated so that an increase of unity in QP means an increase of $Q_{step}$ by approximately 12%. It can also be noticed that an increase of unity in QP roughly reduces the coding bit-rate by 12%.
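The relation between the quantization parameter and the step size can be illustrated numerically as follows. The sketch assumes the six base step sizes that are commonly tabulated for H.264/AVC (0.625 for QP = 0 up to 1.125 for QP = 5), with the step size doubling for every increase of 6 in QP; the function name is hypothetical.

```python
# Sketch: quantization step size as a function of QP.
# Assumption: the commonly tabulated base step sizes for QP = 0..5.

QSTEP_BASE = (0.625, 0.6875, 0.8125, 0.875, 1.0, 1.125)

def qstep(qp):
    """Step size for an integer QP in the range 0..51."""
    if not 0 <= qp <= 51:
        raise ValueError("QP must lie in 0..51")
    return QSTEP_BASE[qp % 6] * (1 << (qp // 6))

# Average growth per unit QP over one period of six values.
growth_per_qp = (qstep(6) / qstep(0)) ** (1.0 / 6.0)
```

The average growth factor per unit of QP is 2^(1/6), about 1.12, which agrees with the increase of approximately 12% stated above.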
2.1.2.4 In-loop Deblocking Filter
In encoding the successive frames by H.264/AVC [4], two coding tools are the main causes of the blocking artifacts [27]. The more significant one is the block-based transform coding, where the coarse quantization usually introduces visual discontinuities at the block boundaries. The other is the motion-compensated prediction, where the reference blocks are usually copied from interpolated data from different areas of different reference frames. The copying process carries the existing edges into the interior of the block to be compensated. Although the small-size transform in H.264/AVC [4] can marginally reduce this phenomenon, a deblocking filter is still an advantageous coding tool to maximize the coding performance.
Two common schemes can integrate the deblocking filter into video coding standards: post filters and in-loop deblocking filters. The post filter operates at the display stage and lies outside of the coding loop. Moreover, it needs an additional frame buffer to pass the filtered frames to the display device. On the other hand, the so-called in-loop deblocking filter is applied within the coding loop to generate the filtered frames as the reference for motion compensation.
Employing the filtering inside the coding loop is superior to the post filtering for the reasons listed below.
The in-loop deblocking filter can preserve a certain level of quality in the coded frames.
As compared to the post filter, no extra frame buffer is needed while decoding. The in-loop deblocking filter checks the edge strength macroblock by macroblock, and the filtered data are directly stored in the reference frame buffer.
The empirical experiments demonstrate that the in-loop deblocking filter can improve the objective and subjective quality.
Table 2-1 Determination of Boundary-Strength [27]
Block modes and coding parameters                               BS
One of the blocks is intra and the edge is a macroblock edge    4
One of the blocks is intra                                      3
One of the blocks has coded residuals                           2
Difference of block motion >= 1 luma sample distance            1
Motion compensation from different reference frames             1
Else                                                            0
A Boundary-Strength (BS) parameter, ranging from 0 to 4, is assigned to each edge between two adjacent 4x4 luminance pixel blocks. Depending on the coding modes and the coding parameters of these two blocks, Table 2-1 determines how the BS value is obtained. Then, according to the BS value, the in-loop deblocking filter detects and analyzes the blocking artifacts and attenuates them by employing a selected filter. Further information on the filter design concept can be found in [28]–[40].
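The decision order of Table 2-1 can be sketched as follows. This is a simplified illustration: each 4x4 block is assumed to carry a single motion vector stored in quarter-sample units and a single reference frame, and the motion threshold is assumed to be one luma sample (four quarter-sample units), as the filter is commonly described. The class and function names are hypothetical.

```python
# Sketch: boundary-strength decision for the edge between blocks p and q.
# Assumptions: one motion vector (quarter-sample units) and one reference
# frame per 4x4 block; a motion threshold of one luma sample.

from dataclasses import dataclass

@dataclass
class Block4x4:
    is_intra: bool
    has_coded_residual: bool
    mv: tuple        # (mv_x, mv_y) in quarter-sample units
    ref_frame: int

def boundary_strength(p, q, on_macroblock_edge):
    if p.is_intra or q.is_intra:
        return 4 if on_macroblock_edge else 3
    if p.has_coded_residual or q.has_coded_residual:
        return 2
    if abs(p.mv[0] - q.mv[0]) >= 4 or abs(p.mv[1] - q.mv[1]) >= 4:
        return 1     # motion differs by at least one luma sample
    if p.ref_frame != q.ref_frame:
        return 1     # prediction from different reference frames
    return 0         # edge is not filtered

intra = Block4x4(True, False, (0, 0), 0)
inter = Block4x4(False, False, (0, 0), 0)
moved = Block4x4(False, False, (4, 0), 0)
bs_values = (boundary_strength(intra, inter, True),
             boundary_strength(inter, moved, False),
             boundary_strength(inter, inter, False))
```

The conditions are tested from the strongest to the weakest, so the first matching row of the table determines the value.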
2.1.2.5 Entropy Coding
In H.264/AVC [4], two methods of entropy coding are supported to remove the coding redundancy in representing the transmitted signal. The simpler entropy coding method adopts a single infinite-extent codeword table for all syntax elements except the quantized transform coefficients. The chosen codeword table is an Exp-Golomb code, which is very simple and regular to decode. When coding the residual data, a block of transform coefficients is mapped into a 1D sequence using a pre-defined scanning pattern, such as the zig-zag scan.
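The regular structure of the Exp-Golomb code can be illustrated with the following sketch. It covers the zero-order unsigned code and the usual signed mapping, in which positive values map to odd code numbers and non-positive values to even ones. Bit strings are used instead of a real bit writer for readability, and the function names are hypothetical.

```python
# Sketch: zero-order Exp-Golomb coding with bit strings.

def ue(code_num):
    """Unsigned code: (n - 1) zeros followed by the n-bit value code_num + 1."""
    if code_num < 0:
        raise ValueError("code_num must be non-negative")
    value = code_num + 1
    return "0" * (value.bit_length() - 1) + format(value, "b")

def se(value):
    """Signed code: map v > 0 to 2v - 1 and v <= 0 to -2v, then apply ue()."""
    return ue(2 * value - 1 if value > 0 else -2 * value)

def decode_ue(bits, pos=0):
    """Decode one unsigned codeword; return (code_num, next position)."""
    zeros = 0
    while bits[pos + zeros] == "0":
        zeros += 1
    end = pos + 2 * zeros + 1
    return int(bits[pos + zeros:end], 2) - 1, end

codewords = [ue(n) for n in range(5)]
stream = "".join(ue(n) for n in (3, 0, 7))
first, nxt = decode_ue(stream)
second, nxt = decode_ue(stream, nxt)
third, nxt = decode_ue(stream, nxt)
```

The decoder only has to count the leading zeros to know the codeword length, which is why no stored table is required.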
For representing the quantized transform coefficients, a more efficient method called Context-Adaptive Variable Length Coding (CAVLC) is employed. The VLC tables for various syntax elements are switched depending on prior coded syntax elements. Since the VLC tables are designed to match the corresponding conditional statistics, the entropy coding performance is improved in comparison to those using a single VLC table.
In the CAVLC process, the following items are coded in a proper order: the number of nonzero coefficients, the sign marks of the trailing ones, the levels of the remaining nonzero coefficients, the total number of zeros, and the runs of zeros between nonzero coefficients. In the encoding process, these coefficients are scanned in the reversed zig-zag order before coding.
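The items listed above can be extracted from a quantized 4x4 block as sketched below. The sketch only derives the symbol values; the mapping of these symbols to the context-dependent VLC tables is not reproduced. It assumes the usual 4x4 zig-zag order and the usual limit of three trailing ones, and the function name is hypothetical.

```python
# Sketch: deriving the CAVLC symbols of a quantized 4x4 block.
# Assumptions: standard 4x4 zig-zag order, at most three trailing ones.

ZIGZAG_4X4 = [0, 1, 4, 8, 5, 2, 3, 6, 9, 12, 13, 10, 7, 11, 14, 15]

def cavlc_symbols(block):
    scanned = [block[i // 4][i % 4] for i in ZIGZAG_4X4]
    positions = [i for i, c in enumerate(scanned) if c != 0]
    if not positions:
        return {"total_coeff": 0, "trailing_ones": [], "levels": [],
                "total_zeros": 0, "runs": []}
    reverse = [scanned[i] for i in reversed(positions)]   # high to low frequency
    t1 = 0
    while t1 < min(3, len(reverse)) and abs(reverse[t1]) == 1:
        t1 += 1
    # Zeros immediately preceding each nonzero coefficient, in reverse order;
    # the run of the lowest-frequency coefficient is implied by total_zeros.
    runs = [positions[k] - positions[k - 1] - 1
            for k in range(len(positions) - 1, 0, -1)]
    return {"total_coeff": len(positions),
            "trailing_ones": reverse[:t1],        # only their signs are coded
            "levels": reverse[t1:],
            "total_zeros": positions[-1] + 1 - len(positions),
            "runs": runs}

example = [[0, 3, -1, 0],
           [0, -1, 1, 0],
           [1, 0, 0, 0],
           [0, 0, 0, 0]]
symbols = cavlc_symbols(example)
```

For the example block, the zig-zag scan yields 0, 3, 0, 1, -1, -1, 0, 1 followed by zeros, so five coefficients are nonzero and the last three of them are trailing ones.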
The efficiency of entropy coding can be further improved if the Context-Adaptive Binary Arithmetic Coding (CABAC) is used [41]. Arithmetic codes [42] allow the assignment of a non-integer number of bits to each symbol of an alphabet, which most easily resolves the inter-symbol redundancy. This property is extremely beneficial for symbol probabilities greater than 0.5.
In H.264/AVC [4], the arithmetic coding core engine consists of three elementary steps: binarization, context modeling, and binary arithmetic coding. The binarization uniquely maps a given non-binary valued syntax element to a binary sequence. Another important feature of CABAC, the context modeling, estimates conditional probabilities based on the statistics of previously coded syntax elements. Then, these conditional probabilities are used for switching between several estimated probability models. Finally, the arithmetic coding core engine and its estimated probability model are specified as multiplication-free and low-complexity methods by using only shifts and table lookups. As compared to CAVLC, CABAC typically provides a bit-rate saving of between 5% and 15%. More details on CABAC can be found in [41].
Section 2.2 Additional Coding Tools in H.264/SVC
To have a better understanding of our coding algorithms, this section explains the basic concepts of H.264/SVC [2] and its coder control. Some degree of familiarity with H.264/AVC [4] is assumed
extension [2].
2.2.1 Overview of Layered Coding Structure
In order to support the spatial, temporal, and fidelity (SNR) scalabilities, H.264/SVC [2] encodes a video sequence into a layer-dependent set of scalable layers. Along the temporal axis, a group of pictures (GOP) is decomposed into a temporal base layer and one or more temporal enhancement layers in a nested, hierarchical fashion. Frames belonging to a lower temporal layer are coded independently of the higher temporal layers. For the applications that require lower temporal frame rates, only the frames that constitute the needed lower layers are decoded. In principle, the temporal frame rate (temporal prediction structure) does not have to be dyadic. The prediction structure can be modified as needed and can vary over time to support irregular, non-dyadic scalability. In this chapter, however, we consider only the dyadic temporal scalability case so that we can use the current release of the JSVM software [10].
In the spatial dimension, H.264/SVC [2] adopts the conventional approach of image pyramid to represent a source video sequence at various spatial resolutions [43]. The spatial encoding process begins with a multi-resolution decomposition of the original high-resolution sequence. The lowest-resolution sequence is coded by H.264/AVC [4] as the base layer, and each higher resolution sequence is coded sequentially as a spatial enhancement layer. A specified spatial resolution image is reconstructed at the decoder when all its designated layers are received. A similar philosophy is carried over to facilitate the quality (SNR) scalability. In this scalability mode, the base layer and the
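enhancement layers have identical spatial resolutions, but different quantization step sizes.
[Figure: decision tree over the syntax elements base_mode_flag, mb_type, motion_prediction_flag, and residual_pred_flag. With base_mode_flag = 1, the mode is IntraBL (inter-layer intra-prediction) if the base-layer mode is intra, and BLSkip otherwise (MB partition, mv, and ref. index derived from the base layer). With base_mode_flag = 0, mb_type is transmitted, and motion_prediction_flag selects whether the mvp and ref. index are derived from the base layer (1) or the mv and ref. index are derived from the enhancement layer (0). residual_pred_flag = 1 enables the inter-layer residual prediction.]
Fig. 2-7 Syntax elements and their combinations for the inter-layer prediction in the coarse-grain quality scalability (CGS) [2][7]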
2.2.2 Inter-Layer Prediction Tools
To achieve high coding efficiency, H.264/SVC [2] has an adaptive inter-layer prediction mechanism [7], which allows as much decoded information of the reference/base layer as possible to be reused for the enhancement layers. H.264/SVC [2] adopts three inter-layer prediction tools: inter-layer motion prediction, inter-layer residual prediction, and inter-layer intra-prediction.
Despite certain restrictions, these inter-layer prediction tools can be combined to form a number of coding modes for each enhancement-layer macroblock. Fig. 2-7 shows all possible combinations of the base_mode_flag, motion_prediction_flag, and residual_prediction_flag, as well as their associated coding modes. The detailed information is described in the following subsections.
2.2.2.1 Inter-Layer Motion Prediction
To avoid repeatedly sending the same motion parameters in the cases when the enhancement layer cannot benefit from motion refinement, a flag (base_mode_flag) can be sent for each non-skipped macroblock to indicate whether its motion parameters (partition mode, reference indices, and motion vectors) are to be inferred from the reference/base layer. In the other cases when it is more efficient to change the macroblock mode but leave most of the other parameters unchanged, another flag (motion_prediction_flag) can be additionally sent for a reference picture list to signal whether the reference frame index and motion vector are predicted from the reference/base layer.
For CGS/spatial enhancement layers, H.264/SVC [2] has the syntax element base_mode_flag to signal a new macroblock type, termed as BLSkip mode, in which only the residual signal needs to be transmitted. When the base_mode_flag is equal to 1 and the reference-layer macroblock is inter-coded, the enhancement-layer macroblock is also inter-coded. In this case, the partition mode of the enhancement-layer macroblock with the associated reference indices and motion vectors are derived from the information of the co-located reference-layer block.
In addition, the (scaled) motion vectors of the co-located reference-layer block can serve as the motion vector predictor for the enhancement-layer macroblock if it is conventionally inter-coded. In this case, the syntax element motion_prediction_flag is set to 1 and the reference indices of the co-located reference-layer block are reused.
2.2.2.2 Inter-Layer Residual Prediction
To enhance the coding efficiency of inter-coded macroblocks within the framework of single-loop decoding, the residual prediction, which subtracts the residual signal of the reference/base layer from that of the enhancement layer, can be adaptively activated by the residual_prediction_flag. The inter-layer residual prediction can be applied to all inter-coded macroblocks in each coding layer.
When the residual_prediction_flag is set to 1, the residual signal from the corresponding reference-layer block (up-sampled by using a bilinear filter [43]) is used as the prediction of the residual signal of the enhancement-layer macroblock.
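The effect of the residual prediction can be illustrated with the following sketch. It assumes the CGS case, in which both layers have the same resolution so that no up-sampling is needed, and it uses the sum of absolute values as a simple stand-in for the rate-distortion cost that an actual encoder would evaluate. The function names and sample values are hypothetical.

```python
# Sketch: inter-layer residual prediction for one block in the CGS case.
# Assumptions: equal resolution in both layers; the flag is chosen by a
# simple energy comparison instead of a full rate-distortion test.

def abs_sum(block):
    return sum(abs(v) for row in block for v in row)

def residual_prediction(enh_residual, base_residual):
    """Return (residual_prediction_flag, signal that remains to be coded)."""
    diff = [[e - b for e, b in zip(row_e, row_b)]
            for row_e, row_b in zip(enh_residual, base_residual)]
    if abs_sum(diff) < abs_sum(enh_residual):
        return 1, diff          # the base-layer residual is a useful predictor
    return 0, enh_residual      # code the enhancement residual directly

enh = [[4, -3], [2, 1]]
flag_good, coded_good = residual_prediction(enh, [[3, -2], [2, 0]])
flag_poor, coded_poor = residual_prediction(enh, [[-4, 3], [0, 0]])
```

When the base-layer residual resembles the enhancement-layer residual, only a small difference signal remains; when it does not, the prediction is switched off for that macroblock.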
2.2.2.3 Inter-Layer Intra Texture Prediction
To provide a better prediction for the enhancement-layer samples, especially for fast-motion sequences, the reconstructed samples of the reference/base layer can be used as an alternative prediction source. However, the texture prediction is available only when the co-located macroblock is an intra-coded macroblock with constrained intra prediction, because the single-loop structure prohibits the reference/base layer from conducting motion compensation after it has been coded.
The inter-layer intra-prediction occurs when an enhancement-layer macroblock is coded with the base_mode_flag equal to 1 and the co-located reference-layer block is intra-coded. The enhancement-layer prediction signal is the reconstructed intra-signal (up-sampled by one-dimensional four-tap FIR filters applied horizontally and vertically [43]).
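The combinations of the three flags described in this section can be summarized with the following sketch, which is one reading of Fig. 2-7 rather than a normative decoding procedure. It covers only the case in which a macroblock coded with base_mode_flag equal to 0 is inter-coded, and the function name and the returned descriptions are hypothetical.

```python
# Sketch: coding mode of an enhancement-layer macroblock from the
# inter-layer prediction flags (a simplified reading of Fig. 2-7).

def inter_layer_mode(base_mode_flag, ref_layer_is_intra,
                     motion_prediction_flag=0, residual_prediction_flag=0):
    if base_mode_flag:
        if ref_layer_is_intra:
            # Inter-layer intra-prediction from the reconstructed base layer.
            return {"mode": "IntraBL", "motion": None,
                    "residual_prediction": False}
        # Partition, motion vectors, and reference indices are inferred.
        return {"mode": "BLSkip", "motion": "derived from the base layer",
                "residual_prediction": bool(residual_prediction_flag)}
    # base_mode_flag == 0: mb_type is transmitted (inter-coded case only).
    motion = ("mvp and ref. index from the base layer"
              if motion_prediction_flag
              else "mv and ref. index from the enhancement layer")
    return {"mode": "mb_type", "motion": motion,
            "residual_prediction": bool(residual_prediction_flag)}

intra_bl = inter_layer_mode(1, True)
bl_skip = inter_layer_mode(1, False, residual_prediction_flag=1)
refined = inter_layer_mode(0, False, motion_prediction_flag=1)
```

An exhaustive encoder evaluates all of these combinations for every enhancement-layer macroblock, which is the source of the additional complexity discussed in this dissertation.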
Section 2.3 Rate-Constrained Coder Control in H.264/AVC-Based Video Standards
In most video coding standards, including MPEG-2 Visual [44], H.263 [45], MPEG-4 Visual [46], and H.264/AVC [4] as well as its scalable coding standard [2], the specifications define only the syntax structure of the bit-stream and the decoding process. However, the operational control of the video encoder is an important issue in video compression [47].
The bit allocation can resolve the problem in the operational control for efficient coding. With a motion-compensated hybrid coder, the bit-rate represents the total consumed bits, consisting of the motion vectors, the residual data, and additional side information. Those transmitted data are divided into $N$ independent bit-streams with bit-rates $R_i$ ($i = 1, \ldots, N$), which yields the overall rate
$$R = \sum_{i=1}^{N} R_i. \tag{2.6}$$
At the decoder, there exists a distortion $D$ between the original frame and the reconstructed frame. We assume that the distortion-rate function $D(R_1, \ldots, R_N)$ is strictly convex and differentiable everywhere, and that
$$\frac{\partial D}{\partial R_i} < 0, \quad i = 1, \ldots, N. \tag{2.7}$$
That is, increasing the rate of any one of the bit-streams should decrease the distortion. Hence, the optimum bit allocation that minimizes $D$ subject to a fixed overall rate $R$ can be found by letting
$$dD = \sum_{i=1}^{N} \frac{\partial D}{\partial R_i}\, dR_i = 0. \tag{2.8}$$
Moreover, from Eq. (2.6), we derive
$$dR = \sum_{i=1}^{N} dR_i = 0. \tag{2.9}$$
Furthermore, from Eq. (2.8) and Eq. (2.9), we can find the optimum bit allocation condition
$$\frac{\partial D}{\partial R_1} = \frac{\partial D}{\partial R_2} = \cdots = \frac{\partial D}{\partial R_N}. \tag{2.10}$$
This result provides the interpretation that we should add bits to the bit-stream with the smallest $\partial D / \partial R_i$, that is, the most negative slope. This allocation principle obtains the greatest decrease in distortion.
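The allocation principle can be checked numerically with the following sketch, which repeatedly gives a small rate increment to the bit-stream whose distortion currently decreases fastest. The exponential model D_i(R_i) = sigma_i^2 * 2^(-2 R_i) is an assumption introduced only for this illustration and is not taken from the dissertation; the function names and the variances are hypothetical as well.

```python
# Sketch: greedy bit allocation driven by the distortion-rate slopes.
# Assumption: D_i(R_i) = var_i * 2 ** (-2 * R_i) for every bit-stream.

import math

def slope(variance, rate):
    """dD/dR of the assumed model; always negative."""
    return -2.0 * math.log(2.0) * variance * 2.0 ** (-2.0 * rate)

def allocate(variances, total_rate, step=0.001):
    """Give each increment to the stream with the most negative slope."""
    rates = [0.0] * len(variances)
    for _ in range(round(total_rate / step)):
        i = min(range(len(variances)),
                key=lambda k: slope(variances[k], rates[k]))
        rates[i] += step
    return rates

rates = allocate([16.0, 4.0, 1.0], total_rate=6.0)
slopes = [slope(v, r) for v, r in zip([16.0, 4.0, 1.0], rates)]
```

For this model the greedy procedure converges to rates of about 3, 2, and 1 bits, at which point the three slopes are nearly equal, as required by the equal-slope condition derived above.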
However, the application of this bit allocation scheme to control a hybrid video coder is not straightforward. In the following subsections, the concept of Lagrangian optimization schemes is briefly reviewed. Its application to control a hybrid video coder is then introduced. Particularly, based on the Lagrangian approach, the selection procedures of the temporal prediction types and the best block modes are described in detail.
2.3.1 Optimization Using Lagrangian Schemes
The task of coder control aims to determine a set of coding parameters and produces the corresponding bit-stream such that the optimal coding efficiency can be achieved with a given rate constraint. Recently, the most widely adopted approach is the Lagrangian bit allocation scheme due to its effectiveness and simplicity.
The Lagrangian optimization approach transforms the constrained problem (2.11) into the unconstrained problem (2.12) by introducing a new variable $\lambda$, called the Lagrangian multiplier:
$$\min_{p} D(p) \quad \text{subject to} \quad R(p) \le R_c \tag{2.11}$$
$$\min_{p} J(p), \quad J(p) = D(p) + \lambda R(p) \tag{2.12}$$
where $p$ is the combination of coding options, $R_c$ is the given rate constraint, and $J$ denotes the cost function.
Because the combination of coding options is finite, the discrete optimum solution to this unconstrained problem was introduced in [48].
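The discrete unconstrained search can be sketched as follows. Each coding option is represented by a hypothetical (name, distortion, rate) triple, and the option with the smallest Lagrangian cost is selected; the second function sweeps a list of candidate multipliers until the selected option meets a rate constraint. All numbers and names are illustrative assumptions.

```python
# Sketch: Lagrangian selection over a finite set of coding options.
# Assumption: the (name, distortion, rate) triples are made-up values.

def best_option(options, lam):
    """Return the option that minimizes J = D + lam * R."""
    return min(options, key=lambda o: o[1] + lam * o[2])

def best_under_rate(options, rate_limit, lambdas):
    """Smallest multiplier whose selected option satisfies R <= rate_limit."""
    for lam in sorted(lambdas):
        choice = best_option(options, lam)
        if choice[2] <= rate_limit:
            return lam, choice
    return None

OPTIONS = [("skip", 100, 1), ("16x16", 60, 10), ("8x8", 40, 30), ("4x4", 35, 60)]
picks = [best_option(OPTIONS, lam)[0] for lam in (0, 0.5, 2, 10)]
constrained = best_under_rate(OPTIONS, rate_limit=15, lambdas=[0, 0.5, 2, 10])
```

A small multiplier favors low distortion and therefore fine partitions, whereas a large multiplier favors low rate; sweeping the multiplier traces out the operating points of the encoder.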
2.3.2 Lagrangian Optimization in Hybrid Video Coding
During the video encoding, a variety of coding parameters such as motion vectors, block partition modes, transform coefficient levels and transform block sizes have to be determined. Thus, those coding parameters can be intelligently selected with respect to the rate-distortion efficiency by using the Lagrangian optimization techniques [49]. Minimizing the Lagrangian cost has to proceed over the space of the coding parameters for all blocks in the entire video sequence, introducing a great amount of computation while encoding.
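Typically, for each macroblock, the coding block mode with its associated parameters is optimized given only the decisions of the previously coded blocks. In other words, the optimization problem
$$\min_{p_1, \ldots, p_K} \sum_{k=1}^{K} \left[ D_k(p_1, \ldots, p_K) + \lambda R_k(p_1, \ldots, p_K) \right] \tag{2.13}$$
is simplified to
$$\min_{p_k} \; D_k(p_k \mid p_1^*, \ldots, p_{k-1}^*) + \lambda R_k(p_k \mid p_1^*, \ldots, p_{k-1}^*), \quad k = 1, \ldots, K, \tag{2.14}$$
where $p_k$ denotes the coding parameters of the $k$-th block and $p_1^*, \ldots, p_{k-1}^*$ are the decisions already made for the previously coded blocks. The simplified problem can be easily solved by independently selecting the coding parameter for each block.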
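The block-by-block simplification can be sketched as follows. Each block is decided in coding order, and its rate is allowed to depend on the decisions already taken, but earlier decisions are never revisited. The rate model, in which repeating the mode of the previous block costs fewer bits, is a made-up assumption for illustration, as are the function names and the distortion values.

```python
# Sketch: sequential mode decision conditioned on earlier decisions.
# Assumption: a toy rate model where repeating the previous mode is cheaper.

def rate_bits(mode, previous_modes):
    if previous_modes and previous_modes[-1] == mode:
        return 1
    return 4

def decide_sequentially(block_distortions, lam):
    """block_distortions: one dict {mode: distortion} per block, in coding order."""
    decisions = []
    for distortions in block_distortions:
        best = min(distortions,
                   key=lambda m: distortions[m] + lam * rate_bits(m, decisions))
        decisions.append(best)   # fixed from now on; never re-optimized
    return decisions

blocks = [{"A": 10, "B": 12}, {"A": 20, "B": 17}]
with_rate = decide_sequentially(blocks, lam=2)
distortion_only = decide_sequentially(blocks, lam=0)
```

With a non-zero multiplier, the second block keeps mode A because the cheaper signaling outweighs its slightly higher distortion; with a zero multiplier, each block simply takes its lowest-distortion mode.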