National Chiao Tung University

Department of Electronics Engineering and Institute of Electronics

Doctoral Dissertation

An Algorithm and Its Architecture Design Space Exploration of a Video Embedding Transcoder

Student: Chih-Hung Li

Advisor: Prof. Tihao Chiang


An Algorithm and Its Architecture Design Space Exploration of a Video Embedding Transcoder

Student: Chih-Hung Li

Advisor: Dr. Tihao Chiang

National Chiao Tung University

Department of Electronics Engineering and Institute of Electronics

Doctoral Dissertation

A Dissertation

Submitted to the Department of Electronics Engineering and Institute of Electronics

College of Electrical and Computer Engineering

National Chiao Tung University

in Partial Fulfillment of the Requirements for the Degree of

Doctor of Philosophy in

Electronics Engineering

June 2008

Hsinchu, Taiwan, Republic of China


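An Algorithm and Its Architecture Design Space Exploration of a Video Embedding Transcoder

Student: Chih-Hung Li Advisor: Dr. Tihao Chiang

National Chiao Tung University

Department of Electronics Engineering and Institute of Electronics

ABSTRACT (translated from the Chinese)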

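Video embedding services are increasingly widespread in today's diverse multimedia applications. Because of the sheer volume of multimedia data, most video content today is stored and delivered in compressed form. Among the many compression standards, H.264/AVC has become the mainstream of video compression, so a video embedding transcoder for the H.264/AVC standard is increasingly important for efficient data storage and network transmission. This dissertation addresses the algorithm development and the hardware architecture design exploration of a video embedding transcoder. First, our fast transcoding algorithm for H.264/AVC video embedding is the first video embedding transcoding technique for the H.264/AVC standard reported in the literature. Second, regarding hardware architecture design, we combine the H.264/AVC transcoding and encoding functions in a single hardware architecture at the lowest cost; this is the first design in the literature to realize this technique. Third, regarding design space exploration, this is the first H.264/AVC-related hardware design in the literature that uses transaction level modeling for system performance simulation and analysis.

The first part of this dissertation focuses on the development of a low-complexity algorithm for the multiple-window video embedding transcoder. To overcome the high complexity of the conventional cascaded pixel domain transcoder, we adopt the concept of partial re-encoding to reduce the computation required for transcoding and to lessen the quality degradation of the transcoded video caused by re-quantization. For blocks with prediction mismatch, we use the information in the original compressed bitstream to assist prediction refinement, namely intra mode switching and motion vector re-mapping. In this way we can completely remove the two most complex modules of the encoder: mode decision and motion estimation. For blocks with residue mismatch, we derive from theory and experimental data an efficient way to identify the blocks most in need of error correction. The experimental data show that we only need to target a very [...] complexity algorithm can achieve a PSNR improvement of up to nearly 1.5 dB.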

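The second part of this dissertation focuses on design space exploration. Based on the proposed low-complexity algorithm, we realize it as a platform-based system design and combine the H.264/AVC transcoder and decoder on one platform at the lowest cost. In the proposed high-performance system hardware architecture, we explore several important system parameters: hardware parallelism, data exchange granularity, and design balancing. Unlike the conventional bottom-up design methodology, we adopt a novel top-down, refinement-based design methodology to obtain better exploration efficiency. We mainly use the electronic system level (ESL) for system simulation and exploration. The simulation platform operates mostly at the transaction level (transaction level modeling), whose simulation speed is about three orders of magnitude faster than that of the conventional register transfer level (RTL). Compared with conventional system design, our design therefore offers considerable freedom to optimize for different design constraints. When optimizing for hardware cost, our design alternative can reduce the hardware cost to 25% of the original. When optimizing for speed, our design alternative can double the original speed. At an operating clock of 135 MHz, we can provide real-time transcoding or decoding output for 1920x1088 high-definition video at 60 frames per second.

The third part of this dissertation focuses on the development of low-cost, high-performance hardware modules. We focus on two core blocks: pixel prediction and the deblocking filter. First, we combine the intra and inter prediction of H.264/AVC in a single hardware architecture, which not only increases hardware utilization but also substantially reduces the traffic on the data bus. Second, we propose a deblocking filter with fine-grained synchronization capability, so that the performance of the video pipe improves at every level of data exchange granularity.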

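In summary, this dissertation presents a low-complexity, high-efficiency video embedding transcoder. At the algorithm level, we focus on fast and efficient refinement and correction. In hardware design space exploration, we analyze the impact of each system parameter efficiently and quantitatively, so as to reach an optimized design under different design constraints. In hardware design, we focus on increasing hardware utilization and overall system performance.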


An Algorithm and Its Architecture Design Space

Exploration of a Video Embedding Transcoder

Student: Chih-Hung Li Advisor: Dr. Tihao Chiang

Department of Electronics Engineering & Institute of Electronics

National Chiao Tung University

ABSTRACT

As the H.264/AVC standard gains worldwide adoption, the video embedding service is becoming an important feature. This thesis therefore presents an H.264/AVC multiple-window video embedding transcoder in three parts: an algorithm, a design space exploration of the system architecture, and two novel micro-architectures.

The first part describes an algorithm whose complexity is low compared with that of the traditional cascaded pixel domain transcoder. Partial re-encoding is adopted to reduce both the complexity and the quality degradation caused by re-quantization. Specifically, the intra mode switching and motion vector re-mapping techniques eliminate the need for the mode decision and motion estimation modules. Moreover, according to the theoretical analysis, only 5% of the total blocks need error correction. Compared with the traditional cascaded transcoder, the proposed approach improves quality by up to 1.5 dB and increases throughput by a factor of 25.

The second part describes the architecture of a platform-based video embedding transcoder and its design space exploration, using transaction level modeling, from several system aspects: hardware parallelism, data exchange granularity, and design load balancing. The top-down, refinement-based design methodology provides effective exploration with a high degree of freedom to optimize for various design constraints. Further, it identifies which modules are critical and how they can be optimized in terms of speed, memory, and bandwidth to improve the overall performance. Our best design alternatives can reduce the cost to one quarter or double the speed, so that the design achieves 1920x1088 @ 60 Hz video transcoding at 135 MHz.

The third part describes the two novel micro-architectures designed for the prediction and the deblocking filter modules. The inter and intra prediction are unified in a systolic architecture to improve hardware utilization and reduce the transmission bandwidth. The deblocking filter is implemented with multi-level, fine-grained synchronization to improve the system performance at finer levels of granularity.


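ACKNOWLEDGEMENTS

Over my seven years in graduate school, I have accumulated far too many debts of gratitude. First, I thank my advisor, Prof. Tihao Chiang (蔣迪豪), for his guidance throughout these seven years. He not only kept leading me to the forefront of the field in my research direction, but also helped me develop the ability to solve problems independently and to express ideas clearly. Every discussion with Prof. Chiang was like a brainstorming session that, bit by bit, brought out my potential to move forward. I am also very grateful for the care, encouragement, and help he gave me in every respect over the years.

The outstanding senior and junior members of the laboratory were another driving force behind my progress. 王俊能 was the first to encourage me to pursue a doctoral degree; in my most inexperienced days he patiently revised my broken paper drafts, and when I was most discouraged midway he again gave me the confidence to persevere. 彭文孝 led me into the field of hardware design in the later stage of my doctoral studies and shared his experience in research methods and paper writing. 黃項群's clear, sharp thinking and passion for research have always been an example I admire and learn from. 王世豪, with his practical experience in hardware design and server setup, was someone I often troubled with questions. The kind and warm-hearted 李俊毅 was like a big brother who helped me solve any problem I met in the laboratory. Working with the sharp-minded 治傑 was always a pleasure. To the other seniors, classmates, and juniors, whether graduated or still working hard in the laboratory, 峯誠、家揚、崑健、健霖、耀中、秉玉、沛昀、宗延、鑑明、偉倫、孝強、子良、掁韋、朝雄、志凱、世騫、鴻志、世炘、德宣: thank you for creating together a laboratory that moves me.

I also thank Professors 張隆紋, 黃仲陵, 李鎮宜, 王聖智, 李國君, and 蔡宗漢 for taking time from their busy schedules to serve on my oral defense committee and for giving me the final lesson of my student life; your valuable comments made this dissertation more complete. In particular, Prof. 李國君 patiently offered expert advice on system-level design over the past year and more. I thank all those who have helped me again and again in my academic work. They allow me to regard this degree with greater humility, and I hope this degree will keep driving me to improve.

Finally, I thank my family, especially my mother, whose unconditional support throughout my studies let me do what I wanted without worry. My girlfriend 楓珮 felt, directly and deeply, the torment and pressure I went through while doing research and writing this dissertation. In my low moments she gave me the greatest encouragement and help, and she accompanied me to this important milestone in life. These few words cannot fully express my heartfelt gratitude. Without all of you I might not be who I am today. I also thank heaven for giving me enough luck. I dedicate this dissertation to everyone who cares about me, and I hope my efforts have lived up to your expectations.

Chih-Hung Li
National Chiao Tung University, Hsinchu, Taiwan
July 2008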


CONTENTS

Abstract in Chinese
Abstract
Acknowledgements
Contents
List of Tables
List of Figures
List of Notations

1 Introduction
  1.1 Overview of Dissertation
    1.1.1 Motivation
    1.1.2 1st Focus: Low-Complexity Algorithm Development
    1.1.3 2nd Focus: System Architecture Design Space Exploration
    1.1.4 3rd Focus: Highly Efficient Micro-Architecture Design
  1.2 Organization and Contribution

2 Background and Related Work on Video Embedding Transcoding
  2.1 Introduction
  2.2 Realization of Video Embedding Service
  2.3 Problem Statement of Video Transcoding
  2.4 Wrong Reference Problem Formulation
  2.5 Related Work on Video Embedding Transcoding
    2.5.1 Cascaded Pixel Domain Transcoder (CPDT)
    2.5.2 DCT Domain Transcoding with Motion Vector Re-mapping
    2.5.3 DCT Domain Transcoding with Backtracking
  2.6 The Challenge in H.264/AVC-based PIP Transcoding
  2.7 Summary

3 Low-Complexity Algorithm of MW-VET
  3.1 Introduction
  3.2 Slice-Group-Based Transcoding
  3.3 No Frame Memory Transcoding (NFMT)
    3.3.1 Architecture of NFMT
    3.3.2 Auxiliary Bitstream Generation
  3.4 Reduced Frame Memory Transcoding (RFMT)
    3.4.1 Intra Mode Switching (IMS)
    3.4.2 Motion Vector Remapping (MVR)
    3.4.3 Syntax Level Bypassing (SLB)
  3.5 Simulation Results
  3.6 Summary

4 System Architecture Design Space Exploration
  4.1 Introduction
  4.2 Algorithm to Architecture Mapping
  4.3 Highly Efficient System Architecture
    4.3.1 Memory Hierarchy
    4.3.2 Video Decoding Pipe
    4.3.4 Memory Sub-system
    4.3.5 Task Scheduling
  4.4 Effective Design Space Exploration
  4.5 Pruned Design Space
    4.5.1 Exploration of the Synchronization Granularity
    4.5.2 Exploration of the Design Combination and Balancing
  4.6 Evaluation of the System Performance
    4.6.1 Evaluation Metric
    4.6.2 Simulation Infrastructure
    4.6.3 Pareto-Based Multi-Objective Optimization
  4.7 Simulation Results and Analysis
    4.7.1 Pareto Analysis for the Exploration of Synchronization Granularity
    4.7.2 Pareto Analysis for the Exploration of Design Combination within the Video Pipe
    4.7.3 Area-Weighted Hardware Utilization for the Exploration of Design Balancing of the Video Pipe
  4.8 Summary

5 A Highly Efficient Micro-Architecture Design
  5.1 Introduction
  5.2 An Efficient Memory Sub-System
    5.2.1 Background
    5.2.2 Interleaved Data Arrangement
    5.2.3 External Memory Interface
    5.2.4 Synchronization Buffer
    5.2.5 Effect of Synchronization Granularity of Memory Sub-system on Off-Chip Transmission
  5.3 Combined Inter and Intra Prediction
    5.3.1 Motivation
    5.3.2 Overall Architecture
    5.3.3 Data Flow of the Inter Prediction
  5.4 Efficient Deblocking Filter with Fine-Grained Synchronization Capability
    5.4.1 The Proposed Architecture of Deblocking Filter
    5.4.2 The Proposed Filtering Order with Fine-Grained Synchronization Capability
  5.5 Summary

6 Conclusions
  6.1 Summary of Contributions
    6.1.1 Improvement of Rate-Distortion Performance
    6.1.2 Design Space Exploration
    6.1.3 Improvement of Area-Speed Efficiency
  6.2 Suggestions for Future Works

Bibliography

Appendix

A Why Transform Domain Approaches are Inefficient for H.264 Transcoding
  A.1 Integer Transform with Quantization Scaling
  A.2 Directional Intra Prediction
  A.3 In-the-loop De-blocking Filtering
  A.4 Sub-pixel Interpolation

B Constant-Rate Bumping Process
  B.1 Bumping Process in H.264/AVC

LIST OF TABLES

3.1 The Symbol Definitions
3.2 The Cases of the Intra4 Mode Switching
3.3 The Cases of the Intra16 Mode Switching
3.4 The Corresponding Operations of the RFMT for Each Block Type During the VET Transcoding
3.5 The Detailed Refinement during the VET Transcoding
3.6 The Encoder Parameters for the Experiments
3.7 The Improvement of Execution Time and Quality as Compared to the CPDT(1)
3.8 The Effectiveness of Error Correction (EC) for Different Kinds of p-blocks
4.1 The Transcoding Throughput Performance of Proposed Algorithm of a Software Implementation
4.2 The Transcoding Throughput Performance of Proposed Algorithm of a Software Implementation
4.3 The Bitrate of Each Long Sequence with Different Resolution
4.4 The Required Clock For Each Resolution When the Size of Display Buffer is [...]
5.1 Analysis of the Different Levels of Granularities Based on the Worst Case Assumption
5.2 The Comparison of the Intra Prediction
5.3 Comparison of Inter Prediction
5.4 The Execution Cycle of the Inter and Intra Prediction in All Cases
5.5 The Corresponding Data Path of Each Filtered Edge
A.1 Computational Complexity of Each Intra Prediction Mode For Both Operation Domain

LIST OF FIGURES

1.1 The Applications of the Video Embedding Service: (a) Live TV Program of CBBC. (b) Leatek's WinFast Series Product. (c) Disney Supports "Full Motion, Picture-in-Picture Bonus Features" in Newly Released Blu-Ray DVD. (d) Multiple Stream Media Processor of DiscoveryBiz. (e) TV Setup Box of Allthings. (f) Sony Debuts In-Car Navigation with PIP in Japan. (g) Videoconferencing System of Wirered Company. (h) The PIP Functionality in Viliv X2 AIO. (i) YouTube Overlays Ads during Video Streaming (InVideo). (j) Video Surveillance System.
1.2 One of the Applications of Video Embedding Transcoder: Transcoding Server.
1.3 The Design Flow in This Dissertation
2.1 Illustration of a Novel Transcoder: (a) The Simplified Transcoding Process. (b) The Simplified Transcoder When the Prediction Blocks Are the Same. (c) The Fast Transcoder That Bypasses the Input Transform Coefficients.
2.2 Illustration of the Wrong Reference Problem
2.3 The Architecture of the CPDT
3.1 The Example of Slice Group That Can Be Used in the VET Transcoding
3.3 The Generation of the Auxiliary Bitstream Based on the Reconstruction of a RDO Encoder
3.4 The Initial Architecture of the RFMT with RDO Refinement Based on the Concept of the Partially Re-Encoding
3.5 The Intermediate Architecture of the RFMT with the MVR and the IMS Refinement
3.6 The Final Architecture of the RFMT with Shared Frame Memory and the Constrained FG Bitstreams
3.7 The Transcoding Scheme for the Channel Preview
3.8 The Wrong Intra Reference Problem within a Macroblock Depending on the Intra Modes
3.9 The Relative Position of Each Case in the Intra Mode Switching Technique
3.10 An Example of the Intra Prediction Chain
3.11 Illustration of the Motion Vector Re-mapping Technique. (a) The Original Coding Mode and Motion Vectors. (b) Refinement by Using Inter4x4 Mode and Re-mapped Motion Vectors.
3.12 An Example of the Inter Prediction Chain
3.13 The Flow Chart of the Proposed RFMT
3.14 The Extended Wrong Reference Problem when Multiple Reference Frame is Used
3.15 The Visual Illustration after Each Refinement Step during the VET Transcoding
3.16 The Percentage of the Macroblock Types and the Block Types during the VET Transcoding
3.17 The Rate-Distortion Performance of the Luminance Component When One Foreground Carphone_QCIF is Embedded in Table_SD: (a) Table_SD_Carphone_QCIF_1_1. (b) Table_SD_Carphone_QCIF_33_20.
3.18 The Rate-Distortion Performance of the Luminance Component When One Foreground Foreman_QCIF is Embedded in Mobile_SD: (a) Mobile_SD_Foreman_QCIF_1_1. (b) Mobile_SD_Foreman_QCIF_33_20.
3.19 The Rate-Distortion Performance of the Luminance Component by Four Foregrounds Embedding with the Single-Generation Transcoding: (a) Table_SD_MD_QCIF_1_1_Stefan_QCIF_33_1_Carphone_QCIF_1_20_News_QCIF_33_20. (b) Mobile_SD_MD_QCIF_1_1_Stefan_QCIF_33_1_Carphone_QCIF_1_20_News_QCIF_33_20.
3.20 The Rate-Distortion Performance of the Luminance Component by Four Foregrounds Embedding with the Multi-Generation Transcoding: (a) Table_SD_MD_QCIF_1_1_Stefan_QCIF_33_1_Carphone_QCIF_1_20_News_QCIF_33_20. (b) Mobile_SD_MD_QCIF_1_1_Stefan_QCIF_33_1_Carphone_QCIF_1_20_News_QCIF_33_20.
4.1 The Top Level Data Flow of RFMT and Its System Partitioning
4.2 The Refinement for Each Processing Macroblock and Its Hardware Partitioning which is of Five-Staged Pipeline
4.3 The System Architecture of the Proposed Video Embedded Transcoder and Decoder
4.4 The Subjective Quality Comparison of the 100th Frame: (a) Decoded Picture without Refinement Update (b) Decoded Picture with Refinement Update (c) Transcoded Picture without Refinement Update (d) Transcoded Picture with Refinement Update.
4.5 The Scheduling of the Video Pipe Where P Means the CPU Programs the Individual Module
4.6 The Scheduling of the Architecture A
4.7 The Scheduling of the Architecture B
4.8 The Scheduling of the Architecture C
4.9 The Scheduling of the Architecture D
4.10 The Scheduling of the Architecture E
4.11 The Scheduling of the Architecture F
4.12 The Flow of Our Design Space Exploration
4.13 The Exploration of the Synchronization Granularity
4.14 The SystemC Implementation with the Annotated Timing
4.15 The Proposed H.264/AVC Video Embedding Transcoder and Decoder Imple[...]
4.16 The Pareto Analysis for the Average Execution Cycle Count and Equivalent Gate Count at Different Levels of Synchronization Granularity and for the Different Designs of the Inter and Intra Prediction. The Combination of the Level of Synchronization Granularity to be Explored Includes 16x16_16x16, 16x16_8x8, 16x16_4x4, 8x8_8x8, 8x8_4x4, and 4x4_4x4 where the Terms before and after the Underscore Indicate the GM and the GV, Respectively
4.17 The Combined Pareto Analysis for the Average Execution Cycle Count and Equivalent Gate Count at Different Levels of Synchronization Granularity and for the Different Designs of the Inter and Intra Prediction
4.18 The Execution Cycle of Architecture 16_16_SA3 and The Minimized Clock for Real-Time When the Size of Display Buffer is 32Mb. (a) 240x144. (b) 480x272. (c) 960x544. (d) 1920x1088.
4.19 The Pareto Analysis for the Average Execution Cycle Count and Equivalent Gate Count for Each Alternative where MB and B8 Denote the MB-based and B8-based DB, Respectively
4.20 The Area-Weighted Hardware Utilization for Each Design Alternative
4.21 Normalized Distance to Utopia Point Considering Execution Cycle Count, Hardware Cost, and Area-Weighted Utilization
5.1 The Interleaved Data Arrangement for the Stored Pictures
5.2 The Functional Block Diagram of the External Memory Interface
5.3 The Command FIFO for the Detection of the Row Miss
5.4 The Finite State Machine Designed for the Mobile DDR SDRAM
5.5 The Efficiency of DRAM Access for Motion Compensation. The Efficiency is Defined as (# of Data Read from the External DRAM) / (Actual Data Required for the Motion Compensation)
5.6 The Efficiency of DRAM Access for the Decoder. The Efficiency is Defined as (# of Data Read from the External DRAM) / (Actual Data Required for Decoder)
5.7 The Average Cycle Counts per MB
5.8 The Amount of Data Transfer
5.9 The Power Consumption in DRAM
5.11 The 2-D Interpolation for the Motion Compensation with Sub-Pel Precision. Note that the 2-D Filtering Can Be Separated Into Two 1-D Filtering.
5.12 (a) The Unified Systolic Array for Both Inter and Intra Interpolation. (b) The Block Diagram of Functional Block W. (c) The Weighting Mode of Functional Block W. (d) The Combination Mode of Functional Block W.
5.13 The Configuration of Inter Prediction for Luminance Component
5.14 The Weighting Factor of Each Input for Consecutive Execution of the 6-tap Filtering
5.15 Input Scheduling of the Proposed Systolic Array that Uses Two-Input Broadcasting
5.16 Separated Filterings of the Inter Prediction for Chrominance Component
5.17 The Configuration of Inter Prediction for Chrominance Component
5.18 Intra Prediction by Adaptive Filtering
5.19 The Configuration of Intra Prediction for the Direction Modes. (a) The Two (1, 2, 1) Filters. (b) The (1, 1) and (1, 2, 1) Filter.
5.20 The Progression Property of Plane Mode
5.21 The Configuration of Intra Prediction for the DC Mode and the Plane Mode. (a) The Data Path for Accumulation. (b) The Generation of Gradient Values "b" and "c". (c) The Pixel Prediction of Plane Mode at Odd Cycles. (d) The Pixel Prediction of Plane Mode at Even Cycles.
5.22 The Execution Cycle of the Inter and Intra Prediction for the Blue_Sky Sequence
5.23 The Execution Cycle of the Inter and Intra Prediction for the Pedestrian_Area Sequence
5.24 The Execution Cycle of Inter and Intra Prediction for Riverbed Sequence.
5.25 The Execution Cycle of Inter and Intra Prediction for Rush_Hour Sequence.
5.26 The Execution Cycle of the Inter and Intra Prediction for the Station2 Sequence
5.27 The Execution Cycle of the Inter and Intra Prediction for the Sunflower Sequence
5.28 The Execution Cycle of the Inter and Intra Prediction for the Tractor Sequence
5.29 The Data Transmission via AHB Data Bus for the Blue_Sky Sequence
5.30 The Data Transmission via AHB Data Bus for the Pedestrian_Area Sequence
5.31 The Data Transmission via AHB Data Bus for the Riverbed Sequence
5.33 The Data Transmission via AHB Data Bus for the Station2 Sequence
5.34 The Data Transmission via AHB Data Bus for the Sunflower Sequence
5.35 The Data Transmission via AHB Data Bus for the Tractor Sequence
5.36 The Architecture of the Proposed Deblocking Filter
5.37 The Proposed Edge Filtering Order with Fine-Grained Synchronization Capability
6.1 The Improvement of Rate-Distortion Performance by the Proposed Low-Complexity Algorithm
6.2 The Improvement of Throughput and Hardware Cost by the Proposed Effective Design Space Exploration
6.3 The Improvement of Area-Speed Efficiency by the Proposed High-Efficient Architecture
B.1 The flow chart of the bumping process in H.264/AVC. The "smallest_poc" stands for the smallest picture order count (POC) among the non-output pictures in the DPB.
B.2 The flow chart of the proposed constant-rate bumping process. The "smallest_poc" stands for the smallest picture order count (POC) among the non-output pictures in the DPB and the "RB" denotes the regulation buffer.
B.3 Example for Comparing the Variable-Rate Bumping Process and the Proposed Constant-Rate Bumping Process

LIST OF NOTATIONS

$HT(\cdot)$    Operation of integer transform
$IHT(\cdot)$    Operation of inverse integer transform
$Q(\cdot)$    Operation of quantization
$DQ(\cdot)$    Operation of de-quantization
$PRED(\cdot)$    Operation of pixel prediction
$P_e$    Encoding process from pixel domain to transform domain
$P_d$    Decoding process from transform domain to pixel domain
$MC(\cdot)$    Operation of motion compensation
$IP_n(\cdot)$    Operation of intra prediction with mode n
$r_n$    Original residue of the n-th block
$x_n$    Decoded pixels of the n-th block
$r'_n$    Refined residue of the n-th block
$x'_n$    Decoded pixels of the n-th block after refinement
$c$    [...]
$e_n$    Quantization error of the n-th block
$P[i]$    Cycle count of the system at the i-th pipelined stage
$E_j[i]$    Cycle count of the j-th component at the i-th pipelined stage
$u_j[i]$    Hardware utilization of the j-th component at the i-th pipelined stage

1 Introduction

1.1 Overview of Dissertation

The Video Embedding Service (VES), also known as the Picture-in-Picture (PIP) feature, has attracted wide attention with the rapid growth of multimedia applications. With the PIP functionality, the VES enables live viewing of a second channel, advanced video mosaics, next-generation video programming guides, and user-configurable multi-view screens. The applications are summarized as follows, and the visual experience of these applications is illustrated in Figure 1.1.

• Internet protocol television service (IPTV) such as video over IP (VoIP) [1]
• Internet/Mobile interactive applications such as video on demand (VoD)
• Commercial insertion [2]
• Video conferencing [3]
• Video surveillance
• Entertainment [4]

Figure 1.1: The Applications of the Video Embedding Service: (a) Live TV Program of CBBC. (b) Leatek's WinFast Series Product. (c) Disney Supports "Full Motion, Picture-in-Picture Bonus Features" in Newly Released Blu-Ray DVD. (d) Multiple Stream Media Processor of DiscoveryBiz. (e) TV Setup Box of Allthings. (f) Sony Debuts In-Car Navigation with PIP in Japan. (g) Videoconferencing System of Wirered Company. (h) The PIP Functionality in Viliv X2 AIO. (i) YouTube Overlays Ads during Video Streaming (InVideo). (j) Video Surveillance System.


Video compression is essential to the delivery and distribution of multimedia information. In most multimedia applications, video content is stored in a compressed format to minimize transmission bandwidth and storage requirements. With its superior coding efficiency and network friendliness, H.264/AVC [6] is regarded as the multimedia standard for service providers to deliver digital video content over Local Area Networks (LAN), Digital Subscriber Line (DSL), Integrated Services Digital Network (ISDN), and third generation (3G) mobile systems [7]. In particular, the next generation Internet protocol television service (IPTV) could be realized with H.264/AVC over very-high-bit-rate DSL (VDSL), which can support higher transmission rates of up to 52 Mbps [8]. The high transmission bandwidth facilitates the development of video services with more functionality and higher interactivity for video over DSL applications.

1.1.1 Motivation

To address compressed video embedding over Internet and wireless channels, the idea of multiple-window video embedding transcoding (MW-VET) is proposed to deliver selected video contents encapsulated as one single bitstream. In such applications, the video may be transmitted over error-prone channels with limited bandwidth. To transmit the video contents via a single channel, the transcoder embeds downsized video frames, as the foreground pictures, into another frame with a specified resolution.
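As a pixel-domain illustration of the embedding step, the sketch below pastes a downsized foreground frame into a background frame at a macroblock-aligned position and lists the background macroblocks it covers. The function names, the 16x16 alignment, and the list-of-rows frame layout are assumptions made for illustration; the transcoder proposed in this dissertation operates on compressed bitstreams rather than on raw pixels.

```python
MB = 16  # H.264/AVC macroblock size in luma samples


def embed_foreground(background, foreground, mb_x, mb_y):
    """Return a copy of background with foreground pasted at macroblock (mb_x, mb_y).

    Frames are lists of rows of luma samples (hypothetical layout).
    """
    bg_h, bg_w = len(background), len(background[0])
    fg_h, fg_w = len(foreground), len(foreground[0])
    x0, y0 = mb_x * MB, mb_y * MB
    if x0 + fg_w > bg_w or y0 + fg_h > bg_h:
        raise ValueError("foreground does not fit inside the background")
    out = [row[:] for row in background]
    for r in range(fg_h):
        out[y0 + r][x0:x0 + fg_w] = foreground[r]
    return out


def covered_macroblocks(fg_w, fg_h, mb_x, mb_y):
    """List the (x, y) indices of background macroblocks overwritten by the foreground."""
    cols = -(-fg_w // MB)  # ceiling division
    rows = -(-fg_h // MB)
    return [(mb_x + c, mb_y + r) for r in range(rows) for c in range(cols)]


# A QCIF (176x144) foreground covers 11 x 9 = 99 macroblocks.
print(len(covered_macroblocks(176, 144, 1, 1)))  # 99
```

Only the covered macroblocks, and the blocks that predict from them, differ from the original background bitstream, which is why reusing the incoming bitstream for the remaining blocks is attractive.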

The most straightforward implementation of transcoding is the conventional cascaded pixel domain transcoder (CPDT), which is nothing more than a concatenation of several decoders and an encoder. It offers drift-free performance, but at the highest computational cost: the complexity of the CPDT approach is relatively high at both the algorithm and the architecture levels. Furthermore, from the video compression perspective, the CPDT is inefficient because the existing correlations between the input and output bitstreams are not utilized at all.

To address the high complexity of the CPDT, our goal is to implement an H.264/AVC based VET transcoder while simultaneously (1) increasing throughput, (2) improving quality, (3) increasing rate-distortion (R-D) performance, (4) reducing cost, and (5) increasing cost-effectiveness. First, a low-complexity transcoding algorithm is proposed that also improves the R-D performance compared with the CPDT approach. Second, a highly efficient architecture is presented that combines an H.264/AVC based video embedding transcoder (VET) and a decoder so that the throughput can be further improved. Third, the design space is explored at a higher level of abstraction to find the most cost-effective design alternative. Fourth, micro-architectures are designed for the three hot-spot modules within our system architecture. Specifically, this dissertation focuses on the following aspects.

1.1.2 1st Focus: Low-Complexity Algorithm Development

To maintain transcoded picture quality and to reduce the overall complexity, three transcoding techniques are presented for a multiple-window video embedding transcoder (MW-VET) at the algorithm level: 1) slice-group-based transcoding (SGT), 2) reduced frame memory transcoding (RFMT), and 3) syntax level bypassing (SLB). The application of each transcoding technique depends on the data partitions of the archived bitstreams and the paths of error propagation. For slice-aligned data partitions, the SGT, which composes the VET bitstreams at the bitstream level, provides the highest throughput. For region-aligned data partitions, the RFMT efficiently refines the prediction mismatch and increases the throughput while maintaining better R-D performance. For blocks that are not affected by the drift error, the SLB de-multiplexes and multiplexes the bitstreams into a VET bitstream at the syntax element level.

[...] the coding efficiency, we use information from the incoming bitstream to simplify the re-encoding. In particular, we propose motion vector re-mapping and intra mode switching to eliminate motion estimation and mode decision, which are the most computationally intensive functional blocks. On the other hand, to alleviate the quality degradation due to drift error while minimizing the computational complexity, we perform error correction for a few key blocks, which occupy only a small part of the frame.

Our simulation results show that the proposed algorithm reduces the processing complexity, measured as execution time, by a factor of 25 with similar or even higher R-D performance compared with the conventional cascaded pixel domain transcoder. In particular, the proposed algorithm achieves a quality improvement of up to 1.5 dB in Peak Signal to Noise Ratio (PSNR).
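For reference, PSNR for 8-bit video follows the usual definition, PSNR = 10 log10(255^2 / MSE), where MSE is the mean squared error between the reference and the test frame. A minimal sketch, assuming frames are given as equally sized lists of rows of luminance samples; the function name is hypothetical.

```python
import math


def psnr(reference, test, peak=255):
    """PSNR in dB between two equally sized frames given as lists of rows."""
    squared_error = 0
    count = 0
    for ref_row, test_row in zip(reference, test):
        for a, b in zip(ref_row, test_row):
            squared_error += (a - b) ** 2
            count += 1
    if squared_error == 0:
        return math.inf  # identical frames
    mse = squared_error / count
    return 10 * math.log10(peak ** 2 / mse)
```

On this logarithmic scale, a gain of 1.5 dB corresponds to a reduction of the mean squared error by a factor of about 1.41.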

1.1.3 2nd Focus: System Architecture Design Space Exploration

To further increase the throughput, we implement the proposed low-complexity algorithm with a highly efficient architecture that combines an H.264/AVC VET transcoder and a decoder. We use an ARM platform-based design and partition the system into several dedicated modules to provide task-level parallelism. In addition, the proposed architecture can perform video decoding and transcoding alternately or simultaneously. Such VET-enabled decoders can find applications in a peer-to-peer service community where every user enjoys versatile video embedding services while contributing part of the computation power.

To solve the problems in traditional platform-based design, we implement most functional blocks as dedicated modules interconnected with shared ping-pong memories to provide massive parallelism. First, all the computationally intensive functional blocks are implemented as hardware modules so that the workload of the ARM core is significantly reduced. Second, we connect the modules so that most data communication takes place through the video pipe instead of between the ARM core and the AHB bus. Third, we allocate ping-pong buffers between adjacent pipeline stages so that computation and communication cycles can be overlapped.

The system performance of a pipelined system mainly depends on the synchronization overhead. The pipeline is limited by the slowest module in the video pipe, and because each module consumes a different number of cycles at each pipeline stage according to the coding characteristics, synchronization overhead is introduced. However, the traditional bottom-up design methodology focuses on the individual performance of each module rather than on the impact of synchronization overhead on system-level performance. In this thesis, we therefore use a top-down design methodology to examine the effects of different design combinations on system performance, which enables us to explore the design space and find an optimized tradeoff between cost and performance. In addition, we analyze the load balancing by normalizing the resource utilization weighted by the associated cost. Another factor in the synchronization overhead is the level of synchronization granularity. In general, coarser synchronization granularity incurs less synchronization overhead at the expense of more memory. Most designs adopt a macroblock-level pipeline for the inter prediction and the deblocking filter for the best throughput. In this dissertation, we investigate the impact of different levels of synchronization granularity on the synchronization overhead.
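The effect of synchronization overhead can be sketched with the notation of this dissertation, where E_j[i] is the cycle count of the j-th component at the i-th pipelined stage and P[i] is the cycle count of the system at that stage. The relations used below, P[i] = max_j E_j[i] and u_j[i] = E_j[i] / P[i], are assumptions for a lock-step pipeline and are not taken from the text; the area weighting and the sample numbers are likewise only illustrative.

```python
def stage_cycles(component_cycles):
    """P[i]: each stage lasts as long as its slowest component (assumed lock-step)."""
    return [max(stage) for stage in component_cycles]


def utilization(component_cycles):
    """u_j[i] = E_j[i] / P[i]: fraction of stage i in which component j is busy."""
    return [[e / max(stage) for e in stage] for stage in component_cycles]


def area_weighted_utilization(component_cycles, areas):
    """Average utilization over all stages, weighted by each component's area."""
    total_area = sum(areas)
    per_stage = [
        sum(u * a for u, a in zip(stage, areas)) / total_area
        for stage in utilization(component_cycles)
    ]
    return sum(per_stage) / len(per_stage)


# E[i][j]: hypothetical cycle counts of three modules over two pipeline stages
E = [[200, 120, 80], [100, 250, 50]]
print(stage_cycles(E))  # [200, 250]
```

In this toy example the second module idles for 80 of the 200 cycles of the first stage while it waits for the slowest module; that idle time is the synchronization overhead the exploration tries to reduce.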

Lastly, the proposed architecture is verified and simulated at the system level using the transaction level modeling (TLM) technique. We implement the architecture to ensure the correctness of the data flow at the system level and perform the design space exploration for two system parameters: (1) the level of synchronization granularity and (2) the design combinations within the video pipe. These models are evaluated at the TLM level, which minimizes the modeling effort and increases the simulation speed so that 185 design alternatives can be explored. We then evaluate the system performance in terms of the average cycle count, the equivalent gate count, and the cost-weighted hardware utilization. According to the system-level simulation with TLM, the design alternative optimized for cost can reduce the area to as little as 25% of that of the previous design. The design alternative optimized for speed can increase the throughput to up to twice that of the previous design, such that our design can fulfill the real-time requirement for 1920x1080 @ 60 Hz video when clocked at 135 MHz.
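The real-time figure can be sanity-checked with simple arithmetic. The sketch below assumes 16x16 macroblocks and rounds the frame dimensions up to whole macroblocks (so 1080 lines are coded as 1088); the function name is hypothetical. Under these assumptions a 135 MHz clock leaves roughly 275 cycles per macroblock on average.

```python
def cycle_budget_per_mb(width, height, fps, clock_hz, mb_size=16):
    """Average number of clock cycles available per macroblock."""
    mbs_wide = -(-width // mb_size)    # ceiling division
    mbs_high = -(-height // mb_size)
    mbs_per_second = mbs_wide * mbs_high * fps
    return clock_hz / mbs_per_second


# 1920x1080 at 60 frames per second with a 135 MHz clock.
budget = cycle_budget_per_mb(1920, 1080, 60, 135_000_000)
```

The result is the average time slot that the video pipe can spend on one macroblock, including any synchronization overhead between stages.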

1.1.4 3rd Focus: Highly Efficient Micro-Architecture Design

We further focus on the micro-architecture design of the three key modules in our proposed system architecture: (1) the memory sub-system, (2) the inter and intra prediction (IIP), and (3) the deblocking filter (DF).

To efficiently utilize the throughput of the external DRAM, we propose an efficient memory sub-system that includes (1) an interleaved data arrangement scheme for improving the efficiency of DRAM access, (2) an external memory interface for controlling the mobile DDR SDRAM, and (3) a synchronization buffer for data caching. In particular, the synchronization buffer is employed as a bridge for reformatting the read/write data exchanged between the on-chip hardware and the off-chip DRAM. In addition, we optimize the issuing of read/write commands and adaptively enable the auto-precharge function by monitoring the motion information of the input bitstream.

To increase the efficiency and the utilization, we propose a unified filtering architecture for the inter and intra prediction. First, the data paths are shared by the inter and intra prediction so as to increase hardware utilization and reduce hardware cost. Second, to minimize redundant computations in the pixel predictions, the FIR filtering is implemented by a re-programmable systolic architecture (SA). Thirdly, the proposed systolic architecture is fully utilized for any kind of interpolation and block partition. Fourthly, a local FIFO and memory are allocated for temporarily buffering the motion-compensated data and the intermediate data such that the motion-compensated data of a block partition are transferred without redundant transmission. According to the simulation results, our combined, systolic-based architecture for the inter and intra prediction achieves up to 4.5 times the throughput while reducing the bus bandwidth by up to 60%.
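As a point of reference for the FIR filtering mentioned above, the following sketch computes horizontal half-sample luma values in plain software. The 6-tap kernel (1, -5, 20, 20, -5, 1) with a rounding offset of 16 and a right shift by 5 is the H.264/AVC luma half-sample filter; the border clamping and the function names are assumptions of this sketch, and it does not model the proposed systolic architecture.

```python
TAPS = (1, -5, 20, 20, -5, 1)


def clip255(value):
    """Clip a filtered value to the 8-bit sample range."""
    return max(0, min(255, value))


def half_pel_row(row):
    """Half-sample values between row[i] and row[i + 1] for each i."""
    n = len(row)
    out = []
    for i in range(n - 1):
        acc = 0
        for k, tap in enumerate(TAPS):
            # Taps cover positions i-2 .. i+3; clamp at the row borders.
            pos = min(max(i - 2 + k, 0), n - 1)
            acc += tap * row[pos]
        out.append(clip255((acc + 16) >> 5))
    return out


# A hypothetical row with a sharp edge in the middle.
edge = half_pel_row([0, 0, 0, 0, 255, 255, 255, 255])
```

Each output sample needs six multiply-accumulate steps over neighboring inputs, and adjacent outputs reuse five of the six inputs, which is the kind of redundancy a shared, pipelined datapath can exploit.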

To reduce the memory and the latency for buffering and to explore the level of synchronization granularity, we propose a novel deblocking filter with fine-grained synchronization capability (FGSC). At finer granularities such as 8x8 or 4x4 blocks, our proposed deblocking filter processes the input data at each pipeline stage more efficiently than other deblocking filter designs with a macroblock-based processing order.

1.1.5 Attractive Applications of Video Embedding Transcoding

Figure 1.2 illustrates an application scenario for the VET that can be realized in practical applications such as video on demand (VoD). Upon a client's request, the video server can use the transcoding technique to compose two or more previously compressed bitstreams into one and then send the VET bitstream to the specific client. The bit rate of the transmitted bitstream becomes much lower because the secondary bitstreams are embedded directly instead of being carried alongside the primary bitstream. Implementing the VES at the client side is neither cost-effective, as it requires multiple decoders at the client, nor bandwidth-efficient, as it needs to transmit multiple bitstreams to the receiver. In addition, the VET can provide compliant bitstreams that facilitate the VES video playback in a way transparent to a normal decoder at the client side.

Figure 1.2: One of the Applications of Video Embedding Transcoder: Transcoding Server. (Diagram labels: Transcoding Server (Video Middleware), Video Content Server, Surveillance Cameras, Video Phones, IP, Uplink H.264/AVC Bitstream, Downlink H.264/AVC VET Bitstream.)

1.2 Organization and Contribution

In this dissertation, a low-complexity transcoding algorithm is developed to deliver higher coding efficiency with limited drifting errors. Moreover, a highly efficient system architecture with a hybrid pipelined scheme is proposed to further improve the transcoding throughput. Although the proposed schemes are mainly developed for the Baseline Profile, these techniques are shown to be extensible to the Main Profile and High Profile in H.264/AVC. The rest of this thesis is organized as follows:

Chapter 2 presents the related work on video embedding techniques and shows the challenges in H.264/AVC-based video transcoding.

Chapter 3 proposes a low-complexity algorithm that reduces the complexity of the traditional transcoding scheme while improving the rate-distortion performance during transcoding. Specifically, the contributions in the algorithm development include the following:

VET transcoding.

– The methodology of partial re-encoding is utilized; it not only reduces the complexity of transcoding but also preserves the picture quality.

– Each coded block is classified into one of three types according to its predictor. With this classification, an efficient algorithm is employed such that the wrong reference problem can be significantly alleviated.

– To exploit the correlations among the bitstreams before and after transcoding while maintaining the coding efficiency, we propose intra mode switching and motion vector re-mapping to eliminate the motion estimation and the mode decision, which are the most computationally intensive functional blocks.

– For the blocks that are affected by the drift error, only key blocks are refined rather than all the affected blocks, such that the complexity can be further reduced.

– For the blocks that are not affected by the drift error, the original syntax elements are bypassed from the original bitstreams into the VET bitstream to eliminate the re-encoding process, which is computation-intensive and harmful to the transcoded quality.

– As compared to the conventional cascaded pixel-domain transcoder, our algorithm increases the transcoding throughput by 25 times while providing a 0.3 to 1.5 dB PSNR improvement.

Chapter 4 describes the problem of architecture design and presents the proposed highly efficient system architecture, in which hardware efficiency and utilization are significantly increased. In addition, we utilize the top-down design methodology for design space exploration. Our contributions in the system architecture design space exploration include the following:

– A system architecture for H.264/AVC is proposed to combine the video embedding transcoder (VET) and the decoder.

– An architectural exploration of synchronization granularity.

– A design space exploration of the design balancing of the video pipe.

– As compared to the traditional hardware design, more system design factors are explored such that the best alternative under the design constraints can be chosen effectively.

– As compared to the previous system design, the system performance can be increased by 2 times or the memory requirement can be reduced to 25%, depending on the optimization objective. Alternatively, we can obtain the design alternative with the highest speed-cost efficiency by tracing the Pareto frontier.

– The design alternative with the highest speed can achieve a transcoding (or decoding) throughput of 1920x1088@60Hz when clocked at 135MHz.

Chapter 5 details the micro-architecture design for the three hot-spot modules within our system architecture. Specifically, the contributions include the following:

– To efficiently utilize the throughput of the external DRAM, a synchronization buffer is employed as a bridge for reformatting the read/write data exchanged between the on-chip hardware and the off-chip DRAM.

– To increase the efficiency and the utilization, a unified filtering architecture is proposed for the inter and intra prediction; it consumes fewer execution cycles on average while its hardware cost is lower than that of the state-of-the-art designs. Furthermore, we provide an exploration of the parallelism of the systolic-based inter and intra prediction.

– We propose a novel processing order for the deblocking filter that provides fine-grained synchronization capability. In addition, the memory requirement of the proposed deblocking filter is lower than that of a macroblock-based deblocking filter.

– As compared to the conventional adder-tree-based inter and intra prediction, our systolic-based design increases the transcoding throughput by up to 4 times while reducing the amount of data transmitted via the AHB data bus by 65%.

Chapter 6 summarizes our work and outlines future research activities.

Lastly, Appendix A shows why transform domain approaches are inefficient for H.264/AVC-based transcoding. Appendix B shows the constant-rate bumping process for displaying the decoded pictures during transcoding.

In summary, this dissertation focuses on exploration at both the algorithm and architecture levels. Figure 1.3 illustrates the techniques to be discussed in this work and presents them as an ordered, step-by-step process.
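The idea of tracing the Pareto frontier, mentioned among the Chapter 4 contributions, can be sketched in a few lines. This is a minimal illustration assuming that each design alternative is summarized by two costs to be minimized, an average cycle count and an equivalent gate count; the design names and numbers are hypothetical and are not results from this work.

```python
def pareto_frontier(designs):
    """Return the non-dominated designs, sorted by cycle count.

    designs is a list of (name, cycle_count, gate_count) tuples,
    where lower is better for both costs.
    """
    frontier = []
    for name, cycles, gates in designs:
        dominated = any(
            (c <= cycles and g <= gates) and (c < cycles or g < gates)
            for _, c, g in designs
        )
        if not dominated:
            frontier.append((name, cycles, gates))
    return sorted(frontier, key=lambda d: d[1])


# Hypothetical design alternatives: (name, cycles per MB, kilo-gates).
alternatives = [
    ("A", 300, 100),
    ("B", 250, 140),
    ("C", 320, 120),
    ("D", 200, 220),
    ("E", 250, 150),
]
frontier = pareto_frontier(alternatives)
```

A design is kept only if no other design is at least as good in both costs and strictly better in one, so the designer picks from the frontier according to the speed or cost constraint at hand.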


CHAPTER 2 Background and Related Work on Video Embedding Transcoding

2.1 Introduction

This chapter presents the background and related work on video embedding techniques and then shows the challenges in H.264/AVC-based video transcoding. The video information embedding technique is essential to several multimedia applications such as picture-in-picture (PIP), multi-channel mosaic, screen-split, pay-per-view, multi-channel browsing, commercial and logo insertion, and other visual information embedding services. For video embedding applications, video embedding transcoding (VET) is essential to deliver multiple-window video services over one transmission channel. The rest of this chapter is organized as follows: Section 2.2 explains why we realize the video embedding service by video embedding transcoding. Section 2.3 describes the generalized problem of video transcoding. Section 2.4 formulates the wrong reference problem in VET transcoding. Section 2.5 reviews related work on VET. Section 2.6 elaborates on the challenges in H.264/AVC-based VET. Lastly, Section 2.7 summarizes this chapter.

2.2 Realization of Video Embedding Service

The video embedding service can be realized at the client side, where multiple sets of tuners and video decoders acquire the video content of multiple channels to form one frame. The content delivery side sends all the bitstreams of the selected channels to the client, while the client side reconstructs the pixels with an array of decoders in parallel and then re-composes the pixels into a single frame in the pixel domain at the receiver. Each receiver needs N decoders running with a powerful picture composition tool to tile the varying-size pictures from N channels. Thus, the overall cost increases with N. To reduce the cost of the VET service, fast pixel composition and reduced memory access can be achieved through the architecture design [9][10][11][12]. When realizing the VET feature at the client side, the key issues are inefficient bandwidth utilization and high hardware complexity, which hinder the deployment of multiple-window embedding applications.

To increase the bandwidth efficiency of bitstream transmission and reduce the hardware complexity at the client side, the video embedding service can alternatively be realized at the server/studio side to deliver selected video contents encapsulated as one bitstream. To transmit the video contents via a single channel, the MW-VET embeds downsized video frames into another frame with a specified resolution as the foreground areas. It can provide preview frames or thumbnail frames by tiling a two-dimensional array of video frames from multiple television channels simultaneously. With the MW-VET, users can acquire multiple-channel video contents simultaneously. Moreover, the MW-VET bitstreams are compliant with H.264/AVC, which facilitates the multiple-window video playback in a way transparent to the decoder at the client side.

The challenges are to simultaneously maintain the best picture quality after transcoding, increase the flexibility of picture insertion, minimize the archival space of the bitstreams, and reduce the computational complexity. To optimize the rate-distortion (R-D) performance, the bits of the newly covered blocks in the background picture are replaced by the bits of the blocks in the foreground pictures. To increase the flexibility of picture insertion, the foreground pictures are inserted at the macroblock boundaries of the processing units. To minimize the bitstream storage space, the H.264/AVC coding standard is adopted as the target format. To decrease the computational complexity, a low-complexity algorithm for composition is needed. Therefore, we propose a fast H.264/AVC-based multiple-window VET (MW-VET), which encapsulates on the fly multiple channels of video content from a set of pre-compressed bitstreams into one bitstream before transmission.

For real-time applications, video transcoding should retain the R-D performance with the lowest complexity, minimal delay, and the smallest memory requirement [13]. In particular, the MW-VET should maintain good quality after multi-generation transcoding, which may aggravate the quality degradation. An efficient VET transcoder is critical to address the issue of quality loss.

2.3 Problem Statement of Video Transcoding

Generally, the transcoding process can be viewed as the modification of the incoming residue according to the changes in the prediction. As shown in Figure 2.1 (a), the output of transcoding is represented by

\[
R'_n = Q\left[HT\left(r'_n\right)\right]
     = Q\left\{HT\left[\bar{x}_n - PRED_2\left(\bar{x}_j\right)\right]\right\}
     = Q\left\{HT\left[\bar{r}_n + PRED_1\left(\bar{x}_i\right) - PRED_2\left(\bar{x}_j\right)\right]\right\}
\tag{2.1}
\]

where the symbols $HT$ and $Q$ indicate an integer transformation and quantization, respectively. The symbols $\bar{r}_n$ and $r'_n$ denote the residue before and after the transcoding. The symbols $PRED_1(\bar{x}_i)$ and $PRED_2(\bar{x}_j)$ represent the predictions from the reference data $\bar{x}_i$ and $\bar{x}_j$, respectively. In this thesis, a bar above a variable denotes the reconstructed value after decoding, and a prime denotes the refined value after transcoding. The subscript of each variable is the block index. The process of embedding the foreground videos onto the background can incur drift error in the prediction loop since the reference frames at the decoder and the encoder are not synchronized.

When the predictions before and after the transcoding are identical, Figure 2.1 (a) can be simplified to Figure 2.1 (b). The quantized data $\bar{r}_n$ incur no further quantization distortion when re-quantized with the same quantization step. Thus, the transcoded bitstream has almost identical R-D performance to the original bitstream, as represented in Eq. (2.2):

\[
P_d\left(P_e\left(\bar{r}_n\right)\right)
  = IHT\left\{DQ\left\{Q\left[HT\left(\bar{r}_n\right)\right]\right\}\right\}
  = \bar{r}_n
\tag{2.2}
\]

where the symbol $P_e$ denotes the encoding process from the pixel domain to the transform domain, and the symbol $P_d$ denotes the decoding process from the transform domain back to the pixel domain. The symbols $IHT$ and $DQ$ denote an inverse integer transformation and de-quantization, respectively.
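The property expressed by Eq. (2.2) can be illustrated numerically. The sketch below is a simplified model, not the H.264/AVC quantizer: it assumes a uniform scalar quantizer with a fixed step, omits the integer transform (treating it as an identity), and uses made-up residue values. It shows that re-coding an already decoded residue with the same step changes nothing, whereas a changed prediction shifts the residue and can introduce new distortion.

```python
def quantize(value, step):
    """Uniform scalar quantizer, rounding half away from zero."""
    sign = -1 if value < 0 else 1
    return sign * ((abs(value) + step // 2) // step)


def dequantize(level, step):
    """Reconstruct a value from its quantization level."""
    return level * step


def recode(residue, step):
    """One encode/decode pass with the transform omitted."""
    return dequantize(quantize(residue, step), step)


step = 6
residues = [-17, -4, 0, 5, 13, 40]   # hypothetical residue samples
decoded = [recode(r, step) for r in residues]

# Unchanged prediction: re-coding the decoded residue is lossless.
same_prediction = [recode(r, step) for r in decoded]

# Changed prediction: the residue shifts by the prediction difference
# and is quantized again; subtracting the shift gives the value that
# should have equalled the originally decoded residue.
shift = 2
changed_prediction = [recode(r + shift, step) - shift for r in decoded]
```

Comparing `same_prediction` and `changed_prediction` against `decoded` mirrors the distinction drawn in the text between blocks whose predictors are unchanged and blocks whose predictors are modified by the embedding.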

Figure 2.1: Illustration of a Novel Transcoder: (a) The Simplified Transcoding Process. (b) The Simplified Transcoder When the Prediction Blocks Are the Same. (c) The Fast Transcoder That Bypasses the Input Transform Coefficients.

The transcoder can then be further simplified to the form in Figure 2.1 (c), where the data of the original bitstreams can be bypassed without any modification. This leads to a transcoding scheme with the highest R-D performance and the lowest complexity.

Video transcoding is intended to maximize the R-D performance with the lowest complexity. Therefore, the remaining issue is to transcode the incoming data efficiently such that picture quality is maximized at the lowest complexity. Specifically, the incoming data are refined only when the reference pixels are modified, in order to alleviate the propagation error. To reduce computational cycles and preserve picture quality, the residue data with identical reference pixels are bypassed.

2.4 Wrong Reference Problem Formulation

Based on the occurrence of modified reference pixels and the paths of error propagation in the prediction loop, the MBs are classified into three types: w-MB, p-MB, and n-MB. As shown in Figure 2.2, the small rectangle denotes the foreground picture (FG) and the large rectangle denotes the background picture (BG). Each small square within a rectangle represents one MB. The w-MBs are the blocks whose reference samples are entirely or partially replaced by the newly inserted pictures. The p-MBs are the blocks whose reference pixels are composed of pixels in w-MBs. The remaining MBs of the background pictures are the unaffected MBs, denoted as n-MBs. We observe that most of the MBs within the processing picture are p-MBs and only a small percentage of MBs are w-MBs. For the w-MBs, the coding modes or motion vectors of the original bitstream are modified to fix the wrong reference problem. For the p-MBs, the wrong reference problem is inherited from the w-MBs; thus, the coding modes and motion vectors should preferably be refined for each p-MB. All the information of the n-MBs in the original bitstream can be bypassed because the predictors before and after transcoding are identical.
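The three-way classification can be sketched as a small graph traversal. The code below is an illustration under stated assumptions rather than the procedure used in this work: each MB is identified by a hypothetical integer id, `refs` maps an MB to the set of MBs it predicts from, `covered` is the set of MBs replaced by the foreground, and the p-MB label is propagated transitively along the prediction paths.

```python
from collections import deque


def classify_macroblocks(refs, covered):
    """Label each background MB as 'w', 'p', or 'n'."""
    # Reverse edges: which MBs predict from a given MB.
    users = {}
    for mb, sources in refs.items():
        for src in sources:
            users.setdefault(src, set()).add(mb)

    background = [mb for mb in refs if mb not in covered]
    labels = {}
    queue = deque()

    # w-MBs: reference samples overlap the newly inserted area.
    for mb in background:
        if refs[mb] & covered:
            labels[mb] = "w"
            queue.append(mb)

    # p-MBs: the wrong reference is inherited along prediction paths.
    while queue:
        src = queue.popleft()
        for mb in users.get(src, ()):
            if mb in covered or mb in labels:
                continue
            labels[mb] = "p"
            queue.append(mb)

    # n-MBs: everything else in the background is unaffected.
    for mb in background:
        labels.setdefault(mb, "n")
    return labels


# Hypothetical prediction dependencies; MB 0 is covered by the foreground.
refs = {0: set(), 1: {0}, 2: {1}, 3: {2}, 4: set(), 5: {4}}
labels = classify_macroblocks(refs, covered={0})
```

In this toy example a single w-MB produces a chain of p-MBs, which is consistent with the observation that p-MBs can greatly outnumber w-MBs.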


Figure 2.2: Illustration of the Wrong Reference Problem

2.5 Related Work on Video Embedding Transcoding

Depending on the operating domain, transcoders can be classified as either pixel domain or transform domain approaches. In the following, we review the most straightforward implementation of a VET transcoder and two fast VET transcoding algorithms for the MPEG-2 standard.

2.5.1 Cascaded Pixel Domain Transcoder (CPDT)

The cascaded pixel domain transcoder (CPDT) cascades multiple decoders, a pixel domain composer, and an encoder, as shown in Figure 2.3. It decompresses multiple bitstreams, composes the decoded pixels into one picture, and re-compresses the picture into a new bitstream. It offers drift-free performance since the exhaustive re-encoding process of the CPDT prevents drift errors from propagating over the whole group of pictures.

However, the CPDT suffers from noticeable visual quality degradation and high complexity. Specifically, the re-quantization process decreases the quality of the original bitstreams. This irretrievable quality degradation is exacerbated especially when the foreground pictures are inserted at different times using the CPDT over multiple iterations. In addition, the cost of a CPDT is relatively high at both the algorithm and architecture levels since the computation-intensive re-encoding process makes the CPDT too costly for real-time video content delivery. From the video compression perspective, the CPDT is not efficient because the existing correlations between the input and output bitstreams are not utilized at all. The complexity and memory requirement of the CPDT could be reduced with fast algorithms that exploit these correlations to remove transformation and motion estimation.

Figure 2.3: The Architecture of the CPDT. (Diagram labels: BG Bitstream and FG Bitstreams 1 to N, each with an H.264 Decoder; PDC (Pixel-Domain Composition); H.264 Encoder; PIP Bitstream.)
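The pixel-domain composition step of the CPDT can be sketched as follows. This is a minimal illustration that assumes frames are lists of rows of luma samples, macroblocks are 16x16, and the foreground is pasted at a macroblock-aligned position; the decoding and re-encoding stages are omitted, and the function name `compose` is hypothetical.

```python
MB = 16  # macroblock size in samples


def compose(background, foreground, mb_x, mb_y):
    """Return a copy of background with foreground pasted at MB (mb_x, mb_y)."""
    top, left = mb_y * MB, mb_x * MB
    fg_h, fg_w = len(foreground), len(foreground[0])
    if top + fg_h > len(background) or left + fg_w > len(background[0]):
        raise ValueError("foreground does not fit inside the background")
    out = [row[:] for row in background]
    for y in range(fg_h):
        out[top + y][left:left + fg_w] = foreground[y]
    return out


# Hypothetical 48x32 background and 16x16 foreground.
bg = [[0] * 48 for _ in range(32)]
fg = [[9] * 16 for _ in range(16)]
picture = compose(bg, fg, mb_x=1, mb_y=1)
```

The composition itself is cheap; in the CPDT the cost lies in the full decoding before it and the full re-encoding after it.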

2.5.2 DCT Domain Transcoding with Motion Vector Re-mapping

The discrete cosine transform (DCT) domain inverse motion compensation (IMC) approach proposed by Chang et al. [14][15][16] contributes to DCT domain transcoding for the MPEG-2 standard. Matrix translation manipulations are used to synthesize a DCT block that is not aligned to the boundaries of the 8x8 blocks in the DCT domain. The DCT domain IMC has lower complexity than the forward and inverse transforms, so the transcoder can manipulate the bitstreams in the DCT domain and avoid transforming back and forth. Chang's approach could achieve a 10% to 30% speed-up over the CPDT. Other algorithms to speed up the DCT domain IMC are given in [17][18][19].


The motion estimation can be eliminated with motion vector re-mapping (MVR), where the new motion vectors are obtained by examining only the two most likely candidate motion vectors located at the edges outside the foreground picture. It simplifies the re-encoding process with negligible picture quality degradation. In addition, Chang refines the residue of w-MBs and p-MBs to correct all the drift error caused by the refinement error.

2.5.3 DCT Domain Transcoding with Backtracking

A DCT domain transcoder based on a backtracking process is proposed by Yu et al. [20] to further improve the transcoding throughput. The backtracking process finds the affected macroblocks (MBs) of the background pictures in the motion prediction loop. Since only a small percentage of the MBs in the background are affected, only the damaged MBs are fixed and the unaffected MBs are bypassed.

In practice, for the most effective backtracking, the future motion prediction path of each affected MB needs to be analyzed and stored in advance. To construct the motion prediction chains, Chang [14][15][16] completely reconstructs all the refined reference frames in the DCT domain for each group of pictures (GOP). With the motion prediction chains, the transcoder decodes the minimum number of MBs needed to render the correct video contents. The speedup of motion compensation is up to 90% at the cost of a buffering delay of one GOP period in the transcoder. The impact of the delay on real-time applications depends on the length of a GOP in the original bitstream.

2.6 The Challenge in H.264/AVC-based PIP Transcoding

As compared to the previous standards, H.264/AVC poses more challenges for video embedding transcoding, as follows:

1. Inference mismatch: Since the mode and motion information of the current block is inferred from the neighboring blocks, the foreground insertion incurs an inference mismatch and induces serious visual artifacts. Therefore, the syntax elements in the original bitstreams should be fully decoded, re-inferred, and re-encoded into the VET bitstream by an entropy encoder.

2. More sophisticated wrong reference problem: Due to the advanced prediction tools such as the variable block size, the various intra prediction modes and improved skip mode, the wrong reference problem becomes more complicated.

3. More complicated mode decision at the re-encoder side: Due to the variable block size and the various intra prediction modes, the mode decision at the re-encoder becomes more complicated.

4. More sophisticated motion vector re-mapping: When performing the motion vector re-mapping technique, the H.264/AVC-based VET transcoder should consider the impacts of the 6-tap interpolation and the deblocking filter across the boundary between the foreground and background pictures.

5. Transform-domain (HT-domain) IMC is inefficient: The existing approaches [17][18][19][20] convert MPEG-2 bitstreams in the transform domain for complexity reduction. However, applying the transform domain techniques to H.264/AVC is not feasible since the advanced coding tools, including the in-the-loop deblocking filter, directional intra prediction, and 6-tap sub-pixel interpolation, all operate in the pixel domain. The transform domain inverse motion compensation becomes inefficient when the motion compensation uses quarter-pixel resolution combined with 6-tap interpolation. In addition, the transformation and quantization processes in H.264/AVC are so optimized that traversing back to the pixel domain is not as expensive as before. Consequently, the transform domain techniques have higher complexity than the pixel domain techniques; that is, the pixel domain transcoder actually has lower complexity than the transform domain transcoder. For brevity, the detailed derivations are given in Appendix A.

6. Transform-domain (HT-domain) operations cause the drift error: The transform domain manipulation introduces drift since the motion compensation is based on the filtered pixels, which are the output of the in-the-loop deblocking filter. The filtering operation is defined in the pixel domain and cannot be performed in the transform domain due to its nonlinear operations [21][22][23]. In addition, the transform domain transcoding requires an inverse quantization process that introduces additional rounding error due to the division operation. As a result, the transform domain transcoder for the H.264/AVC standard typically leads to an unacceptable level of error, as shown in [24].

7. The backtracking method has slight benefit while introducing a delay of one GOP: The backtracking method proposed by Yu [20] is of little use for the H.264/AVC-based transcoder due to the deblocking filter, the directional intra prediction, and the interpolation filter. In particular, to track the prediction paths of H.264/AVC bitstreams, almost 100% of the blocks need decoding, which is far above the 10% reported in [20]. Thus, the expected complexity reduction is limited. Furthermore, it introduces an extra delay of one GOP period.

In summary, to speed up the CPDT, there are many fast algorithms that manipulate the incoming bitstreams in the transform domain. However, this is not the case for the H.264/AVC standard. To the best of our knowledge, all the state-of-the-art transcoding schemes with H.264 as the input bitstream format perform the fast algorithms in the pixel domain [21][25][22][23][26][27][28][29][30].


2.7 Summary

In this chapter, we present the background, the related work, and the challenges of H.264/AVC-based video embedding techniques. First, we realize the video embedding service by video embedding transcoding so as to (1) increase the information acquisition, (2) lower the bandwidth requirement, (3) reduce the power consumption of client devices, (4) provide H.264/AVC-compliant bitstreams, and (5) minimize the storage space of the video archive. Second, we formulate the wrong reference problem based on the occurrence of modified reference pixels and the paths of error propagation in the prediction loop. This formulation outlines the main problems in video embedding transcoding. Third, we review the related work on fast transcoding algorithms for MPEG-2. As compared to the previous standards, H.264/AVC poses more challenges for video embedding transcoding. Lastly, Section 2.6 elaborates on several reasons that demonstrate the necessity of pixel domain manipulation for H.264/AVC bitstreams. Therefore, we conclude that the spatial domain technique is a more realistic approach for H.264/AVC-based transcoding. To achieve low computational cost, less drift error, and small memory bandwidth, we develop the fast algorithm of an H.264/AVC-based video embedding transcoder in the spatial domain.

CHAPTER 3 Low-Complexity Algorithm of MW-VET

3.1 Introduction

This chapter proposes three low-complexity algorithm schemes of the H.264/AVC multiple-window video embedding transcoder (MW-VET) for various interactive and non-interactive applications that require video embedding services, including picture-in-picture (PIP), multi-channel mosaic, screen-split, pay-per-view, channel browsing, commercial and logo insertion, and other visual information embedding services. In particular, the MW-VET embeds multiple foreground pictures of smaller spatial resolution at macroblock-aligned positions. As the foreground bitstreams are encoded at full resolution, a downsizing transcoding [21][31][25] is needed prior to the VET transcoding. The spatial resolution adaptation transcoding has been widely investigated in the literature and is not studied herein. In addition, we impose restrictions on the foreground bitstreams to remove the dependencies on the background. In particular, the foreground bitstreams do not use unrestricted motion vectors or the DC intra mode for the blocks in the first column or the first row of the foreground. The resulting loss of R-D performance is negligible. In addition, we re-scale the DC coefficient of the first DC block within an intra-coded frame based on the reconstructed values of the neighboring pixels in the background. Except for the first block, the foreground bitstream can be directly inserted into the new one.

According to the type of data partition and the encoding scenario of the archived bitstream, the MW-VET adopts one of three algorithm schemes: (1) slice group based transcoding (SGT), (2) no frame memory transcoding (NFMT), and (3) reduced frame memory transcoding (RFMT). When the prediction is applied to slice-aligned data partitions within the original bitstream, the SGT simply merges the bitstreams into a VET bitstream by parsing and concatenating the syntax elements and provides the highest transcoding throughput. When the prediction is applied to region-aligned data partitions and a corresponding auxiliary bitstream is available, the NFMT can compose the VET bitstream by parsing, patching, and concatenating the syntax elements from the original bitstreams and the auxiliary bitstream. If there are neither slice-aligned data partitions within the original bitstream nor an auxiliary bitstream, the RFMT can efficiently transcode the input bitstreams with three block-level adaptive techniques based on the concept of partial re-encoding. Both the SGT and the NFMT are applicable only under specific conditions, which restricts their application. Therefore, we focus on the algorithm of the RFMT in this chapter.
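The choice among the three schemes described above can be written as a short decision function. This is a sketch of the selection logic as stated in the text; the function and parameter names are hypothetical, and how the transcoder detects each condition in a real bitstream is not modeled.

```python
def select_scheme(slice_aligned, region_aligned, has_auxiliary_bitstream):
    """Pick the MW-VET scheme from the properties of the archived bitstream."""
    if slice_aligned:
        # Prediction stays inside slice-aligned partitions.
        return "SGT"
    if region_aligned and has_auxiliary_bitstream:
        # Region-aligned prediction plus an auxiliary bitstream to patch from.
        return "NFMT"
    # General case: partial re-encoding with reduced frame memory.
    return "RFMT"


scheme = select_scheme(slice_aligned=False,
                       region_aligned=True,
                       has_auxiliary_bitstream=False)
```

The ordering reflects the description above: the cheaper schemes are tried first, and the RFMT is the fallback that works without any special encoding conditions.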

To maintain the transcoded picture quality and to reduce the overall complexity, we present three transcoding techniques in the RFMT: (1) intra mode switching (IMS), (2) motion vector re-mapping (MVR), and (3) syntax level bypassing (SLB). For region-aligned data partitions, the RFMT efficiently refines the prediction mismatch so as to keep the prediction schemes compliant with the H.264/AVC standard and to increase the transcoding throughput while maintaining better R-D performance. Specifically, when the prediction comes from the newly covered area without slice-group data partitions, the pixels in the affected macroblocks are transcoded by the RFMT based on the concept of partial re-encoding to minimize the number of refined blocks. The RFMT employs IMS and MVR to handle intra-coded blocks and inter-coded blocks, respectively. For the pixels outside the macroblocks that are affected by the newly covered reference frame, the SLB de-multiplexes and multiplexes the bitstreams into a VET bitstream at the bitstream level. Experimental results show that, as compared to the cascaded pixel domain transcoder (CPDT), which has the highest complexity, our RFMT reduces the processing complexity by a factor of 25 and retains a rate-distortion performance close to that of the CPDT. At certain bit rates, the RFMT can achieve up to 1.5 dB quality improvement in peak signal-to-noise ratio (PSNR).
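The division of labor among the three RFMT techniques can be summarized per macroblock. The sketch below only mirrors the dispatch rule stated above (affected intra blocks use IMS, affected inter blocks use MVR, unaffected blocks use SLB); the function name and the sample macroblock list are hypothetical, and the techniques themselves are not implemented here.

```python
from collections import Counter


def rfmt_tool(affected, intra_coded):
    """Pick the RFMT technique for one macroblock."""
    if not affected:
        return "SLB"  # bypass the original syntax elements
    return "IMS" if intra_coded else "MVR"


# Hypothetical macroblocks: (affected by the inserted area?, intra coded?)
macroblocks = [(False, False), (True, True), (True, False),
               (False, True), (True, False)]
usage = Counter(rfmt_tool(a, i) for a, i in macroblocks)
```

Only the blocks routed to IMS or MVR are partially re-encoded, so the share of blocks routed to SLB largely determines the complexity saving.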

In Table 3.1, we list all the symbol definitions used in this chapter. The rest of this chapter is organized as follows: Section 3.2 and Section 3.3 present the slice group based transcoding (SGT) and the no frame memory transcoding (NFMT), respectively. Section 3.4 proposes the reduced frame memory transcoding (RFMT). Section 3.5 shows the simulation results. Finally, Section 3.6 summarizes this chapter.

3.2 Slice-Group-Based Transcoding

Table 3.1: The Symbol Definitions

Symbol      Meaning
CAVLD       Context adaptive variable length decoding
CAVLC       Context adaptive variable length coding
LB          Line buffer
FM          Frame memory
DB          De-blocking filter
IP          Intra prediction
MC          Motion compensation
ME          Motion estimation
HT & Q      Integer transform and quantization
DQ & IHT    De-quantization and inverse integer transform
PDC         Pixel domain composition
RDO MD      Rate-distortion optimized mode decision
MUX         Multiplexer (syntax element selector)

The slice group based transcoding (SGT) is used when the prediction within the original bitstream of the background picture uses the aligned data partitions [31]. Based on the slice-aligned data partitions, the SGT operates at the bitstream level to provide the highest throughput with the lowest complexity. The rationale is that H.264/AVC defines slice group map types that assign a set of MBs to each slice group according to the adaptive data partition [6]. The concept of the slice group is to separate the picture into isolated regions so as to prevent error propagation, which provides error resiliency and random access. Each slice is regarded as an isolated region as defined in the H.264/AVC standard. For each region, the encoder performs the prediction and filtering processes without referring to the pixels of the other regions.
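To make the idea of isolated regions concrete, the sketch below builds a macroblock-to-slice-group map for a frame with rectangular foreground regions and a left-over background. It is loosely modeled on the foreground-with-left-over slice group map type of H.264/AVC; whether this work uses exactly that map type is not stated in this section, and the function name and the rectangle convention (inclusive MB coordinates) are assumptions made for illustration.

```python
def build_slice_group_map(width_mbs, height_mbs, foreground_rects):
    """Return a row-major list mapping each MB address to a slice group id.

    foreground_rects: list of (left, top, right, bottom) in MB units, inclusive.
    Foreground rectangle i gets id i; all remaining MBs get the background
    id len(foreground_rects).
    """
    background_id = len(foreground_rects)
    mb_map = [background_id] * (width_mbs * height_mbs)
    # Fill in reverse order so lower-numbered groups win where rectangles overlap.
    for group_id in reversed(range(len(foreground_rects))):
        left, top, right, bottom = foreground_rects[group_id]
        for y in range(top, bottom + 1):
            for x in range(left, right + 1):
                mb_map[y * width_mbs + x] = group_id
    return mb_map
```

For a 4x3-MB frame with one foreground rectangle covering MBs (1,1) to (2,1), the two covered MBs receive group 0 and every other MB receives the background group 1.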

For the video embedding feature using static slice groups, the large window denotes a background slice and the embedded small windows denote foreground slices. For example, a frame can be split into a background slice and several rectangular foreground slices, as shown in Figure 3.1. After video embedding transcoding, all the slices are encoded separately and encapsulated into one bitstream at the slice level. Based on archived H.264/AVC bitstreams with slice groups, a VET can replace the syntax elements of the MBs in the foreground slices with the syntax elements of other bitstreams with identical spatial resolutions. Therefore, all the syntax elements are directly forwarded as-is to the final bitstream via an entropy coder.
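The replacement of foreground syntax elements can be sketched as a copy between two macroblock lists. This is an illustration only: it treats the syntax elements of each macroblock as an opaque record, ignores slice headers and entropy coding, and uses the hypothetical name `merge_foreground` with an inclusive rectangle given in MB units.

```python
def merge_foreground(background_mbs, foreground_mbs, frame_width_mbs, rect):
    """Replace the MB records inside rect with those of a foreground bitstream.

    background_mbs: row-major per-MB records of the full frame.
    foreground_mbs: row-major per-MB records of the embedded video, whose
        resolution must match the rectangle exactly.
    rect: (left, top, right, bottom) in MB units, inclusive.
    """
    left, top, right, bottom = rect
    rect_width = right - left + 1
    rect_height = bottom - top + 1
    if len(foreground_mbs) != rect_width * rect_height:
        raise ValueError("foreground resolution must match the slice rectangle")
    merged = list(background_mbs)  # records outside the rectangle pass through unchanged
    for fy in range(rect_height):
        for fx in range(rect_width):
            dst = (top + fy) * frame_width_mbs + (left + fx)
            merged[dst] = foreground_mbs[fy * rect_width + fx]
    return merged
```

The resolution check mirrors the requirement in the text that the replacing bitstream has a spatial resolution identical to that of the foreground slice.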


