On sub-mW R-D Optimized Motion Estimation for Portable H.264/AVC


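National Chiao Tung University, Department of Electronics Engineering and Institute of Electronics. Master's Thesis. On sub-mW R-D Optimized Motion Estimation for Portable H.264/AVC. Student: Yen-Chi Shih (史彥芪). Advisor: Prof. Tian-Sheuan Chang (張添烜). July 2007.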


On sub-mW R-D Optimized Motion Estimation for Portable H.264/AVC

Student: Yen-Chi Shih
Advisor: Dr. Tian-Sheuan Chang

A Thesis Submitted to the Department of Electronics Engineering and Institute of Electronics, College of Electrical Engineering and Computer Science, National Chiao Tung University, in Partial Fulfillment of the Requirements for the Degree of Master of Science in Electronics Engineering

July 2007
Hsinchu, Taiwan, Republic of China


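Chinese Abstract

On sub-mW R-D Optimized Motion Estimation for Portable H.264/AVC
Student: Yen-Chi Shih. Advisor: Dr. Tian-Sheuan Chang
Department of Electronics Engineering and Institute of Electronics, National Chiao Tung University

This thesis presents a sub-mW motion estimation design with very high coding efficiency for the H.264/AVC coding standard. Implemented in the Artisan/TSMC 0.13 µm process (TSMC-CL013G-FSG) with a core area of 0.69 mm², the hardware encodes CIF video at 30 frames/s in real time with an average power of only 0.6 mW (0.2 mW dynamic) at a 1.2 V supply and a 20 MHz clock.

Through sophisticated prediction arithmetic, H.264/AVC greatly improves on the rate-distortion performance of earlier video coding standards. Because motion estimation accounts for most of the computational complexity and memory access bandwidth, it has become the implementation bottleneck of the video coding system as coding capability increases. Recent studies have proposed numerous real-time, low-cost motion estimation designs, but these designs rest mainly on insufficient design metrics and criteria and weigh hardware cost too heavily, which is why they are called "hardware-oriented" designs. Improper design criteria not only limit design flexibility but also sacrifice much of the system performance.

In view of these fallacies, this thesis proposes a sub-mW motion computing design methodology with very high coding efficiency. It comprises three state-of-the-art low-energy techniques:

1. Early macroblock-skipping detection: a likelihood ratio test decides, before motion estimation, whether the current macroblock should be coded as SKIP (block copy), which removes redundant computation and saves operating power.
2. Adaptive search range prediction: a block-matching scheme with an offset search center and an adaptive search range greatly improves the arithmetic efficiency of prior techniques and the effectiveness of motion compensation.
3. Internal memory optimization: a process-dependent, high-abstraction-level design method minimizes the memory access current, corrects the fallacies of prior designs, and greatly reduces the access energy of data buffering.

In addition, a low-switching absolute-difference logic design cuts the logic area and energy of previous designs by 50%, easing the hardware cost of fully parallel arithmetic. To further reduce the addressing power of the internal memory, a shortest distance coding method is mathematically proven to reduce the switching activity of the address logic at no extra design cost. Because the bits spent on motion vectors largely determine the coded bit count, an accumulator-based motion vector cost arithmetic is also proposed, which improves coding efficiency by up to 15% at low cost.

With the proposed methodology, the design delivers substantial power and rate-distortion benefits over prior work: with only 198.8 µW of dynamic power and at equal coding distortion, coding efficiency improves by up to more than 50%. In short, the proposed design criteria, algorithms, and architectures considerably improve the design metrics and system performance, including coding efficiency, logic area, and power consumption, for advanced codec implementations.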

On sub-mW R-D Optimized Motion Estimation for Portable H.264/AVC
Student: Yen-Chi Shih. Advisor: Dr. Tian-Sheuan Chang
Department of Electronics Engineering and Institute of Electronics, National Chiao Tung University

Abstract

This thesis presents a motion estimator design for portable, rate-distortion optimized H.264/AVC. The proposed design targets real-time processing of CIF 30 frames/s video with a core area of 0.69 mm² and a power dissipation of 0.6 mW (0.2 mW dynamic) when operating at 1.2 V and 20 MHz in the Artisan/TSMC 0.13 µm process (TSMC-CL013G-FSG).

Owing to advanced prediction with complex arithmetic, H.264/AVC achieves a significant rate-distortion (R-D) improvement over prior standards. Because motion estimation dominates computation and memory access complexity, it becomes the most crucial component of the video codec as coding capability increases. Numerous recent studies have sought cost-efficient, real-time motion estimation. However, most of these works are built on insufficient metrics and criteria and are therefore called hardware-oriented. Such improper limitations inherently restrict design flexibility and significantly degrade system performance. This motivates us to present a design methodology that achieves sub-mW, low-cost, high-performance motion computing. The proposed methodology includes three power-efficient techniques:

1. Early detection of macroblock skipping: a novel likelihood ratio test (LRT) detects, prior to motion estimation, whether the MB should be coded as SKIP, which removes computational redundancy and thus saves power.
2. Adaptive prediction of search boundaries: a biased block-matching scheme with a dynamic boundary significantly improves on the arithmetic efficiency and motion-compensated fidelity of prior art.
3. Optimization of the memory structure: minimizing the access current through high-level, technology-dependent analysis corrects a commonly assumed design fallacy and greatly reduces the power consumed in reference data buffering.

In addition, a low-switching absolute difference (AD) logic that uses half the area and power is presented to reduce the cost of a fully parallel design. To further minimize the power of address switching, a shortest distance bus coding is mathematically proven to be cost-free and effective. Since the bit-stream size is largely affected by the motion vector difference (MVD) in low-rate coding, an accumulator-structured logic is presented for motion vector cost arithmetic, which further improves coding efficiency by up to 15%.

With this methodology, our work offers substantial power and rate-distortion benefits over previous studies: an average dynamic power dissipation of 198.8 µW and a bitrate improvement of more than half at equal frame PSNR compared with prior full-search designs. In brief, the presented criteria, algorithms, and structures significantly improve the design metrics and outperform traditional designs in coding efficiency, silicon area, and power consumption for advanced codec implementation.

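Acknowledgement

"Ideals are like a guiding lighthouse. Without ideals there is no reliable direction; without direction, there is no life." (Leo Tolstoy)

I am grateful to my advisor, Dr. Tian-Sheuan Chang (張添烜), whose patient and earnest guidance made the completion of this thesis possible. My sincere thanks go to my mentors Prof. 黃永達 and Dr. 李宇旼, whose scholarly manner and attitude toward research have quietly shaped my own rigor and persistence in the pursuit of knowledge. I thank Prof. 陳永昌 and Prof. 李鎮宜 for taking time out of their busy schedules to attend my oral defense and for their guidance, which made this thesis more complete, and I thank the assistant, Ms. 李清音, for her warm help and reminders during the defense. I am especially grateful to my mother, Ms. 曾詒翔, for her tireless effort and patient tolerance; her earnest teaching and quiet support are the main driving force behind my pursuit of my ideals. Thank you, Mother. I also thank my relatives and friends for their advice and encouragement, from which I have benefited greatly. Finally, I dedicate this thesis to my most beloved father, Mr. 史君; all the honor and contribution belong to you.

Yen-Chi Shih, Hsinchu, July 2007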

Contents

Chinese Abstract
Abstract
Acknowledgement
Contents
List of Tables
List of Figures
1 Introduction
  1.1 Motivation
  1.2 Overview of Video Coding
    1.2.1 Video Coding Standards
    1.2.2 Hybrid Video Coding Structure
  1.3 Problem Briefs and Thesis Organization
2 MB-skipping Detection
  2.1 Background and Motivation
  2.2 Problem Formulation
  2.3 Proposed Method
    2.3.1 Theoretical Analyses
    2.3.2 Proposed Algorithm
    2.3.3 The Decision Threshold
    2.3.4 Determination in Flexibility
  2.4 Logic Implementation
  2.5 R-D performance and Detection Characteristics
  2.6 Conclusion
3 Low Power Algorithms
  3.1 The Power Dissipation
    3.1.1 Power in CMOS Logic
  3.2 The Motion Estimation
    3.2.1 Block-matching Algorithm
    3.2.2 Motion Estimation in H.264/AVC
    3.2.3 Lagrangian Optimization
  3.3 Dynamic Block-matching
    3.3.1 Operations and Bandwidth
    3.3.2 Proposed Boundary Prediction Method
    3.3.3 Simulation and Comparison
    3.3.4 Algorithm Summary
  3.4 Bit Truncation and Predictions
4 The Architectural
  4.1 Array Processor Design
    4.1.1 Graphic-based Design Methodology
    4.1.2 Array Design using Graphical Approach
    4.1.3 The Proposed Array Processor
  4.2 Absolute Difference Logic
    4.2.1 Arithmetic Equivalence
    4.2.2 Orientated Arithmetic
    4.2.3 Determination of the Reservation
    4.2.4 The Comparisons
  4.3 Reference Buffer Optimization
    4.3.1 The Data-Reuse
    4.3.2 Access Current Minimization
    4.3.3 Actual Power Estimation
  4.4 Shortest Distance Code
    4.4.1 Gray Coding
    4.4.2 The Shortest Distance Code
    4.4.3 SDC Structuring
  4.5 MV Cost
    4.5.1 H/W-oriented Block-matching Method
    4.5.2 Arithmetic Implementation
5 Specification and Implementation
  5.1 Design Specification
  5.2 I/O Specification
  5.3 Timing Specification
  5.4 Architecture Specification
  5.5 Proposed design Flow
  5.6 Chip Specification
6 Performance Assessments
  6.1 Assessment Environment
  6.2 Rate-Distortion Assessment
  6.3 Power Dissipation Assessment
  6.4 The Summary
7 Conclusions
Reference

List of Tables

1.1 Instruction profiling of an H.264/AVC baseline profile encoder at CIF (352 × 288) 30fps, 5 reference frames, search range [−16 : 15], QP=20
2.1 MSE simulation for residuals reconstruction (uncorrelated Gaussian with zero-mean, standard deviation 15)
2.2 Detection characteristics and rate-distortion performance in integer resolution, compared with JM 8.6 (CIF@30fps, Baseline+RDO, QP={36, 42})
2.3 Detection characteristics and rate-distortion performance evaluation in fractional resolution, compared with JM 8.6 (CIF@30fps, Baseline+RDO)
3.1 Chroma block sizes associated with luminance partitions
3.2 Arithmetic Operation and memory access bandwidth comparisons of full search BMA (N = 16, n = 8)
3.3 Rate-distortion performance and complexity reduction, compared to full search of SR = 16 (CIF@30fps, QP = {30, 32, 36, 42})
3.4 Comparison of power reduction/PE (CIF@30fps, 30frames, QP = {36, 42})
3.5 Comparison of rate-distortion (CIF@30fps, 150frames, QP = {30, 32, 36, 42})
3.6 MODE comparison between full-JM 8.6 and basic mode (QP=30, 32, 36, 42)
4.1 The truth table of carry logic
4.2 Coding rate and distortion comparisons of r = 0, 1, 2 with respect to truncation bits = 3 (TBS = 3, n = 5), CIF@30fps 300frames
4.3 The area and average power comparisons (Artisan/TSMC-CL013G, clock rate = 50MHz, TAD = 2ns, and QP = {36, 42})
4.4 The area and average power comparisons with different CIF test sequences (Artisan/TSMC-CL013G, clock rate = 50MHz, TAD = 2ns, and QP = {36, 42})
4.5 Average current list of Artisan n-bit×48words Mux-2 Register file working on 20MHz/30MHz (TSMC-CL013G process)
4.6 High-level Memory Access Power analysis with different architectures (CIF 30fps)
4.7 Power consumption analysis for the proposed memory architecture (CIF 30fps, QP = {36,42})
4.8 Coding rate and distortion comparisons of motion vector cost (CIF@30fps, QP = {30, 32, 36, 42})
4.9 Exp-Golomb code number and codeword structures
5.1 I/O Description
5.2 Functional diagram specification
5.3 Cell area count after P&R (Artisan/TSMC-CL013G-FSG standard cell)
5.4 Core aspects of implemented chip
5.5 Average power consumption (post-layout gate-level using PrimePower)
6.1 General encoding environment
6.2 Specialized encoding environment
6.3 Test sequences characteristics
6.4 Rate-distortion performance and complexity reduction, compared to full search of SR = 16 (CIF@30fps, QP = {30, 32, 36, 42})
6.5 Rate-distortion performance and complexity reduction, compared to full search of SR = 16 (CIF@30fps, QP = {30, 32, 36, 42})
6.6 Power & rate-distortion assessments for the proposed memory architecture (CIF 30fps, QP = {36,42})
6.7 Power & rate-distortion assessments of the proposed implementation (CIF 30fps, QP = {36,42})
6.8 Power consumption comparisons of previous arts

List of Figures

1.1 Illustrations of the spatial correlation (foreman-CIF)
1.2 Illustrations of the temporal correlation (stefan-CIF)
1.3 Advanced video coder block diagram and its coding flow
2.1 Sufficiency and necessity tests: MB-skipping using coded-blocks (QP = 36, CIF@30fps, Baseline+RDO)
2.2 Hypothesis occurrence of SKIP and CODE against max(SAD_4×4)/Q_step (CIF@30fps, Baseline+RDO)
2.3 Flexibility analyses in max_k, block size, pixel-resolution, and SAD versus SSD (CIF@30fps, 300 frames, Baseline+RDO)
2.4 Logic implementation on macroblock-skipping detection
2.5 Corresponding rate-distortion curves
3.1 Partitioning of a MB for motion compensation
3.2 Arithmetic Operation versus memory access bandwidth in full search BMA (N = 16, n = 8)
3.3 Frame INTRAs versus frame coding rate (foreman, CIF@30fps, QP = 36)
3.4 Frame INTRAs versus frame coding rate (stefan, CIF@30fps, QP = 36)
3.5 Illustration of search boundary prediction using neighboring vectors
3.6 Illustration of proposed BMA using dynamic boundary prediction
3.7 Relative Occurrence versus boundary distance, CIF@30fps, QP={36, 42}
3.8 Boundary distance distribution, CIF@30fps, QP={36, 42}
3.9 Log plot of Table 3.5
4.1 Motion Search using BMA: search area versus search window
4.2 A simple DG of computations and data dependencies for 4 × 4 block-size BMA (i, j rotated)
4.3 Mapped signal flow graph with different s, p
4.4 Variable size systolic array processor with ESAD 8 × 8
4.5 An equivalency arithmetic design of absolute difference value
4.6 An embodiment of proposed arithmetic design with 4 bit-length and 1 bit carry reservation, (n, p) = (4, 1)
4.7 Rate-distortion performance comparisons in r = 0, 1, 2 with respect to truncation bits = 3 (TBS = 3, n = 5), CIF@30fps 300frames
4.8 Forward segment and backward segment
4.9 An illustration of permutated codeword distance
4.10 An illustration of SDC structure with k = 4, d_mem = 12
4.11 Coding rate and distortion comparisons of motion vector cost (CIF@30fps, QP = {30, 32, 36, 42})
4.12 Example of motion vector cost calculating, δ = 0
4.13 Proposed MV cost arithmetic structure
5.1 The symbol View
5.2 State #1: Data transferring timing for the case of macroblock-skipping detection
5.3 State #2: Data transferring timing for the case of candidate block matching
5.4 A simplified full timing example, including candidate block matching state
5.5 The architecture/block diagram of the proposed motion computing scheme
5.6 A simplified design and analysis flow for proposed Implementations and Verifications
5.7 OPUS Layout and Floorplanning
6.1 Frame snapshot in the simulation interval
6.2 Rate-distortion curves of this work, JM, and prior full search

Chapter 1
Introduction

H.264 Advanced Video Coding (AVC) is a next-generation video coding standard that was approved by the Joint Video Team (JVT) of ISO/IEC and ITU-T in 2003 [1]. AVC aims to cover applications at all bit rates, for instance video telecommunication, video photography, and broadcast-quality digital television. The framework of AVC is the hybrid motion-compensated structure, which has proven to be the most successful class of video compression over the past two decades. On top of this hybrid scheme, several improved key techniques significantly increase compression efficiency. Notable advances over previous standards are the enhanced SKIP mode, variable block size prediction, and rate-distortion optimization (RDO).

SKIP coding is the simplest temporal redundancy reduction technique and has been in use since the first digital video coding standard, ITU-T Rec. H.120 [2], in the 1980s. The enhanced SKIP in AVC replenishes the depicted frame according to a predicted displacement rather than from the original, undisplaced position. SKIP coding with displaced motion further reduces the difference between the depicted picture and the imaging plane, and thus improves R-D quality. Often, segmenting a coded block and predicting each area individually reduces the amount of information that must be sent as the displaced frame difference (DFD). In AVC, variable block size prediction segments a coded block into at most 41 subdivisions. This flexibility provides varying degrees of DFD in motion-compensated prediction and improves coding efficiency accordingly. Across all permitted coding modes, R-D optimization in AVC uses the Lagrangian method to make the best mode choice and parameter settings, which substantially improves rate-distortion performance. In practice, an encoder with RDO enabled reaches the same frame distortion as one with RDO disabled using 80% or less of the bit-stream size.
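The Lagrangian mode choice mentioned above can be illustrated with a minimal sketch. It assumes the usual cost J = D + λ·R, where D is the distortion and R the rate of a candidate mode. The candidate list, its distortion and rate figures, and the λ values are made-up numbers for illustration only; they are not measurements from this thesis, and the function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ModeCandidate:
    name: str
    distortion: float  # e.g. sum of squared differences of the reconstruction
    rate_bits: float   # bits needed to code the macroblock in this mode


def lagrangian_cost(candidate: ModeCandidate, lam: float) -> float:
    """Return J = D + lambda * R for one candidate mode."""
    return candidate.distortion + lam * candidate.rate_bits


def best_mode(candidates, lam: float) -> ModeCandidate:
    """Pick the candidate with the smallest Lagrangian cost."""
    return min(candidates, key=lambda c: lagrangian_cost(c, lam))


# Illustrative (invented) figures for one macroblock.
candidates = [
    ModeCandidate("SKIP", distortion=900.0, rate_bits=1),
    ModeCandidate("INTER_16x16", distortion=400.0, rate_bits=40),
    ModeCandidate("INTER_8x8", distortion=250.0, rate_bits=110),
    ModeCandidate("INTRA_16x16", distortion=300.0, rate_bits=160),
]

if __name__ == "__main__":
    for lam in (1.0, 5.0, 20.0):
        print(lam, best_mode(candidates, lam).name)
```

With these invented figures, a small λ favors the finely partitioned mode with the lowest distortion, while a large λ (typical of coarse quantization) pushes the choice toward cheap modes such as SKIP.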
With these modern coding techniques, JVT H.264/AVC achieves much better R-D quality and therefore suits video coding on a wide range of consumer devices. However, it is precisely these high-efficiency coding features that raise implementation obstacles in hardware complexity, compression quality, and power dissipation. Table 1.1 summarizes the operating MIPS (million instructions per second) of a baseline profile AVC encoder for CIF 30fps video coding [3]. Clearly, the most computation-hungry functions belong to (variable block size) motion estimation.

Table 1.1: Instruction profiling of an H.264/AVC baseline profile encoder at CIF (352 × 288) 30fps, 5 reference frames, search range [−16 : 15], QP=20

                          Arithmetic       Controlling                Memory Access
Function                 MIPS      %       MIPS      %       MIPS    Mbyte/s      %
Integer-Pel ME       95,491.9   78.3   21,915.1   55.4  116,830.8  365,380.7   77.5
Fractional-Pel ME    21,396.6   17.6   14,093.2   35.6   30,084.9   85,045.7   18.0
Interpolation           558.0    0.5      586.6    1.5      729.7    1,067.6    0.2
Mode Decision           674.6    0.6      431.4    1.1      880.7    2,642.6    0.6
Intra Prediction        538.0    0.4      288.2    0.7      585.8    2,141.8    0.5
VLC                      35.4    0.0       36.8    0.1       44.2      154.9    0.0
T&Q                   3,223.9    2.6    2,178.6    5.5    4,269.0   14,753.4    3.1
Deblocking               29.5    0.0       47.4    0.1       44.2      112.6    0.0
Total               121,948.1  100.0   39,577.3  100.0  153,469.3  471,299.3  100.0

MIPS: Million Instructions per Second; data from Chen et al., TCSVT'06 [3].

Motion estimation exploits temporal redundancy in video compression by estimating the movement of texture relative to the depicted frame. In a video codec, motion estimation significantly affects both coded video quality and coded bit-stream size, and it accounts for over 70% of the complexity and memory access requirements; it is therefore the most crucial component for portable video coding applications. Over the past decades, numerous algorithms and corresponding hardware designs have been proposed to realize real-time motion estimation for portable applications. Unfortunately, some of their design criteria and limitations inherently restrict design flexibility and thus cause a moderate degree of system performance degradation. This motivates us to develop a novel methodology for a sub-mW, low-cost, high-performance ME approach.

1.1 Motivation
Because of its highly complex decision process and heavy arithmetic workload, AVC faces challenges in both computation and data traffic. To meet the stringent low-power and high-performance requirements of the consumer electronics market, there is a clear need for an optimized AVC VLSI implementation.

In recent years, the demand for extremely low-power, high-resolution mobile video coders has grown rapidly. Examples of such commercial devices are cellphone handsets, digital cameras (DC), and digital video camcorders (DV). Designs capable of real-time encoding for portable applications are needed to support the creation of such digital video content. The analysis of an H.264 baseline profile encoder for CIF video in Table 1.1 already shows that H.264/AVC video encoding requires computation on the scale of giga-operations per second (GOPS) and memory access on the scale of thousands of megabytes per second. Supporting encoding under such resource demands implies extremely high computational complexity.

Motion estimation is the focus of our work toward a high-performance H.264/AVC design, since it is the most crucial task in H.264. Because the complexity of the full-search algorithm, which gives excellent quality, is extremely high, numerous hardware design studies have been proposed to reduce the computational complexity of the full-search block-matching algorithm (BMA). However, most of these implementations are built on flawed assumptions and design criteria. The improper limitations inherently restrict design flexibility and cause significant degradation of system performance.

Besides, portable multimedia devices that rely on batteries are becoming more and more popular. Cellphones equipped with a digital camera and capable of transmitting real-time video are a major trend in cellphone development. Other applications, such as the digital video camcorder (DV), tend toward recording higher-resolution movies. Unfortunately, advances in battery technology have not kept pace with the growth of power-hungry video applications. To maintain acceptable operating hours with the same battery capacity, low-power design for AVC becomes a most important issue. These design obstacles and requirements motivate us to explore cost-efficient algorithms with low power and high R-D performance, together with the corresponding VLSI architectures.

1.2 Overview of Video Coding
Motion video data consists essentially of a time-ordered sequence of pictures, and cameras typically generate approximately 24, 25, or 30 frames per second. This results in a large amount of data that demands compression. For example, assume that a video sequence is transmitted at a frame rate of 15 pictures/s, that each picture has the low "QCIF" (quarter common intermediate format) resolution of 176×144 samples, and that each sample is digitally represented with 8 bits. For color pictures, three color component samples are needed per pixel to represent a sufficient color space. Even for this relatively low-fidelity sequence of pictures, the raw source data rate is still more than 9 Mbit/s (176 × 144 samples × 3 components × 8 bits × 15 pictures/s ≈ 9.12 Mbit/s). Low-cost transmission channels, however, often operate at much lower data rates, so the video signal must be compressed. To eliminate data redundancy and to facilitate transmission or storage, many video coding standards have been established in recent years, such as ISO/IEC MPEG-1/2/4, CCITT H.261, ITU-T H.263, and the ITU-T/ISO JVT H.264/AVC.

Statistical analysis indicates that video scenes have strong correlation both between successive frames and within each picture. The strong correlation between consecutive frames is called temporal correlation (or temporal redundancy), and the strong correlation within a picture is called spatial correlation (or spatial redundancy). Based on these two classes of correlation, common video coding technology develops compression separately for each processing domain. In a video signal, data reduction is achieved mainly by eliminating spatial correlation, temporal correlation, and inter-symbol redundancy. The subsections below briefly describe the fundamental concepts of spatial, temporal, and inter-symbol correlation in video coding.

Spatial Correlation

Spatial correlation denotes the relationship between adjacent samples (image pixels) in the same frame. Figure 1.1 shows an enlarged video frame as an example of this spatial relationship. The value of a sample is usually similar to those of its neighboring pixels; in other words, the pixels at many locations can be predicted from their surroundings. The total entropy for data transmission can therefore be greatly reduced by suitable prediction methods. One well-known technique is DPCM (differential pulse code modulation), which is widely used in many communication applications.

Figure 1.1: Illustrations of the spatial correlation (foreman-CIF).

Temporal Correlation

Temporal redundancy denotes the strong relationship between successive frames.
To generate moving pictures, sampling a real-world scene captures not only the spatial elements but also the scene over a period of time, and we define the rate of this temporal sampling as the frame rate. The higher the frame rate, the smoother object motion appears. For the human visual system, the frame rate should be at least 15 frames per second to give a continuous, smooth moving scene. Consequently, the difference between successive frames is very small when the frame rate is high.

Figure 1.2: Illustrations of the temporal correlation (stefan-CIF)

Figure 1.2 shows snapshots of the CIF test sequence "stefan", taken 5 frames apart. It shows that the texture difference between consecutive pictures is relatively small.

Statistical Correlation

An information-carrying signal always contains redundancy, which means that a more efficient codeword representation of the information exists. For example, characters within a text message occur with varying frequencies: in English, the letters E, T and A occur more often than the letters Q, Z and X. This makes it possible to represent frequently occurring characters with shorter codewords and infrequently occurring characters with longer codewords. A coding method that uses variable-length codewords, such as the Huffman code and the Golomb code, is thus called variable length coding (VLC). Besides VLC, another technique, arithmetic coding, further exploits statistical correlation and approaches the entropy bound more closely; in other words, arithmetic coding compresses information more compactly. Both VLC and arithmetic coding are frequently referred to as entropy coding in video coding terminology.

1.2.1 Video Coding Standards

Video compression techniques have played an important role in the multimedia communication field. After several decades of development, the ISO and ITU organizations have established the MPEG series and H.26× series of video coding standards for video compression. Different video coding standards serve the demands of different typical applications. For instance, H.261 and H.263 were presented by ITU-T (the International Telecommunications Union, Telecommunications Standardization Sector, then called the CCITT) for low-rate video phone and video conferencing applications, and MPEG-1/2/4 were presented by ISO (International Organization for Standardization) for high bit-rate video entertainment applications. MPEG-4 Part 10, H.264/AVC (Advanced Video Coding), was introduced by the Joint Video Team (JVT, the joint team of ISO/IEC and ITU-T); it aims to serve applications at different bit rates and to suit different Internet transmission environments.

1.2.2 Hybrid Video Coding Structure

All the aforementioned video coding standards are based on the hybrid coding architecture. The name hybrid comes from the picture reconstruction being a combination of motion-handling and picture-coding techniques, and the term codec refers to both the coder and the decoder of a video compression system. Figure 1.3 shows a coder using the hybrid structure.

Figure 1.3: Advanced video coder block diagram and its coding flow (blocks shown: split into macroblocks, coder control/mode decision, transform and quantization, scaling and inverse transform, entropy coder, intra-prediction, motion compensation, motion estimation)

The design and operation of such a hybrid scheme involve the optimization of a number of decisions, including
1. Properly replacing each coded block (e.g., MB) of the picture with completely INTRA-frame content,

(23) 2. If not replacing an MB with new INTRA content,
   (a) Properly segmenting each MB into sub-areas,
   (b) Performing sub-area motion estimation; i.e., searching for the spatial displacement to use for INTER-picture predictive coding,
   (c) Properly selecting the INTER-picture coding method (SKIP or prediction mode) for the coded block according to the R-D quality of each coding method,
   (d) Performing DFD coding if a prediction mode is applied; i.e., coding the INTER residuals as a refinement of the INTER prediction,
3. Performing SKIP coding if SKIP is applied; i.e., repeating the depicted frame as the replacement content on the indicated block, and
4. If replacing an area with new INTRA content, coding the INTRA residuals as the replacement content.

INTRA-frame coding uses an image-coding syntax similar to that of JPEG still-image coding. Every segmented block is transformed by a discrete cosine transform (DCT), and the DCT coefficients are then quantized and transmitted using entropy coding, such as VLC or arithmetic coding. This coding scheme is referred to as INTRA-frame coding, since the picture is coded without referring to other pictures in the video sequence. However, improved compression performance can be attained by taking advantage of the large amount of temporal redundancy in video content. Techniques that exploit this temporal correlation are referred to as INTER-frame coding. Usually, much of the depicted scene essentially repeats in picture after picture without any significant change. The video can therefore be represented more efficiently by coding only the changes in the video content, rather than coding each entire picture repeatedly. The simplest INTER-frame coding method is the CR method, which just repeats the predicted picture on the indicated area and is also referred to as SKIP mode.
However, INTER coding without residual compensation has a significant shortcoming: it cannot refine an approximation of the picture content. Often the content of an area of a prior picture is a good approximation of the new picture and needs only a minor alteration to become a better resemblance. Hence, adding motion compensation, in which a refining frame difference approximation can be added, further improves compression performance. Most changes in video content are typically due to the motion of objects in the depicted scene relative to the imaging plane. A small amount of motion may accordingly result in a large difference in the values of the pixels in a picture area. In general, displacing an area of the predicted picture by a few pixels in spatial location can significantly reduce the amount of information that needs to be sent as DFD or as a frame difference approximation. This use of spatial displacement to form a reconstruction is known as

(24) motion compensation (MC), and the encoder’s search for the best spatial displacement is known as motion estimation (ME). The coding of the resulting difference, or the residuals, for the frame refinement of the MCP is referred to as displaced frame difference (DFD) coding. Besides inter/intra prediction, transform coding maps the residuals into the frequency domain, because the sensitivity of the human visual system depends on frequency. Quantization of the transformed coefficients then further discards insignificant information. An entropy coder, which eliminates statistical redundancy, is then adopted to compress the prediction residuals into the bitstream and to count the bit rate required after prediction and transform coding. In AVC, the coder optimizes multi-block-size prediction and hence significantly improves R-D quality. The coder control/mode decision block, which decides the best prediction based on the coded information, therefore becomes the key component of a coder. An optimized AVC coder can largely increase bit-rate efficiency and mitigate channel overhead. In summary, a video coder using the hybrid structure, which combines INTRA-frame and INTER-frame coding techniques, efficiently eliminates spatial and temporal redundancy and thus compresses the video sequence with slight objective and subjective quality degradation.

1.3. Problem Briefs and Thesis Organization

The intensive arithmetic of motion estimation dominates the computation and memory access complexity of a video codec. As the processing capability increases, it becomes the most crucial obstacle to realizing a real-time portable coder. To demonstrate an exceptional motion estimation for portable H.264/AVC applications, the thesis organization of the proposed design methodology is summarized as follows.
Chapter 2 To effectively eliminate complexity, after an introduction to advanced video coding, Chapter 2 presents a macroblock-skipping method using a likelihood ratio test, which exploits the computation dependence and relieves the complexity burden. Starting from the statistical characteristics of Lagrangian optimization, a false-rate-constrained macroblock-skipping detection is proposed that maximizes the probability of detection through a graph-based approach. Based on this exploration, the proposed method detects, prior to motion estimation, whether the current macroblock should be SKIP coded, which efficiently eliminates the computational redundancy and thus saves power.

Chapter 3 Motion estimation with exhaustive search provides superior rate-distortion performance; however, it demands a huge amount of system complexity and power. Motion estimation is usually found to consume over half the power in a

(25) video codec. To relieve the power demand, most implementations strictly exploit hardware characteristics such as modularity and regularity. These techniques are primarily concerned with the design metrics of data-reuse efficiency, memory bandwidth, and arithmetic utilization, and are thus called hardware-oriented algorithms. However, inapposite tradeoffs on these metrics lead to an insufficient power assessment as well as poor prediction fidelity. To resolve the metric insufficiency problem at a high level of design abstraction, as well as the prediction reliability issue in previous studies, a robust fast scheme with dynamic boundary decision is presented that significantly improves the R-D quality while substantially preserving the computational burden and the corresponding power dissipation. In addition, two widely used power-efficiency techniques, data depth truncation and prediction simplification, will be examined in Chapter 3.

Chapter 4 Based on the system/algorithm development in Chapter 3, Chapter 4 moves the design abstraction to power-efficient architectures. To deal with the most arithmetic-intensive SAD computation, this chapter first theoretically investigates the array processor architecture using the well-known graph-based design methodology. A specialized arithmetic is then derived which significantly reduces the switching power and area cost of the absolute difference processing element (PE). In section 4.3, we further present a novel memory structuring methodology using high-level technology-dependent power analysis, which minimizes the access power due to internal memory buffering. To further minimize the power dissipation from address switching, we mathematically develop the shortest distance bus coding scheme in section 4.4.
At the end of this chapter, a highly cost-efficient logic structure is presented to facilitate motion cost computation, which effectively improves MCP fidelity and the rate-distortion metric in low-rate motion estimation.

Chapter 5 Chapter 5 details the design specification and implementation flow, including design characteristics, I/O, timing, and architecture specifications as well as implementation procedures and physical chip specifications.

Chapter 6 Experimental assessments of rate-distortion performance and power are presented in this chapter. Two baselines, JM full search and a traditional hardware-oriented full search, were selected to assess the system performance in R-D and power. Simulation shows that a low-power design based on our studies can use half the coding bitrate with better reconstructed quality and saves more than 90% of dynamic power compared with the most advanced prior studies.

Chapter 7 At the end of the thesis, concluding remarks are given in Chapter 7.

(26) Chapter 2 Macroblock-skipping Detection

To effectively eliminate codec complexity prior to the arithmetic-intensive processes, this chapter explores a low-cost macroblock (MB) skipping detection method using a likelihood ratio test (LRT). Starting from the statistical characteristics of Lagrangian rate-distortion optimization (RDO), a false-rate-constrained MB-skipping detection is proposed that maximizes the probability of detection through a graph-based approach. Based on the graphic exploration, the proposed method efficiently eliminates the computational redundancy without sacrificing rate-distortion performance. Experiments show that a probability of detection of 17%–87%, depending on motion activity, is achieved at a false rate of 1%. This chapter is organized into six parts. Section 2.1 briefs the motivations, and section 2.2 introduces related works on macroblock-skipping detection. A novel graph-based LRT approach is then presented in section 2.3, followed by its logic implementation in section 2.4. Performance evaluation is addressed in section 2.5. Finally, section 2.6 concludes this chapter.

2.1. Background and Motivation

The significant rate-distortion (R-D) improvement of ITU H.264/AVC [4] is achieved at the expense of an advanced prediction scheme. In portable applications, image data are mostly encoded with SKIP coding due to limited channel capacity. Since deciding the optimal prediction of each macroblock requires a series of transformations and reconstructions, the codec wastes considerable computational resources and power on encoding these macroblocks. Hence, an AVC coder able to detect precisely whether an MB should be skipped can considerably reduce computational redundancy and power. Several methods have been proposed to detect macroblock skipping in advance. A method based on the magnitude of the motion vector (MV) and the sum of absolute differences (SAD) was presented in [5,6]. Two early detection methods of

(27) estimation of Lagrangian cost prior to motion estimation were stated in [7,8]. Another similar cost estimation method was proposed based on maximum a posteriori probability (MAP) hypothesis testing [9]. However, these detection methods suffer from either restricted detection sufficiency or the infeasibility of probabilistic characterization. Since coding statistics may vary widely, generalizing a probabilistic model degrades the fidelity of the implementation. Besides, the restricted sufficiency lowers the probability of detection and yields only ordinary complexity reduction. In fact, the SKIP displacement utilizes spatial motion correlation and depends only on neighboring vectors. This implies that the residuals can be obtained simultaneously with vector derivation, and a coder may decide whether the macroblock is skipped using the SKIP coding residuals. Following this concept, we first analyze some properties and criteria of macroblock-skipping detection based on the rate-distortion optimization (RDO) framework. To maximize the detection probability subject to an acceptable false rate, a likelihood ratio test (LRT) is then proposed through a graph approach. An adaptive threshold is also presented to constrain the false rate in different encoding scenarios according to the “CODE-detected” characteristic. The proposed method avoids the difficulty of probabilistic parameterization. Because the direct test on SKIP coding increases the probability of detection, the computation burden can be efficiently saved. A coder with the proposed algorithm conditionally eliminates encoding complexity according to motion activity, without excessively degrading R-D performance. For hardware implementation, we have mapped the proposed algorithm to a low-cost accumulator-based structure, which is fabricated in the TSMC-CL013G process with 1K gates/7.5µW.
Our low-complexity implementation is easily incorporated into a coder to efficiently reduce computational power during frame encoding.

2.2. Problem Formulation and SKIP Detections

Variable-size prediction has greatly advanced compression efficiency. The approved predictions of an H.264 INTER P-frame are listed as follows,

$$S_p = \{\text{INTRAs},\ \text{SKIP},\ 16\times16,\ 16\times8,\ 8\times16,\ \text{P}8\times8\} \tag{2.1}$$

and the Lagrangian rate-distortion cost in the test model JM (Joint Model) [10] is described as Eq. (2.2) (formally named the Lagrangian cost; refer to 3.2.3, p. 31),

$$J(s, c, \text{MODE} \mid \lambda) = \text{SSD}(s, c, \text{MODE}) + \lambda \cdot R(s, c, \text{MODE}) \tag{2.2}$$

(28) where SSD is the prediction distortion, measured as the sum of squared differences between the encoded MB s and its reconstruction c associated with the prediction MODE. The Lagrange multiplier λ, related to the quantization parameter (QP) considering all MBs, is chosen for the mode decision of MB prediction. The function R gives the number of bits required by the present mode. The mode-selection algorithm of Eq. (2.3) optimizes rate-distortion by assuming that all macroblocks are encoded independently, and is therefore referred to as constrained RDO [11].

$$\text{MODE}^* = \arg\min_{\text{MODE}} J(s, c, \text{MODE} \mid \lambda) \tag{2.3}$$

In portable applications, such as videophone and video conferencing, the encoded bitrate is strictly constrained to meet the channel capacity. As a result, SKIP is selected much more often than the other modes. Since the rate associated with skipped coding is effectively zero, the choice of SKIP implies

$$\text{SSD}(s, p, \text{SKIP}) \le \text{SSD}(s, c, \text{CODE}) \tag{2.4}$$

where p denotes the prediction of the replaced MB, and CODE denotes any available mode other than SKIP, whose residuals remain to refine the frame approximation. For convenience, we rewrite the SSD of SKIP coding as Eq. (2.5),

$$\text{SSD}(s, p, \text{SKIP}) = \|r_Y\|^2 + \|r_U\|^2 + \|r_V\|^2 \tag{2.5}$$

and the SSD of the other predictions as follows.

$$\text{SSD}(s, c, \text{CODE}) = \|r_Y - r'_Y\|^2 + \|r_U - r'_U\|^2 + \|r_V - r'_V\|^2 \tag{2.6}$$

The SKIP coding residuals, r = s − p, are the differences between the MB and its prediction; the suffixes Y, U, and V denote luminance and the two chrominance components respectively; and the operator ‖·‖, a generalized norm of an N-by-M dimensional vector, is defined as

$$\|A\| = \sqrt{\operatorname{tr}(A^t A)} \tag{2.7}$$

where A is an N-by-M dimensional vector, tr(·) indicates the trace operation, and t denotes transpose. Eq. (2.6) gives the squared error of the coded-MB reconstruction, where the vector r′ is the reconstruction associated with the residual r. In fact, a zero-coefficient block is followed by a zero reconstruction.
Accordingly, a further case for macroblock skipping arises when the R-D cost of the 16×16 mode is smaller than that of the other non-SKIP modes (i.e., CODE), the estimated motion equals the motion predicted for SKIP, and the transformed prediction residuals are all zero. The methods in [5,6], as well as the fast mode decision algorithm in the reference JM [12], are based on this sufficient condition. However, the strict detection condition constrains efficiency and degrades the necessity. The algorithms in [13,14] are intrinsically identical. In addition, the requirement of motion estimation undermines their feasibility as well.
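To make the mode decision of Eq. (2.2) and Eq. (2.3) concrete, the following is a minimal Python sketch. It is an illustration under stated assumptions, not the JM implementation: the function names, the flattened four-pixel "macroblock", and the rate values are all hypothetical, and the reconstructions and bit counts are assumed to be given rather than computed by a real transform and entropy coder.

```python
def ssd(a, b):
    """Sum of squared differences between two equally sized pixel sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def lagrangian_cost(s, c, rate_bits, lam):
    """J = SSD(s, c) + lambda * R, following the form of Eq. (2.2)."""
    return ssd(s, c) + lam * rate_bits


def best_mode(s, candidates, lam):
    """Return the mode with minimum J, following the form of Eq. (2.3).

    `candidates` maps a mode name to (reconstruction, rate_bits).
    """
    return min(candidates,
               key=lambda mode: lagrangian_cost(s, *candidates[mode], lam))


# Toy block of four pixels (hypothetical values).
s = [10, 12, 11, 9]
candidates = {
    "SKIP": ([10, 11, 11, 10], 0),   # prediction p, zero rate
    "16x16": ([10, 12, 11, 9], 8),   # perfect reconstruction, 8 bits
}
print(best_mode(s, candidates, lam=0.1))  # small lambda favours low distortion
print(best_mode(s, candidates, lam=1.0))  # large lambda favours low rate
```

With a small multiplier the coded mode wins because its distortion is lower; with a large multiplier the zero-rate SKIP wins, which mirrors the observation that SKIP dominates at low bit rates.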

(29) In a conventional video coder, since a transformation such as the Discrete Cosine Transform (DCT) is considerably energy conserving, we may assume that the mean-squared error (MSE) after reconstruction of an all-zero coefficient block is smaller than that of a non-zero block, as follows,

$$E\{\|X\|^2 \mid T(X) = 0\} \le E\{\|X - X'\|^2 \mid T(X) \ne 0\} \tag{2.8}$$

where T denotes the combined process of transformation and quantization in a coder. Table 2.1 compares the simulated reconstruction MSE of an H.264 baseline coder between all-zero coefficient blocks (All-zero) and non-all-zero blocks (Non-zero) for different quantization step sizes (Qstep). The stimuli are uncorrelated 4 × 4 Gaussian random vectors with zero mean and standard deviation (σ) 15. Although the residual energy is relatively non-negligible at smaller step sizes, the reconstructed MSE of all-zero coefficient blocks reaches saturation at QP = 30.

Table 2.1: MSE simulation for residuals reconstruction (uncorrelated Gaussian with zero mean, standard deviation 15)

QP | Qstep | MSE, All-zero | MSE, Non-zero | Relative (%)
30 | 20    | 711.3         | 741.3         | 95.94%
36 | 40    | 1844.7        | 2734.9        | 67.45%
42 | 80    | 3064.0        | 5843.8        | 52.43%

By the assumption of Eq. (2.8), the MSE relationship between SKIP and CODE can be easily derived as

$$E\{\|s - p\|^2 \mid T(r_n) = 0,\ \forall n\} \le E\{\|s - c\|^2 \mid \text{CODE}\} \tag{2.9}$$

which implies that the expected R-D cost satisfies the inequality

$$E\{J(s, p, \text{SKIP} \mid \lambda) \mid T(r_n) = 0,\ \forall n\} \le E\{J(s, c, \text{MODE} \mid \lambda) \mid \text{CODE}\} \tag{2.10}$$

The inequality Eq. (2.10) has been applied in many fast inter mode decision algorithms, such as the SKIP prediction in the reference JM. Two experiments on the sufficiency and necessity of the all-zero block test are shown in Figure 2.1. Figure 2.1(a) shows the average predictions in a coded frame of the sequence “hall monitor”, where “SKIP” indicates the average number of SKIP MBs present in a frame, “INTRA” denotes prediction using {INTRA}, and “INTERMVs” covers the rest of the modes.
“CBs|SKIP” presents the necessity of SKIP detection in terms of the conditional probability of CBs given that the encoded MB is skipped. Note that a block is called a CB (coded block) if its partitions have non-all-zero coefficients. The necessity test of another sequence, foreman, is shown in Figure 2.1(b). Although the

(30) necessity of the all-zero block test in foreman is relatively obscure due to the variety of residual energy, the all-zero block test or, more generally, the coded block test is effective enough for SKIP detection. Nevertheless, all-zero block testing is of limited use because it requires and depends on transformation and quantization. To avoid transform coding, i.e., the integrated DCT and quantization, all-zero block detection algorithms such as [15,16] may be used to detect early whether the residuals are quantized to all zero. These indirect detection methods, however, restrict the necessity of detection. A severe degradation of the detection probability may be introduced, and the complexity burden will not be remarkably relieved.

[Figure 2.1: Sufficiency and necessity tests, MB-skipping using coded blocks (QP = 36, CIF@30fps, Baseline+RDO). Panels: (a) test sequence hall monitor; (b) test sequence foreman. Series: SKIPs, INTRAs, INTERMVs, CBs|SKIP; horizontal axis: coded blocks, N = 0 to N = 4.]

2.3. Proposed Detection Method

2.3.1. Theoretical Analyses

While an indirect approach restricts the necessity of detection, a direct method, such as hypothesis testing, may considerably extend the probability of detection. In an LRT problem, given a set of observations, a decision has to be made regarding the source of the observations. A general form of an LRT is as follows,

$$\Lambda(z) \underset{H_0}{\overset{H_1}{\gtrless}} \eta \tag{2.11}$$

where Λ(z) is called the likelihood ratio function, related to the observations z from the observation space {Z}. The test consists of comparing the ratio with a threshold and is therefore referred to as a likelihood ratio test.
In general, we would like to make the false rate, P_F, as small as possible and the probability of detection, P_D, as large as possible, where P_F

(31) and P_D are defined as Eq. (2.12) and Eq. (2.13), respectively.

$$P_F = \Pr\{\widehat{\text{MODE}} = \text{SKIP} \wedge \text{MODE}^* \in \{\text{CODE}\}\} \tag{2.12}$$

$$P_D = \Pr\{\widehat{\text{MODE}} = \text{SKIP} \mid \text{MODE}^* = \text{SKIP}\} \tag{2.13}$$

[Figure 2.2: Hypothesis occurrence of SKIP and CODE against max(SAD4×4)/Qstep (CIF@30fps, Baseline+RDO). Panels: (a) test sequence mother daughter (SKIP and CODE at QP = 36 and QP = 42); (b) test sequence stefan (SKIP and CODE at QP = 36). Vertical axis: blocks.]

However, these are usually conflicting objectives. Rather than trading one against the other, we target minimizing the computational redundancy subject to an acceptable drop in encoding quality. A false rate α of P_F is thus predetermined while the probability of detection P_D is simultaneously maximized, which reads as follows,

$$\max P_D \quad \text{subject to} \quad P_F = \alpha \tag{2.14}$$

The optimized LRT based on Eq. (2.14), maximizing P_D subject to a given value of P_F, can be derived straightforwardly as

$$\Lambda(z) = \frac{f_{z|\text{SKIP}}}{f_{z|\text{CODE}}} \underset{\text{CODE}}{\overset{\text{SKIP}}{\gtrless}} \lambda \cdot P_{\text{CODE}} \tag{2.15}$$

where the threshold of the test, λ · P_CODE, is chosen to satisfy the constraint on P_F. We thus have

$$\alpha = \int_{Z_{\text{SKIP}}} f_{Z|\text{CODE}}\, dZ = \int_{\Lambda_{\text{SKIP}}} f_{\Lambda|\text{CODE}}\, d\Lambda \tag{2.16}$$

where Λ_SKIP, the decision region of the LRT, locates where SKIP is present.

2.3.2. Proposed Algorithm

Image statistics may vary significantly even within the same sequence. Hence, characterizing the probability to resolve the optimization problem of Eq. (2.16) is mathematically demanding. Instead of working with the difficulty of modeling the density functions, a graph approach operating on the receiver operating characteristic of the likelihood ratio candidates is presented. This approach is well known as the Most Powerful (MP) test [17].
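The constrained problem of Eq. (2.14) can also be illustrated empirically, without any density model. The following minimal Python sketch assumes a set of labelled samples (ratio value, true mode) is available from a training encode, and that an MB is declared SKIP when its ratio does not exceed the threshold. The function names and the sample data are hypothetical, and the equality constraint P_F = α is relaxed to P_F ≤ α because empirical rates are discrete.

```python
def roc_point(samples, eta):
    """Empirical (P_F, P_D) at threshold eta.

    samples: list of (ratio, true_mode), true_mode in {"SKIP", "CODE"}.
    P_F is the joint rate of Eq. (2.12); P_D is the conditional rate of Eq. (2.13).
    """
    n = len(samples)
    n_skip = sum(1 for _, m in samples if m == "SKIP")
    false_alarms = sum(1 for r, m in samples if r <= eta and m == "CODE")
    hits = sum(1 for r, m in samples if r <= eta and m == "SKIP")
    pf = false_alarms / n
    pd = hits / n_skip if n_skip else 0.0
    return pf, pd


def pick_threshold(samples, alpha):
    """Scan candidate thresholds and keep the best P_D with P_F <= alpha."""
    best = None
    for eta in sorted({r for r, _ in samples}):
        pf, pd = roc_point(samples, eta)
        if pf <= alpha and (best is None or pd > best[2]):
            best = (eta, pf, pd)
    return best  # (eta, P_F, P_D), or None if no threshold qualifies


# Hypothetical labelled ratios from ten macroblocks.
samples = [(1, "SKIP"), (2, "SKIP"), (3, "SKIP"), (4, "CODE"), (5, "SKIP"),
           (6, "CODE"), (7, "CODE"), (8, "CODE"), (9, "CODE"), (10, "CODE")]
print(pick_threshold(samples, alpha=0.1))
```

Plotting `roc_point` over all thresholds gives the ROC curve on which the graph approach operates.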

(32) Figure 2.2 exhibits the hypothesis occurrence of SKIP and CODE within 100 encoded frames of the CIF sequences “mother daughter” and “stefan”. The histograms show the distribution of encoded blocks for each hypothesis over the normalized SAD value of the 4 × 4 partitioned residuals. In both cases, it is clearly hard to characterize the hypothesis occurrence mathematically, for example by curve fitting.

To bypass this mathematical difficulty, a graph approach based on the receiver operating characteristic (ROC) is proposed in this thesis. To resolve the likelihood ratio test of Eq. (2.11), there are several reasonable possibilities for the observations, z, and the likelihood ratio, Λ. We specify the ratio function Eq. (2.17) as the candidate for the following reasons:

$$\Lambda(r) = \max\nolimits_k \Big\{ \sum_{i,j} \big| r_{Y,n}^{ij} \big| \Big\} \tag{2.17}$$

1. Due to chroma subsampling, the luma residuals largely determine the Lagrangian R-D cost.
2. The SKIP displacement is predicted from the neighboring MBs alone, i.e., independently of the motion estimation of the current MB.
3. All-zero coefficients can be detected in advance by measuring the SAD of the SKIP residuals.
4. Block partitioning and max_k SAD improve the flexibility of the detection necessity.
5. The cost in SAD form is relatively simple and is easily incorporated into a hardware implementation.

In Eq. (2.17), n ∈ {0, 1, . . . , 256/N² − 1}, and the partitioning size N ∈ {4, 8, 16}. max_k takes the k-th ranked SAD in descending order. The resolution of the motion vector is evaluated as well, since the magnitude of the prediction error r largely depends on the accuracy of the replaced displacement. Following the ROC experiments in section 2.3.4, the proposed SKIP detection LRT becomes

$$\max \Big\{ \sum_{i,j} \big| r_{Y,n}^{ij} \big| \Big\} \underset{\text{SKIP}}{\overset{\text{CODE}}{\gtrless}} \frac{\kappa \cdot Q_{step}}{2} \tag{2.18}$$

with each partitioned block of size 8 × 8. The decision parameter, κ, adapts according to a period of encoding statistics.

2.3.3. The Decision Threshold
The appropriate decision threshold can be approached dynamically according to a period of encoding statistics. Unlike in the general hypothesis problem, the true hypothesis can be evaluated whenever “CODE” is detected. CODE is the hypothesis opposite to SKIP and covers all predictions other than SKIP. When SKIP is detected, the coder promptly terminates the encoding process and replaces the MB by SKIP coding; otherwise, detection fails for the current MB, and the optimal prediction is made according to the Lagrangian

(33) cost calculation. When a series of MBs are detected as CODE but the true hypothesis is SKIP (a miss) over a period of time, the likelihood ratio test is evidently pessimistic, and the threshold should be increased. Conversely, since the ratio value is spatially correlated, once the ratio of a CODE MB falls in the alarm region, between η and η plus the alarm metric, the threshold is likely optimistic. The decision region should then be restricted conservatively to keep the false rate within an acceptable value. Following this discussion, the algorithm is summarized in the following pseudo-code:

    // Inter-frame initialization
    set trials; set TRIALS; set kappa; set alarm_metric;
    // Inter-frame coding
    while (Inter-frame) {
        threshold = kappa * Qstep;
        SKIP_DETECTION();
        MB_ENCODING();
        if (SKIP detected) {
            if (SKIP present) {
                if (ratio > threshold + alarm_metric) {   // beyond the alarm
                    trials++;
                    if (trials == TRIALS) {               // full trials
                        trials = 0;                       // reset the counter, not the bound
                        kappa++;
                    }
                }
            }
        } else {
            trials = 0;
            kappa--;
        }
        GET_NEXT_MB();
    }  // finish inter-frame coding

The decision parameter is usually optimistic and has to be restricted when the difference between the ratio and the threshold is lower than the alarm metric. Otherwise, the decision parameter is increased whenever the accumulated number of passed SKIP tests, “trials”, reaches the trial bound “TRIALS”. The alarm metric observes the necessity and sufficiency of the decision threshold and adapts the threshold accordingly, in relation to the “CODE-detected” statistics. Based on this self-correcting mechanism, the proposed LRT provides an efficient and effective MB-skipping detection regardless of probabilistic model characterization. The partial graph-based investigations of the flexible LRT parameters are demonstrated in the following.
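As a side note to the pseudo-code above, the adaptation described in the prose can be sketched in runnable Python. This is one possible reading of the text, not the author's exact procedure: it assumes the update is applied only to MBs that the test classified as CODE (since only then is the true mode known), that a miss increments the trial counter, and that a true CODE MB inside the alarm region decrements κ. The class name, the method signature, and the unit step of κ are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class ThresholdAdapter:
    kappa: int        # decision parameter
    trial_bound: int  # "TRIALS" in the pseudo-code
    trials: int = 0   # misses counted so far

    def update(self, ratio, eta, alarm_metric, true_mode):
        """Adapt kappa after an MB that the test classified as CODE.

        ratio: likelihood ratio value of the MB (ratio > eta here)
        eta: current threshold, derived by the caller from kappa and Qstep
        true_mode: "SKIP" or "CODE", known after the full mode decision
        """
        if true_mode == "SKIP":
            # Miss: the test was too pessimistic; relax after enough misses.
            self.trials += 1
            if self.trials == self.trial_bound:
                self.trials = 0
                self.kappa += 1
        elif ratio <= eta + alarm_metric:
            # True CODE MB inside the alarm region: threshold too optimistic.
            self.trials = 0
            self.kappa -= 1


adapter = ThresholdAdapter(kappa=4, trial_bound=2)
adapter.update(ratio=90, eta=80, alarm_metric=20, true_mode="CODE")   # tighten
adapter.update(ratio=100, eta=60, alarm_metric=20, true_mode="SKIP")  # first miss
adapter.update(ratio=100, eta=60, alarm_metric=20, true_mode="SKIP")  # second miss
print(adapter.kappa, adapter.trials)
```

Keeping the threshold computation outside the class leaves open whether η is κ·Qstep, as in the pseudo-code, or κ·Qstep/2, as in Eq. (2.18).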

(34) 2.3.4. Determination in Flexibility

[Figure 2.3: Flexibility analyses in max_k, block size, pixel resolution, and SAD versus SSD (CIF@30fps, 300 frames, Baseline+RDO). ROC curves of probability of detection versus false rate (0.001 to 0.1). Panels: (a) max_k, mother daughter and (b) max_k, akiyo (curves: max0, max1, max2, 4-by-4); (c) MV resolution, foreman and (d) MV resolution, stefan (curves: SP and FP, 8-by-8 and 4-by-4); (e) MV resolution, foreman and (f) MV resolution, stefan (curves: SAD and SSD, 8-by-8 and 4-by-4).]

The likelihood ratio was assessed in terms of max_k, block size, displacement resolution, and distortion form, using the ROC graph approach.

(35) max_k Figure 2.3(a) and 2.3(b) show the partial ROC comparisons for max_k, with each partition of size 4 × 4 and sub-pel SAD. In most cases, the maximum SAD value offers better operating characteristics than the other candidates, especially in static sequences. The experiments indicate that SAD measurement at 4 × 4 size is sufficient to detect SKIP coding.

Block size and vector resolution The valid candidates for block size and pixel resolution are {4 × 4, 8 × 8, 16 × 16} and full-pel or quarter-pel, respectively. Note that the comparisons exclude block size 16 × 16 due to its poor detection efficiency. The ROCs are compared in Figure 2.3(c) and 2.3(d). As expected, the prediction error is largely determined by the displacement accuracy; fractional resolution is more efficient than integer resolution. In the block size comparisons, however, the effect is relatively obscure.

SAD versus SSD Contrary to the expectation that more complex computation buys better performance, the detection performance may degrade with the more complex squared calculation, as shown in Figure 2.3(e) and 2.3(f).

2.4. Logic Implementation

Translating the ratio test of Eq. (2.18) into a logic structure is quite straightforward. For a low-cost implementation, we utilize an accumulator-based structure to avoid the multiplication in the decision threshold, since κ only increments or decrements except at inter-frame initialization. A simple comparator-based logic for computing the decision threshold is shown in Figure 2.4(a). For the first encoded MB in a P-frame, ∆κ is set to the initial value, and η accumulates until κ equals the target value. η′ is defined as η plus the alarm metric, which is set to Qstep, for the self-correction of the ratio test. Figure 2.4(b) shows the proposed equivalent logic structure.
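In software terms, the ratio test of Eq. (2.18) that this logic implements can be sketched as follows. This is a minimal Python illustration under assumptions: the macroblock and its SKIP prediction are given as 16 × 16 lists of luma rows at integer resolution, equality with the threshold is treated as SKIP, and the function names are hypothetical. It computes all sub-block SADs at once rather than terminating early as the hardware does.

```python
def sub_block_sads(cur, ref, n=8):
    """SAD of each n-by-n sub-block of a 16x16 luma macroblock (lists of rows)."""
    sads = []
    for by in range(0, 16, n):
        for bx in range(0, 16, n):
            sads.append(sum(abs(cur[y][x] - ref[y][x])
                            for y in range(by, by + n)
                            for x in range(bx, bx + n)))
    return sads


def is_skip(cur, ref, kappa, qstep, n=8):
    """Declare SKIP when the largest sub-block SAD does not exceed kappa*Qstep/2."""
    return max(sub_block_sads(cur, ref, n)) <= kappa * qstep / 2


# Hypothetical data: the SKIP prediction differs from the current MB by 1 everywhere.
cur = [[100] * 16 for _ in range(16)]
ref = [[99] * 16 for _ in range(16)]
print(sub_block_sads(cur, ref))               # four 8x8 SADs
print(is_skip(cur, ref, kappa=1, qstep=40))   # threshold 20
print(is_skip(cur, ref, kappa=4, qstep=40))   # threshold 80
```

Because the test takes the maximum over sub-blocks, a hardware version can stop as soon as any running sub-block sum exceeds the threshold.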
The accumulator sums the absolute differences (ADs) of two pixels at a time between the current block and the reference data, over four sub-blocks of 8×8 pixels each. Once the current Λ exceeds η′, the test fails and all logic computation is terminated until the next inter-frame MB. Note that, for hardware simplification, the logic shown in Figure 2.4 works at integer resolution, which slightly degrades the detection probability.

2.5. R-D performance and Detection Characteristics

This section presents the graph approach of the LRT based on Eq. (2.18) in the proposed macroblock-skipping detection method. In the experiments, the performance of the proposed MB-skipping method was compared with the baseline encoder JM8.6 with RDO mode enabled. 3 static CIF sequences: akiyo, mother daughter, and hall monitor and 3 active

(36) CIF sequences: foreman, mobile, and stefan were tested for flexibility, rate-distortion performance, and detection characteristics with one reference frame, one slice per frame, and P-slices only at an encoding rate of 30fps. All predictions are applied.

[Figure 2.4: Logic implementation on macroblock-skipping detection. Panels: (a) ACC-based threshold computing logic; (b) proposed low-cost LRT logic.]

The P-frame R-D performance evaluation factors ∆Bitrate and ∆PSNR used throughout the thesis are defined as

$$\Delta\text{Bitrate}(\%) = \frac{\text{P Bits}_{proposed} - \text{P Bits}_{JM}}{\text{P Bits}_{JM}} \times 100\% \tag{2.19}$$

$$\Delta\text{PSNR(dB)} = \text{P PSNR}_{proposed} - \text{P PSNR}_{JM} \tag{2.20}$$

Table 2.2 summarizes the R-D performance and detection characteristics at integer resolution of the SKIP displacement for different encoding scenarios, where the testing false rate, α, is constrained to 1%. For a further comparison at sub-pel resolution, Table 2.3 summarizes the R-D performance and detection characteristics separately for QP = 36 and QP = 42. Figure 2.5 plots the corresponding rate-distortion curves of the test sequences hall monitor and foreman. According to the experiments, the proposed macroblock-skipping method conditionally eliminates the computational redundancy of inter-frame coding, while the rate-distortion performance is substantially preserved.
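The two evaluation factors of Eq. (2.19) and Eq. (2.20) amount to simple arithmetic, sketched below in Python. The function names and the input figures are hypothetical and are not taken from the experiments; a negative ∆Bitrate means the proposed coder spends fewer P-frame bits than JM, and a negative ∆PSNR means slightly lower quality.

```python
def delta_bitrate_percent(p_bits_proposed, p_bits_jm):
    """Relative change in P-frame bits versus the JM reference, Eq. (2.19)."""
    return (p_bits_proposed - p_bits_jm) / p_bits_jm * 100.0


def delta_psnr_db(p_psnr_proposed, p_psnr_jm):
    """Difference in P-frame PSNR versus the JM reference, Eq. (2.20)."""
    return p_psnr_proposed - p_psnr_jm


# Hypothetical measurements for one sequence.
print(delta_bitrate_percent(990.0, 1000.0))  # fewer bits than JM
print(delta_psnr_db(33.25, 33.5))            # slightly lower PSNR than JM
```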

(37) Table 2.2: Detection characteristics and rate-distortion performance in integer resolution, compared with JM 8.6 (CIF@30fps, Baseline+RDO, QP={36, 42})

Sequence (CIF)  | ∆Bitrate (%) | ∆PSNR (dB) | P_D (%) | P_F (%) | MB-skipping (%)
akiyo           | −1.40%       | −0.020     | 86.88%  | 0.99%   | 81.77%
mother daughter | −0.75%       | −0.047     | 86.99%  | 1.07%   | 78.25%
hall monitor    | −4.16%       | −0.070     | 81.07%  | 1.84%   | 74.62%
foreman         | −0.22%       | 0.013      | 54.67%  | 0.88%   | 35.88%
mobile          | −0.42%       | −0.010     | 16.52%  | 1.26%   | 9.52%
stefan          | −0.28%       | −0.004     | 44.77%  | 1.23%   | 22.25%

Table 2.3: Detection characteristics and rate-distortion performance evaluation in fractional resolution, compared with JM 8.6 (CIF@30fps, Baseline+RDO)

Sequence (CIF) | QP | ∆Bitrate (%) | ∆PSNR (dB) | Probability of detection (%) | False rate (%)
akiyo          | 36 | -0.63%       | -0.019     | 82.97%                       | 1.00%
akiyo          | 42 | -2.17%       | -0.021     | 90.78%                       | 0.98%
M&D            | 36 | -0.53%       | -0.023     | 75.03%                       | 1.20%
M&D            | 42 | -0.96%       | -0.071     | 98.95%                       | 0.93%
hall monitor   | 36 | -4.48%       | -0.068     | 70.36%                       | 1.96%
hall monitor   | 42 | -3.84%       | -0.072     | 91.78%                       | 1.72%
foreman        | 36 | -0.14%       | 0.010      | 50.57%                       | 0.91%
foreman        | 42 | -0.29%       | 0.016      | 58.76%                       | 0.84%
mobile         | 36 | -0.10%       | -0.012     | 14.83%                       | 0.99%
mobile         | 42 | -0.73%       | -0.008     | 18.20%                       | 1.52%
stefan         | 36 | -0.15%       | 0.004      | 44.32%                       | 1.07%
stefan         | 42 | -0.41%       | -0.012     | 45.21%                       | 1.38%

2.6. Conclusion

To exploit the coding characteristics of H.264 low-rate applications, this chapter suggests, based on the Lagrangian RDO framework, a low-complexity LRT-based algorithm and the corresponding architecture for MB-skipping detection. The theoretical analyses state some MB-skipping detection criteria and the corresponding methods according to R-D cost properties and the assumption on encoding statistics. To maximize the probability of detection subject to an acceptable false rate, an LRT-based MB-skipping detection technique is therefore presented by a graph investigation

(38) approach, regardless of the infeasibility of probability parameterization. The self-correcting decision threshold is obtained at the constrained rate of false detection related to the “CODE-detected” characteristics. Simulation results indicate that the proposed MB-skipping detection achieves a probability of detection of 16%–86% at a false rate of 1%, with corresponding skipping rates of 10%–81%; the logic count of the corresponding architecture is about 1K gates.

[Figure 2.5: Corresponding rate-distortion curves, PSNR (dB) versus bitrate (P-frames), JM86 versus Proposed ESD. (a) Test sequence: hall monitor. (b) Test sequence: foreman.]
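The idea of maximizing the probability of detection under a constrained false rate can be sketched offline in Python. This is not the self-correcting hardware logic of the thesis: it assumes a hypothetical scalar test statistic per macroblock, assumes that small values favour SKIP, and simply picks the largest threshold whose false rate on macroblocks that should be coded stays within α.

```python
import math


def calibrate_threshold(code_stats, alpha=0.01):
    """Pick a SKIP threshold so that at most alpha of the CODE macroblocks
    fall strictly below it (assumed rule: SKIP if statistic < threshold).

    code_stats: statistic values of macroblocks that should be coded
    (hypothetical training data).
    """
    ordered = sorted(code_stats)
    allowed = math.floor(alpha * len(ordered))  # tolerated false detections
    if allowed >= len(ordered):
        return float("inf")
    # At most `allowed` values are strictly smaller than ordered[allowed].
    return ordered[allowed]


def detection_rates(skip_stats, code_stats, threshold):
    """Return (probability of detection, false rate) for a given threshold."""
    pd = sum(s < threshold for s in skip_stats) / len(skip_stats)
    pf = sum(s < threshold for s in code_stats) / len(code_stats)
    return pd, pf


# Toy data only: 200 CODE statistics and 4 SKIP statistics.
code_stats = list(range(100, 300))
skip_stats = [50, 90, 101, 150]
threshold = calibrate_threshold(code_stats, alpha=0.01)
pd, pf = detection_rates(skip_stats, code_stats, threshold)
```

With the toy data, the threshold lets through two of the 200 CODE macroblocks, so the false rate sits exactly at the 1% constraint while three of the four SKIP macroblocks are detected.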

(39) Chapter 3 Low Power Algorithms

Motion estimation dominates both power consumption and visual quality, which makes it the most crucial component in realizing a video codec. In recent years, a variety of studies on low power algorithms have been proposed to overcome the obstacles to implementation on handset devices. Design at higher levels of abstraction clearly provides more degrees of freedom in terms of both design intent and constraints. Yet most of these designs suffer from poor motion-compensated prediction (MCP) reliability due to excessive attention to hardware-oriented considerations, which in turn degrades system performance. To resolve the reliability problem, this chapter deals mainly with the abstraction of motion estimation algorithms that excel in rate-distortion performance. An advanced estimation strategy using dynamic boundary decision is shown to significantly improve the R-D quality. In addition, the associated computational burden and power dissipation are both largely preserved. These key advantages allow the proposed algorithm to successfully bridge the gap between the requirements of low power and high coding performance in a video codec.

This chapter comprises four sections. Section 3.1 introduces the power measures used in digital CMOS circuits. Section 3.2 describes block-matching motion estimation algorithms and the related design issues, using the implementation in the H.264 reference software as an example. Section 3.3 demonstrates a high-quality motion search algorithm using dynamic boundary decision. Lastly, in Section 3.4, the design parameters for the well-known bitwidth-truncation technique as well as the block sizes of predictions are analyzed to further reduce the power dissipation and implementation overhead.

3.1 The Power Dissipation

Until recent years, power consumption was only a secondary metric of concern in comparison to area and speed. This thinking has begun to change, and power has become a more crucial factor than area and speed with the phenomenal growth of

(40) portable electronics. To clearly state the power consumption, this section gives a brief overview of the sources of power dissipation in digital CMOS circuits. The power consumption in digital CMOS circuits is typically formulated as follows [18]

\[ P_{\mathrm{avg}} = I_{\mathrm{standby}} V_{dd} + I_{\mathrm{leakage}} V_{dd} + I_{sc} V_{dd} + P_{\mathrm{sw.cap.}} \qquad (3.1) \]

where

• the standby current $I_{\mathrm{standby}}$ is the DC current drawn continuously from the power supply ($V_{dd}$) to ground,
• the leakage current $I_{\mathrm{leakage}}$ is primarily determined by the fabrication technology and is caused by
  – the reverse bias current in the parasitic diodes formed between source and drain diffusions and the bulk region in a MOS transistor, and
  – the subthreshold current that arises from the inversion that exists at gate voltages below the threshold voltage,
• the short-circuit current $I_{sc}$ is due to the DC path between the supply rails during output transitions, and
• the last term, $P_{\mathrm{sw.cap.}}$, refers to the capacitive switching power dissipation, caused by the charging and discharging of parasitic capacitances in the circuit.

The Static and Dynamic Power

In general, the power consumed in a digital circuit can be more simply classified into static power dissipation and dynamic power dissipation.

• The term static power dissipation refers to the sum of the standby and leakage power dissipations. Leakage currents in digital CMOS circuits can be reduced with the proper choice of device technology. Standby currents play an important role in design styles like pseudo-nMOS and nMOS pass transistor logic and in memory cores.
• The term dynamic power dissipation refers to the sum of the short-circuit and capacitive switching power. The dynamic power usually accounts for over 70% of the overall power in a non-specialized ASIC (application-specific IC).

Each term is briefed further in the following.

3.1.1 Power in CMOS Logic
The Standby Current

The standby current, and hence standby power consumption, occurs when both the nMOS and the pMOS transistors are continuously on. This could happen, for example, in a

(41) pseudo-nMOS inverter, when the drain of an nMOS transistor is driving the gate of another nMOS transistor in pass-transistor logic, or when the tri-stated input of a CMOS gate leaks away to a value between power supply and ground. The standby power is equal to the product of $V_{dd}$ and the DC current drawn from the power supply to ground.

The Leakage Current

The leakage current is decomposed into two components, as shown in the following equation

\[ I_{\mathrm{leakage}} = I_{\mathrm{diode}} + I_{\mathrm{subthreshold}} \qquad (3.2) \]

The term $I_{\mathrm{diode}}$ refers to the currents flowing through the reverse-biased diodes that are formed between the diffusion regions and the substrate. This term is proportional to the area of the source or drain diffusion and to the leakage current density, and is typically on the order of 1 pA for a 1 µm technology. The term $I_{\mathrm{subthreshold}}$ refers to the currents arising from the fact that transistors that are “off” still conduct some non-zero current. Leakage current becomes a larger and larger problem as geometries shrink and threshold voltages drop. The leakage current in a 0.13µm process with a threshold voltage of 0.7V is about 10–20pA per transistor. In that same process, if the threshold voltage is lowered to 0.2–0.3V, the leakage current skyrockets to 10–20nA per transistor. For a 1M-transistor chip, leakage power can increase from 15µW to 15mW due to the lower threshold. In practical measurement, the leakage power of our chip implementation is about 400µW with a core area of 0.69mm2 in a 0.13µm technology.

Leakage current depends on $V_{dd}$ (or how close it is to the threshold voltage), the threshold voltage itself ($V_t$), the transistor aspect ratio (W/L), and the temperature. Leakage power used to be only 5% for technologies of 0.18µm and above. As the voltage scales down with technology, it has increased exponentially and has become a problem in nanometer technologies.
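The chip-level leakage figures quoted above follow from multiplying the per-transistor current by the transistor count and the supply voltage. A minimal Python sketch of that arithmetic is given here. The function name is hypothetical, the per-transistor currents are taken at the midpoint of the quoted ranges (15 pA and 15 nA), and a supply of 1.0 V is an assumption chosen because it reproduces the quoted 15µW and 15mW.

```python
def leakage_power(num_transistors, leak_per_transistor_a, vdd_v=1.0):
    """Total leakage power in watts: N * I_leak * Vdd.

    vdd_v defaults to 1.0 V, an assumed value not stated in the text.
    """
    return num_transistors * leak_per_transistor_a * vdd_v


# 1M transistors, midpoints of the quoted per-transistor ranges.
high_vt = leakage_power(1_000_000, 15e-12)  # threshold around 0.7 V
low_vt = leakage_power(1_000_000, 15e-9)    # threshold around 0.2-0.3 V
ratio = low_vt / high_vt                    # growth factor from lowering Vt
```

Under these assumptions the high-threshold case gives about 15 µW and the low-threshold case about 15 mW, a thousandfold increase that mirrors the per-transistor jump from picoamperes to nanoamperes.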
In this work, leakage power accounts for the major share of total power, 68% on average. Increasing the die area affects the leakage power adversely, since it increases the number of transistors.

The Short Current

The short current is caused by direct supply-to-ground paths that are created by transients in signal values. For instance, when the input of a CMOS inverter changes from logic 1 to 0, there is a period of time when both the nMOS and pMOS transistors are conducting, leading to a short-circuit current being drawn from the supply. Assuming symmetric rise and fall delays and threshold voltages, the short-circuit power dissipation