Software Implementation of MPEG-4 Object-based Video Encoder on PACDSP Platform


National Chiao Tung University
Department of Electronics Engineering & Institute of Electronics
Master's Thesis

Software Implementation of MPEG-4 Object-based Video Encoder on PACDSP Platform

Student: Cheng-Ta Chiang
Advisor: Dr. David W. Lin

June 2007


Software Implementation of MPEG-4 Object-based Video Encoder on PACDSP Platform

Student: Cheng-Ta Chiang
Advisor: Dr. David W. Lin

A Thesis Submitted to Department of Electronics Engineering & Institute of Electronics, College of Electrical and Computer Engineering, National Chiao Tung University, in Partial Fulfillment of the Requirements for the Degree of Master of Science in Electronics Engineering

June 2007
Hsinchu, Taiwan, Republic of China


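Software Implementation of MPEG-4 Object-based Video Encoder on PACDSP Platform

Student: Cheng-Ta Chiang
Advisor: Dr. David W. Lin

Department of Electronics Engineering & Institute of Electronics
National Chiao Tung University

Abstract (Chinese)

MPEG-4 is a widely used multimedia signal compression standard. This thesis presents the implementation of an MPEG-4 object-based video encoder on the PACDSP v3.0 platform, which consists of a very long instruction word (VLIW) digital signal processor and an ARM926EJ-S processor. To optimize the program flow, we carried out a number of static analyses, and we exploit the architectural features of the VLIW processor to achieve real-time encoding. We can present a simple demonstration on the ARM platform, and we verify the correctness of the DSP part on the instruction set simulator. In our implementation, we use the MPEG-4 reference software, MoMuSys, as the reference for verification. First, we analyze the statistical characteristics of the MPEG-4 object-based video encoder and gain an initial understanding of the encoding flow. Next, we analyze the computational complexity of the encoding and use the results to find efficient implementation methods. In motion estimation, we use a parameter of the spiral search to reduce the computational complexity without sacrificing much image quality. In shape coding, we use multi-symbol context-based arithmetic encoding (CAE) to compress the binary shape information, and we adjust the inter coding mode to reduce the computational complexity. In texture coding, we skip redundant computations according to the properties of the discrete cosine transform (DCT).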

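To speed up execution, we distribute the regular computations over both clusters to increase the efficiency of the processor. We also use single instruction multiple data (SIMD) instructions and general instruction-level parallelism to reduce processor stalls. We discuss the efficiency and accuracy of the DCT and the inverse DCT (IDCT), and our IDCT implementation meets the requirements of the IEEE 1180-1990 standard. After all optimizations, in the best case we achieve real-time encoding of QCIF frames at 33 and 43 frames per second in the intra and inter coding modes, respectively. The size of the whole program is 27 Kbytes, which is smaller than the 32-Kbyte program cache of the PACDSP. In this thesis, we first introduce the MPEG-4 standard and give an overview of the PACDSP platform. We then discuss the static analysis, the optimization methods, the overall implementation design, and the experimental results. Finally, we briefly introduce the system and mechanism of the dual-core implementation.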

Software Implementation of MPEG-4 Object-based Video Encoder on PACDSP Platform

Student: Cheng-Ta Chiang
Advisor: Dr. David W. Lin

Department of Electronics Engineering & Institute of Electronics
National Chiao Tung University

Abstract

MPEG-4 is a widely applied multimedia coding standard. This thesis presents an implementation of an MPEG-4 object-based video encoder on the PACDSP v3.0 platform, which consists of a VLIW digital signal processor (DSP) and an ARM926EJ-S processor. We carry out a number of analyses to optimize the program flow and exploit the advantages of the VLIW processor to achieve real-time encoding. We have built a simple demonstration on the ARM core, and the encoding on the DSP part is verified with the instruction set simulator. In our implementation, the MPEG-4 reference software, MoMuSys, is used as a golden model to verify our results. First, we analyze the statistics of the MPEG-4 object-based video encoder and gain an initial understanding of the encoding flow. Second, we analyze the computational complexity of the coding and find efficient algorithms for the implementation. In motion coding, we use a parameter of the spiral search to reduce the computational complexity without much quality loss. In shape coding, we use multi-symbol CAE to compress the binary shape information and modify the inter mode coding to reduce the computational complexity.

In texture coding, we skip some computations according to the nature of the discrete cosine transform (DCT). Third, to shorten the execution time, we distribute the regular computations to both clusters to increase the efficiency of the processor. Single instruction multiple data (SIMD) instructions and general instruction-level parallelism are also utilized to reduce processor stalls. We also discuss the efficiency and accuracy of the DCT and IDCT, and the accuracy of our IDCT implementation meets the IEEE 1180-1990 standard. After all the optimizations, in the best case we can encode MPEG-4 video in QCIF format at more than 33 and 43 frames per second for intra and inter encoding, respectively. The code size is 27 Kbytes, which is smaller than the 32-Kbyte instruction cache of the PACDSP. In this thesis, we first introduce the MPEG-4 standard and give an overview of the PACDSP platform. Then the static analysis, the optimization methods, the overall implementation design, and the experimental results are discussed. Finally, we briefly describe the system and mechanism for the dual-core implementation on the PACDSP platform.

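Acknowledgments

For the completion of this thesis, I sincerely thank my advisor, Dr. David W. Lin. Ever since I entered the Institute of Electronics at National Chiao Tung University, his patient guidance has not only helped me in my coursework and research but also taught me how to analyze and solve problems. His optimistic attitude toward life has influenced me as well and has given me more courage to face all kinds of difficulties. I express my deepest gratitude to him and to his family.

I also thank 蔡崇諺 and 吳和璋, senior members of the laboratory, for generously helping me resolve questions in many areas. I thank the Communication Electronics and Signal Processing Laboratory (commlab) for providing ample software and hardware resources, so that I lacked nothing in my research. I thank the senior Ph.D. students 崑健, 俊榮, 鴻志, 家揚, 朝雄, and others for their guidance, and the laboratory members of the class of 94, 介遠, 志岡, 柏昇, 耀鈞, 順成, 凱庭, 錫祺, 浩廷, 育成, and 耀仚, who studied with me, discussed with me, and also goofed off with me, making my graduate life both joyful and rewarding. I hope everyone does well after graduation.

Finally, I thank my family, whose support allowed me to devote myself to research without distraction. I also thank my girlfriend, 牛怡婷, who accompanied me throughout my studies and kept encouraging me when I was under pressure. Thank you to all the teachers, peers, and family members who helped me and walked with me through these years. Thank you!

Cheng-Ta, June 2007, National Chiao Tung University, Hsinchu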


Contents

1 Introduction
2 Overview of the MPEG-4 Video Standard
  2.1 Structure of MPEG-4 Video Data
  2.2 MPEG-4 Video Texture Coding
    2.2.1 VOP Formation
    2.2.2 Shape Coding
    2.2.3 Motion Coder
    2.2.4 Texture Coder
    2.2.5 Other Video Coding Tools [6]
  2.3 Profiles and Levels [5]
3 Overview of PACDSP
  3.1 Introduction
  3.2 ISA and Pipeline Stages
  3.3 Program Sequence Control Unit
    3.3.1 Branch Instructions
    3.3.2 Loops
    3.3.3 Customized Function Units (CFUs)
    3.3.4 Exception Handling
    3.3.5 Interrupt Handling
  3.4 Scalar Unit
    3.4.1 General Purpose Scalar Register File
    3.4.2 System Register and Predication Register
  3.5 VLIW Datapath
    3.5.1 Ping-Pong Register File
    3.5.2 Address/Accumulator Registers
    3.5.3 Constant Registers
    3.5.4 Status and Control Registers
    3.5.5 Addressing Modes
    3.5.6 Data Communication
  3.6 Conditional Execution Control
  3.7 Instruction Packet
  3.8 DSP Running Modes
  3.9 Versions of PACDSP
    3.9.1 Pipeline Stages
    3.9.2 Program Sequence Control Unit (PSCU)
    3.9.3 VLIW Data Path
    3.9.4 Conditional Execution Control
    3.9.5 Instruction Set
  3.10 Dual-Core Platform and the Tool Chains
4 C Code Development and System Design
  4.1 Initial Code Development
    4.1.1 Profile Using the Profiler of ADS
  4.2 Motion Coder Analysis and Design
    4.2.1 Modification of Search Order
    4.2.2 Use of A Tier Parameter
  4.3 Shape Coder Analysis and Design
    4.3.1 Multi-Symbol CAE [10]
    4.3.2 Modification of Mode Selection Method
    4.3.3 Reducing Candidates for MVs
  4.4 Texture Coder Analysis and Design
  4.5 Implementation Strategy on Dual-Core Platform
5 Optimization of Implementation on PACDSP
  5.1 General Techniques of Code Optimization
    5.1.1 General Optimization Techniques
    5.1.2 Features of PACDSP
  5.2 Fixed-Point DCT and IDCT
  5.3 Fixed-Point Quantization
    5.3.1 H.263 Quantization Method
    5.3.2 Lossless Fixed-Point Quantization Method
  5.4 Implementation of SAD Calculation Using SIMD
  5.5 Simulation Results on PACDSP Instruction Set Simulator (ISS)
    5.5.1 Statistics of Motion Estimation on ISS
    5.5.2 Statistics of Shape Coding on ISS
  5.6 Conclusion
6 Performance Analysis and Implementation Results
  6.1 Performance Analysis
    6.1.1 Code Size
    6.1.2 Data Size
    6.1.3 Frame Rate Estimation
  6.2 Coding Quality and Bit Rates for Different QP
7 Conclusion and Future Work
  7.1 Conclusion
  7.2 Future Work

List of Figures

2.1 Segmentation of a frame into VOPs (from [6])
2.2 Structure of coded video data (from [7])
2.3 Types of VOP
2.4 Positions of luminance and chrominance samples in 4:2:0 data (from [8])
2.5 Detailed structure of VO encoder (from [6])
2.6 A VOP in bounding box (from [6])
2.7 Pixel templates used for (a) INTRA and (b) INTER context calculation of BAB. The current pixel to be coded is marked with "?" (from [5])
2.8 Simplified padding process (from [5])
2.9 Priority of boundary MBs surrounding an exterior MB (from [5])
2.10 Interpolation scheme for half sample search (from [5])
2.11 Motion vector prediction (from [8])
2.12 Quantizers in H.263. (a) For intra DC coefficient only. (b) For inter DC and all AC coefficients
2.13 Prediction of DC coefficients of blocks in an intra MB (from [6])
2.14 Prediction of AC coefficients of blocks in an intra MB (from [6])
2.15 Scans for 8 × 8 blocks (from [5])
3.1 Architecture of the PACDSP [2]
3.2 PACDSP instruction set architecture [4]
3.3 Pipeline stages of the PACDSP [4]
3.4 The VLIW datapath register organization [2]
3.5 The four-way VLIW datapath of PACDSP [2]
3.6 Address register file [2]
3.7 Data exchange between two clusters [2]
3.8 Data broadcast among clusters [2]
3.9 Syntax of instruction packet [3]
3.10 Simplified syntax of instruction packet [3]
3.11 Pipeline stages of PACDSP v2.0 [3]
3.12 Dual-Core platform of PAC v3.0 system
4.1 Flow of software development
4.2 Concept of spiral search
4.3 Dataflow of spiral search with tier parameter
4.4 Early termination percentage of SAD calculation with different tier para values
4.5 PSNR values with different tier para values
4.6 Examples of counting S0(n) in INTRA mode (from [10])
4.7 Distribution of SC(n) and S0(n) (from [10])
4.8 Flowchart of multi-symbol CAE
4.9 Performance improvement by multi-symbol CAE
4.10 Candidates for MVPs
4.11 Skip ratio (SR) with different thresholds
4.12 DC spreading from quantized coefficient to output block
4.13 PACDSP v3.0 system
4.14 Our basic dual-core software encoder design
5.1 Example of vector addition
5.2 Example of static rescheduling technique
5.3 Example of loop unrolling technique
5.4 Example of software pipelining technique
5.5 The IDCT algorithm used in MoMuSys [9]
5.6 The even-odd decomposition IDCT algorithm [13]
5.7 The even-odd decomposition DCT algorithm [13]
5.8 An example code for 16×16 SAD calculation in PACDSP
5.9 The syntax and operation of SAA.Q instruction
5.10 Assembly code of masked 16×16 SAD calculation in our implementation

List of Tables

2.1 List of BAB Types (from [5])
2.2 Shape Coding Modes and Their Main Usages (from [5])
2.3 Default Quantization Matrix (Q) [5]
2.4 Nonlinear Scaler for DC Coefficients (from [5])
2.5 Profiles and Tools in MPEG-4 Video (from [5])
3.1 Pipeline Stages and Their Jobs
3.2 System Register File [2]
3.3 Definitions of AMCR (from [2])
3.4 Syntax of Address Modes and Supporting Units [3]
3.5 Instruction Type in Each Instruction Slot
3.6 Running Modes of the PACDSP v3.0 [2]
3.7 Modified Load/Store Instructions from PACDSP v2.0 to PACDSP v3.0
3.8 Comparison Instructions Supported by PACDSP v2.0 and PACDSP v3.0
4.1 Functionalities of Our Implementation
4.2 Profile of Object-Based MPEG-4 Encoding of QCIF I-VOP on ADS
4.3 Profile of Object-Based MPEG-4 Encoding of QCIF P-VOP on ADS
4.4 Major Function in Motion Estimation (ME)
4.5 Percentage of Early Termination in SAD Calculation Under Different Scan Orders
4.6 CAE Modes and Associated VOP Types
4.7 Simulation Results of Skip Ratio (SR) and Shape Bits per VOP (bpv)
4.8 Execution Results on ADS of Reduced-Complexity ShapeInterMB Function
4.9 Number of Skipped Blocks in 101 Frames (1 I, 100 P)
5.1 Comparison of Computational Complexity for 8-point IDCT
5.2 Test of Compliance for Modified IEEE Std. 1180-1990 in MPEG-4
5.3 Comparison of IDCT on Different Platforms
5.4 Comparison of DCT on Different Platforms
5.5 Fixed-Point Quantization Table
5.6 Comparison of SAD Implementation on Different Platforms
5.7 Execution Time of Motion Estimation for 1 P-VOP of QCIF on ISS
5.8 Execution Time of Shape Coding for 1 P-VOP of QCIF on ISS
5.9 Execution Time of P-VOP Motion Estimation and Shape Coding after Algorithm Optimization on ARM926EJ-S
5.10 Execution Time of P-VOP Motion Estimation and Shape Coding after Optimization on PACDSP
6.1 Code Size Profile of Object-Based MPEG-4 Video Encoder on PACDSP
6.2 Data Size Profile of Object-Based MPEG-4 Video Encoder on PACDSP
6.3 Frame Rate Estimation of Single-Core Implementation
6.4 Frame Rate Estimation for Intra Encoding of Dual-Core Implementation
6.5 Frame Rate Estimation for Inter Encoding of Dual-Core Implementation
6.6 Effects on Quality and Bit-Rate of Different QP values

Chapter 1 Introduction

Compression of audio-visual information is becoming increasingly important, especially for applications on mobile devices. Digital signal processors (DSPs) are also widely used in these devices. Our goal is to implement an MPEG-4 video encoder on a dual-core platform that contains an ARM core and a PACDSP core.

The MPEG-4 standard for coding of audio-visual information has been widely adopted in various consumer products. The standard contains several tools, which serve different purposes. Since the present work is the first attempt to implement MPEG-4 video codecs on the dual-core system, we decide to implement the object-based part (arbitrary binary shape) of the MPEG-4 encoder first and to support the tools in the simple profile without error resilience. Some video tools of the MPEG-4 video standard are left to future work.

PACDSP is a high-performance, low-cost VLIW (Very Long Instruction Word) DSP for multimedia applications [3]. Its architecture is optimized for data stream applications, which gives system designers a strong reason to use PACDSP to implement media codecs. The instruction set architecture (ISA) of PACDSP is optimized for audio and video applications, so PACDSP is suitable for products with multi-standard codec requirements. In addition, the low-power design of PACDSP makes it possible to use it in portable devices.

In the best case of our dual-core implementation, we can encode MPEG-4 video in QCIF size at more than 33 frames and 43 frames per second for intra and inter encoding, respectively.

This thesis is organized as follows. Chapter 2 gives an overview of the MPEG-4 standard. Chapter 3 introduces the architecture and specification of the PACDSP platform. Chapter 4 describes the development of our C code and the overall system design of our implementation. The algorithm analysis of the MPEG-4 video encoder is also discussed in that chapter. Chapter 5 covers the architecture-specific optimization techniques and their experimental results. We also compare our implementation with implementations on other processors. The performance of the dual-core implementation is shown in Chapter 6. Finally, Chapter 7 gives some conclusions and lists future work.

Chapter 2 Overview of the MPEG-4 Video Standard

The contents of this chapter have been taken to a large extent from [5]–[8]. The MPEG-4 video standard provides core technologies allowing efficient storage, transmission, and manipulation of video data in multimedia applications. It provides technologies to view, access, and manipulate objects, with great error robustness over a large range of bit rates. Video activities in MPEG-4 aim at providing solutions in the form of tools and algorithms enabling functionalities such as efficient compression, object scalability, spatial and temporal scalability, error resilience, and fine granularity scalability.

2.1 Structure of MPEG-4 Video Data

An input video sequence can be defined as a sequence of related frames or pictures, separated in time. MPEG-4 divides a frame into a number of video object planes (VOPs). A succession of VOPs is termed a video object (VO). Fig. 2.1 shows the decomposition of a picture into a number of separate VOPs. Each VO is encoded separately and multiplexed to form a bitstream that users can access and manipulate. Together with the VOs, the encoder sends information about scene composition to indicate where and when the VOPs of a VO are to be displayed. Fig. 2.2 shows the organization of the coded MPEG-4 video data in a top-down hierarchical structure.

Figure 2.1: Segmentation of a frame into VOPs (from [6]).

Figure 2.2: Structure of coded video data (from [7]).

• VideoSession (VS): A video session is the highest syntactic structure of the coded visual bitstream and simply consists of an ordered collection of video objects.

• VideoObject (VO): A video object represents a complete scene or a portion of a scene with semantic meaning. In the simplest case this can be a rectangular frame, or it can be an arbitrarily shaped object corresponding to a physical object or the background of the scene.

• VideoObjectLayer (VOL): Each video object can be encoded in scalable (multi-layer) or non-scalable (single-layer) form, depending on the application, represented by the VOL. The VOL provides support for scalable coding. A video object can be encoded using spatial or temporal scalability, going from coarse to fine resolution.

• GroupOfVideoObjectPlanes (GOV): Groups of video object planes are optional entities. The GOV groups video object planes together. GOVs can provide points in the bitstream where VOPs are encoded independently of one another, and can thus provide random access points into the bitstream.

• VideoObjectPlane (VOP): A VOP is a time sample of a video object. In the MPEG-4 standard, there are four types of VOP, as illustrated in Fig. 2.3. These are briefly explained below:

1. An intra-coded (I) VOP is coded using information only from itself.
2. A predictive-coded (P) VOP is coded using motion compensated prediction from a past reference VOP.
3. A bidirectionally predictive-coded (B) VOP is coded using motion compensated prediction from a past and/or future reference VOP(s).
4. A sprite (S) VOP is a VOP for a sprite object or a VOP that is coded using prediction based on global motion compensation from a past reference VOP. We omit further introduction of the S VOP.
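The containment hierarchy listed above can be sketched as nested data classes. This is only an illustration of the nesting VS, VO, VOL, GOV, VOP; the class names, the `time_index` field, and the helper `count_vops` are hypothetical and do not come from the MPEG-4 syntax or from MoMuSys.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class VopType(Enum):
    """The four VOP types described in the text."""
    I = "intra-coded"
    P = "predictive-coded"
    B = "bidirectionally predictive-coded"
    S = "sprite"


@dataclass
class VideoObjectPlane:
    vop_type: VopType
    time_index: int  # hypothetical field: sampling instant of this VOP


@dataclass
class GroupOfVops:
    vops: List[VideoObjectPlane] = field(default_factory=list)


@dataclass
class VideoObjectLayer:
    govs: List[GroupOfVops] = field(default_factory=list)


@dataclass
class VideoObject:
    layers: List[VideoObjectLayer] = field(default_factory=list)


@dataclass
class VideoSession:
    objects: List[VideoObject] = field(default_factory=list)


def count_vops(session: VideoSession) -> int:
    """Walk the hierarchy top-down and count all VOPs."""
    return sum(
        len(gov.vops)
        for vo in session.objects
        for vol in vo.layers
        for gov in vol.govs
    )
```

A session with one object, one layer, and one GOV holding an I-, a P-, and a B-VOP would report three VOPs.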

Figure 2.3: Types of VOP (diagram labels: I-frame, P-frame, B-frame).

Figure 2.4: Positions of luminance and chrominance samples in 4:2:0 data (from [8]).

The macroblock (MB) is the basic coding structure from which a VOP is constructed. An MB contains a section of the luminance component of 16 × 16 pixels in size, and the sub-sampled chrominance components in 4:2:0 format. The luminance and chrominance samples are positioned as shown in Fig. 2.4. In this format, an MB is divided into 4 luminance blocks and 2 chrominance blocks, each 8 × 8 pixels in size.

2.2 MPEG-4 Video Texture Coding

The contents of this section have been taken to a large extent from [5]–[8].

Figure 2.5: Detailed structure of VO encoder (from [6]).

Fig. 2.5 presents the detailed structure of the VO encoder. The encoder is mainly composed of three parts: the shape coder, the motion coder, and the texture coder. The reconstructed VOP is obtained by combining the shape, texture, and motion information. Shape coding constitutes the major difference between frame-based and object-based coding.

2.2.1 VOP Formation

The video object shape information is obtained after segmentation. The shape information is hereafter referred to as the alpha plane, which is used to form a VOP. There are two kinds of alpha planes in MPEG-4: the binary alpha plane and the gray scale alpha plane. For the binary alpha plane, the value 255 is assigned to pixels belonging to the objects and 0 is assigned to pixels outside the objects. The gray scale alpha plane is used for hybrid (natural and synthetic) scenes generated by blue screen composition and is represented by an 8-bit component.

For the binary alpha plane, a rectangular bounding box enclosing the shape to be coded is formed such that its horizontal and vertical dimensions are extended to multiples of 16 pixels (the MB size). For efficient coding, it is important to minimize the number of macroblocks contained in the bounding box. Fig. 2.6 shows an example of an arbitrarily shaped VOP with its bounding box and the MB structure.

Figure 2.6: A VOP in bounding box (from [6]).

2.2.2 Shape Coding

After VOP formation, the alpha plane of the VOP is coded, based on the VOP bounding box, before the motion vectors and the texture. Binary alpha planes are encoded by modified context-based arithmetic encoding (CAE), while gray scale alpha planes are encoded by motion compensated DCT similar to texture coding. An alpha plane is also bounded by an extended rectangular bounding box. The bounded alpha plane is partitioned into blocks of 16 × 16 samples called alpha blocks, and the encoding/decoding process is done per alpha block.

Binary Shape Coding

CAE and motion compensation are the basic tools for encoding binary alpha blocks (BABs), which are the primary unit in binary shape coding. InterCAE and IntraCAE are the variants of the CAE algorithm used with and without motion compensation, respectively. The motion vectors, which are differentially coded, can be computed by searching for a best match position. Each BAB can be coded in one of the following modes:

1. The block is all transparent. In this case no coding is necessary. Texture information is not coded for such blocks either.
2. The block is all opaque. Shape coding is not necessary in this case, but texture information needs to be coded.
3. The block is coded using IntraCAE without use of past information.
4. The motion vector difference (MVD) is zero, but the block is not updated.
5. The MVD is non-zero, but the block is not updated.
6. The MVD is zero and the block is updated. InterCAE is used for coding the block update.
7. The MVD is non-zero, and the block is coded by InterCAE.

Table 2.1 shows the BAB types and the VOP types in which they are used.

Table 2.1: List of BAB Types (from [5])

BAB Type   Semantic                  Used in
0          MVDs==0 and No Update     P-, B-, and S(GMC)-VOPs
1          MVDs!=0 and No Update     P-, B-, and S(GMC)-VOPs
2          Transparent               All VOP Types
3          Opaque                    All VOP Types
4          IntraCAE                  All VOP Types
5          MVDs==0 and InterCAE      P-, B-, and S(GMC)-VOPs
6          MVDs!=0 and InterCAE      P-, B-, and S(GMC)-VOPs

Note: GMC = Global Motion Compensation.

If the encoder needs rate control and rate reduction, it realizes them through size conversion of the binary alpha information. Specifically, a 4:1 down-sampled binary alpha block is tried first. If the shape errors are greater than a designed threshold value, a 2:1 down-sampled binary alpha block is tried next. If that is also found unacceptable, an unsubsampled binary alpha block is used.
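The size-conversion cascade described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure of the standard or of MoMuSys: down-sampling uses a majority vote, up-sampling uses plain pixel replication (the standard specifies its own filters), and the shape error is simply the number of mismatched pixels. All function names are hypothetical.

```python
def downsample(bab, factor):
    """Majority-vote down-sampling of a square binary block (pixels are 0 or 1)."""
    n = len(bab) // factor
    out = []
    for by in range(n):
        row = []
        for bx in range(n):
            ones = sum(
                bab[by * factor + dy][bx * factor + dx]
                for dy in range(factor)
                for dx in range(factor)
            )
            # Assumption: a tie counts as opaque.
            row.append(1 if 2 * ones >= factor * factor else 0)
        out.append(row)
    return out


def upsample(small, factor):
    """Pixel replication back to full size (simplifying assumption)."""
    size = len(small) * factor
    return [[small[y // factor][x // factor] for x in range(size)]
            for y in range(size)]


def shape_error(a, b):
    """Number of pixels that differ between two blocks of equal size."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


def choose_conversion(bab, threshold):
    """Try 4:1 first, then 2:1, and fall back to no subsampling (factor 1)."""
    for factor in (4, 2):
        approx = upsample(downsample(bab, factor), factor)
        if shape_error(bab, approx) <= threshold:
            return factor
    return 1
```

With a zero threshold, a fully opaque block is accepted at 4:1, a block made of uniform 2 × 2 tiles is accepted at 2:1, and a pixel-level checkerboard falls through to the unsubsampled case.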

Table 2.2: Shape Coding Modes and Their Main Usages (from [5])

Mode                                           Main Usage
1. intra                                       I frames, arbitrarily shaped still texture object, error resilience
2. inter, inter MC                             P frames
3. horizontal/vertical scanning                Low-bitrate shape coding
4. Subsampling to a block size 8×8 or 4×4      Low-bitrate lossy shape coding

The MPEG-4 standard allows for 18 coding modes for each BAB: (intra/inter/inter MC) × (horizontal/vertical scanning) × (subsampling factor 0/1/2). The influence of the shape coding modes on both the coding efficiency and the computational complexity of the coder is of interest. Table 2.2 shows the main usage of each coding mode.

CAE is used to code each binary pixel of the BAB. Prior to coding the first pixel, the arithmetic encoder is initialized. Each binary pixel is then encoded in raster order. The process for encoding a given pixel is as follows:

1. Compute a context number.
2. Index a probability table using the context number.
3. Use the indexed probability to drive an arithmetic encoder.

When the final pixel has been processed, the arithmetic code is terminated. Fig. 2.7 shows the templates for the context calculation for INTRA and INTER modes.

Gray Scale Shape Coding

Gray scale shape coding has a structure similar to that of binary shape coding, with the difference that each pixel can take on a range of values (usually 0 to 255) representing the degree of transparency of that pixel. The pixel value 0 corresponds to a completely transparent pixel and 255 to a completely opaque pixel. Intermediate values correspond to intermediate degrees of transparency.
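The three per-pixel CAE steps listed above for binary shape can be sketched as below. Several things here are assumptions and should be checked against [5]: the 10-pixel INTRA template offsets are our reading of Fig. 2.7(a), pixels outside the block are treated as 0 instead of being taken from neighboring BABs, the probability table `prob_zero` is a hypothetical placeholder for the table of the standard, and the arithmetic encoder is replaced by the ideal code length, the sum of -log2(p), as a stand-in.

```python
import math

# Assumed INTRA template: (dx, dy) offsets of context pixels c0..c9
# relative to the current pixel; c0 is the left neighbor.
INTRA_TEMPLATE = [
    (-1, 0), (-2, 0),
    (2, -1), (1, -1), (0, -1), (-1, -1), (-2, -1),
    (1, -2), (0, -2), (-1, -2),
]


def pixel(bab, x, y):
    """Pixel value (0 or 1), or 0 outside the block (simplified border rule)."""
    if 0 <= y < len(bab) and 0 <= x < len(bab[0]):
        return bab[y][x]
    return 0


def intra_context(bab, x, y):
    """Step 1: pack the template pixels into a 10-bit context number."""
    ctx = 0
    for k, (dx, dy) in enumerate(INTRA_TEMPLATE):
        ctx |= pixel(bab, x + dx, y + dy) << k
    return ctx


def ideal_code_length(bab, prob_zero):
    """Visit pixels in raster order and accumulate -log2(p) in bits.

    prob_zero[ctx] is the assumed probability that the pixel is 0 in context ctx.
    """
    bits = 0.0
    for y in range(len(bab)):
        for x in range(len(bab[0])):
            ctx = intra_context(bab, x, y)            # step 1
            p0 = prob_zero[ctx]                       # step 2
            p = p0 if bab[y][x] == 0 else 1.0 - p0    # step 3 (stand-in)
            bits += -math.log2(p)
    return bits
```

A real encoder would feed each indexed probability to an arithmetic coder; the code length computed here only indicates how many bits such a coder would ideally spend.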

Figure 2.7: Pixel templates used for (a) INTRA and (b) INTER context calculation of BAB. The current pixel to be coded is marked with "?" (from [5]).

2.2.3 Motion Coder

Motion coding is essential for P-VOPs and B-VOPs to reduce temporal redundancy. The motion coder consists of a motion estimator, a motion compensator, a store for the previous/next VOPs, and a motion vector (MV) predictor and coder. Furthermore, in order to perform motion prediction for a VOP of arbitrary shape, a special padding technique is required for the reference VOP before motion estimation.

Padding Process

Fig. 2.8 shows a simplified diagram of the padding process. The values of the luminance and chrominance samples outside the VOP are defined by the padding process. A decoded MB d[y][x] is padded by referring to the corresponding decoded shape block s[y][x]. An MB that lies on the VOP boundary is padded by replicating the boundary samples of the VOP towards the exterior. This process is divided into horizontal repetitive padding and vertical repetitive padding. The remaining MBs that are completely outside the VOP are filled by extended padding.

• Horizontal repetitive padding: Each sample at the boundary of a VOP is replicated horizontally to the left and/or right in order to fill the transparent region outside the VOP of a boundary block. If there are two boundary sample values for filling, the two sample values are averaged.

• Vertical repetitive padding: The transparent region left unfilled by the above procedure is padded by a process similar to horizontal repetitive padding, but in the vertical direction.

• Extended padding: Exterior MBs immediately next to boundary MBs are filled by replicating the samples at the border of the boundary MBs. If an exterior MB is next to more than one boundary MB, one of the boundary MBs is chosen according to the priority shown in Fig. 2.9. The remaining exterior MBs (not located next to any boundary MB) are filled with 128.

Figure 2.8: Simplified padding process (from [5]).

Figure 2.9: Priority of boundary MBs surrounding an exterior MB (from [5]).

Motion Estimation

Motion estimation (ME) is a method of prediction between adjacent frames/pictures. In general, the ME techniques used in MPEG-4 can be seen as an extension of the block-based matching techniques of standard MPEG-1/2 or H.263, with modified block (polygon) matching to handle arbitrarily shaped VOPs. For an arbitrarily shaped VOP, the bounded VOP is first extended on the right and bottom sides to multiples of the MB size. The alpha value of the extended pixels is set to zero. The sum of absolute differences (SAD) is used as the error measure and is computed only for the pixels with nonzero alpha values.

The basic motion estimation may be performed on 16 × 16 luminance MBs. The motion vector is specified to half-pixel accuracy. In many software implementations, the motion estimation first finds an integer-pixel vector by full search and then, using it as the initial estimate, performs a half-pixel search around it. Interpolation of the MB is necessary because the motion vector may be non-integer. Fig. 2.10 illustrates the bilinear interpolation method. In the MPEG-4 standard, besides the motion vector for a 16 × 16 MB, motion vectors can be sent for individual 8 × 8 blocks to further reduce prediction errors.
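The masked SAD and the integer-pixel full search described above can be sketched as follows. This is a minimal illustration under assumptions, not the spiral search used later in the thesis: blocks are plain lists of rows, the block size is taken from the current block, candidates that fall outside the reference picture are skipped, and ties keep the first candidate visited. The function names are hypothetical.

```python
def masked_sad(cur, ref, alpha, rx, ry):
    """SAD between the current block and the reference block whose top-left
    corner is (rx, ry), counting only pixels with nonzero alpha."""
    total = 0
    for y, row in enumerate(cur):
        for x, value in enumerate(row):
            if alpha[y][x]:
                total += abs(value - ref[ry + y][rx + x])
    return total


def full_search(cur, ref, alpha, x0, y0, search_range):
    """Integer-pixel full search around (x0, y0).

    Returns (best_sad, dx, dy), or None if no candidate lies inside ref.
    """
    size = len(cur)
    best = None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            rx, ry = x0 + dx, y0 + dy
            # Skip candidates that would read outside the reference picture.
            if rx < 0 or ry < 0 or ry + size > len(ref) or rx + size > len(ref[0]):
                continue
            sad = masked_sad(cur, ref, alpha, rx, ry)
            if best is None or sad < best[0]:
                best = (sad, dx, dy)
    return best
```

The resulting integer vector would then serve as the starting point of the half-pixel refinement.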

Figure 2.10: Interpolation scheme for half sample search (from [5]). A, B, C, and D denote integer pixel positions and a, b, c, and d denote half pixel positions, with

$$a = A, \qquad b = (A + B + 1 - \mathit{rounding\_control})/2,$$
$$c = (A + C + 1 - \mathit{rounding\_control})/2, \qquad d = (A + B + C + D + 2 - \mathit{rounding\_control})/4.$$

Motion Vector Encoder

The motion vector must be coded when INTER mode coding is used. The horizontal and vertical motion vector components are coded differentially by using a spatial neighborhood of three motion vectors that have already been coded (see Fig. 2.11). These three motion vectors are candidate predictors for differential coding. The differential coding of motion vectors is performed with reference to the reconstructed shape. In the special cases at the borders of the current VOP, the following decision rules are applied:

1. If the MB of one and only one candidate predictor is outside the VOP, it is set to zero.
2. If the MBs of two and only two candidate predictors are outside the VOP, they are set to the third candidate predictor.
3. If the MBs of all three candidate predictors are outside the VOP, they are set to zero.

For the horizontal and vertical components, the median value of the three candidates for the same component is used as the predictor, denoted $P_x$ and $P_y$, respectively:

$$P_x = \mathrm{Median}(MV1_x, MV2_x, MV3_x), \qquad P_y = \mathrm{Median}(MV1_y, MV2_y, MV3_y).$$

Then the vector differences, $MVD_x = MV_x - P_x$ and $MVD_y = MV_y - P_y$, are coded by variable-length coding (VLC).
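The border rules and the median predictor above can be sketched as follows. This is a minimal Python illustration, not the reference implementation: the names `predict_mv` and `mv_difference` are hypothetical, a candidate outside the VOP is represented by `None`, and motion vectors are assumed to be integer pairs (for example, in half-pixel units).

```python
def predict_mv(candidates):
    """Median predictor (Px, Py) from three candidate motion vectors.

    `candidates` holds MV1, MV2, MV3 as (x, y) tuples; an entry is None
    when its MB lies outside the VOP (a convention of this sketch).
    """
    valid = [mv for mv in candidates if mv is not None]
    if len(valid) == 3:
        filled = list(candidates)
    elif len(valid) == 2:
        # Rule 1: the single outside candidate is set to zero.
        filled = [mv if mv is not None else (0, 0) for mv in candidates]
    elif len(valid) == 1:
        # Rule 2: both outside candidates are set to the third candidate.
        filled = [valid[0]] * 3
    else:
        # Rule 3: all candidates are set to zero.
        filled = [(0, 0)] * 3
    px = sorted(mv[0] for mv in filled)[1]  # median of three values
    py = sorted(mv[1] for mv in filled)[1]
    return px, py


def mv_difference(mv, candidates):
    """Vector difference (MVDx, MVDy) that would be passed to the VLC stage."""
    px, py = predict_mv(candidates)
    return mv[0] - px, mv[1] - py
```

Note that the median is taken per component, so the predictor need not equal any one of the three candidate vectors.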

Figure 2.11: Motion vector prediction (from [8]). MV: current motion vector; MV1: previous motion vector; MV2: above motion vector; MV3: above-right motion vector.

Motion Compensation

The motion compensator uses the motion vectors to compute the motion-compensated prediction block, pred[i][j], from the reference VOP. In addition to basic motion compensation, three alternatives are supported, namely unrestricted motion compensation, four-MV motion compensation, and overlapped motion compensation.

For unrestricted motion compensation, the motion vectors are allowed to point outside the decoded area of a reference VOP. The reference sample coordinates used for pred[i][j] are defined as follows:

$$x_{ref} = \min(\max(x_{curr} + dx,\ vhmcsr),\ xdim + vhmcsr - 1),$$
$$y_{ref} = \min(\max(y_{curr} + dy,\ vvmcsr),\ ydim + vvmcsr - 1),$$

where vhmcsr = vop_horizontal_mc_spatial_ref, vvmcsr = vop_vertical_mc_spatial_ref, $(y_{curr}, x_{curr})$ is the coordinate of a sample in the current VOP, $(y_{ref}, x_{ref})$ is the coordinate of a sample in the reference VOP, $(dy, dx)$ is the motion vector, and $(ydim, xdim)$ is the dimension of the bounding rectangle of the reference VOP.

The one/two/four vector decision is indicated by the MCBPC codeword and the field prediction flag for each MB. If one motion vector is transmitted for a certain MB, it is treated as four vectors with the same value. When two field motion vectors are transmitted, each of the four block prediction motion vectors has the value equal to the average of

the field motion vectors (rounded such that all fractional pixel offsets become half pixel offsets). If four vectors are used, each of the motion vectors is used for all pixels in one of the four luminance blocks in the MB.

Overlapped motion compensation is performed when the flag obmc_disable = 0. Each pixel in an 8 × 8 luminance prediction block is a weighted sum of three prediction values, divided by 8. The creation of each pixel $P(i, j)$ in an 8 × 8 luminance prediction block is governed by the following equation:

$$P(i,j) = \frac{p(i + MV_x^0,\, j + MV_y^0)\, H_0(i,j) + p(i + MV_x^1,\, j + MV_y^1)\, H_1(i,j) + p(i + MV_x^2,\, j + MV_y^2)\, H_2(i,j) + 4}{8},$$

where $(MV_x^0, MV_y^0)$ denotes the motion vector for the current block, $(MV_x^1, MV_y^1)$ the motion vector of the block above or below, $(MV_x^2, MV_y^2)$ the motion vector of the block to the left or to the right, and $H_0(i, j)$, $H_1(i, j)$, and $H_2(i, j)$ are the weighting values of each pixel in the current block and the neighboring blocks.

Since the VOP may be coded in P or B mode, there are three types of motion prediction, namely forward mode, backward mode, and bi-directional mode. The different modes make different predictions $\bar{P}(i, j)$ as follows.

1. Forward mode: Only the forward vector (MVFx, MVFy) is applied in this mode. The prediction blocks $\bar{P}_y(i, j)$, $\bar{P}_u(i, j)$, $\bar{P}_v(i, j)$ are generated from the forward reference VOP.
2. Backward mode: Only the backward vector (MVBx, MVBy) is applied. The prediction blocks $\bar{P}_y(i, j)$, $\bar{P}_u(i, j)$, $\bar{P}_v(i, j)$ are generated from the backward reference VOP.
3. Bi-directional mode: Both the forward vector (MVFx, MVFy) and the backward vector (MVBx, MVBy) are applied. The prediction blocks $\bar{P}_y(i, j)$, $\bar{P}_u(i, j)$, $\bar{P}_v(i, j)$ are generated from the forward and the backward reference VOPs by doing the forward and the backward predictions and then averaging both predictions pixel by pixel.
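The coordinate clamping used for unrestricted motion compensation and the pixel-by-pixel averaging of the bi-directional mode can be sketched in Python as below. This is an illustrative sketch under stated assumptions: the names `clamp_ref_coord`, `fetch_block`, and `average_blocks` are hypothetical, motion vectors are restricted to integer-pixel values, the reference VOP is a list of rows covering exactly its bounding rectangle, and the averaging rounds half up, which the text does not specify.

```python
def clamp_ref_coord(curr, delta, origin, dim):
    """min(max(curr + delta, origin), dim + origin - 1), as in the text.

    `origin` plays the role of vhmcsr or vvmcsr and `dim` of xdim or ydim.
    """
    return min(max(curr + delta, origin), dim + origin - 1)


def fetch_block(ref, x0, y0, size, mv, origin=(0, 0)):
    """Prediction block for an integer-pixel motion vector mv = (dx, dy).

    `ref` stores the bounding rectangle of the reference VOP, so the sample
    at absolute position (x, y) is ref[y - origin_y][x - origin_x].
    """
    dx, dy = mv
    ox, oy = origin
    ydim, xdim = len(ref), len(ref[0])
    block = []
    for y in range(y0, y0 + size):
        yref = clamp_ref_coord(y, dy, oy, ydim)
        row = []
        for x in range(x0, x0 + size):
            xref = clamp_ref_coord(x, dx, ox, xdim)
            row.append(ref[yref - oy][xref - ox])
        block.append(row)
    return block


def average_blocks(fwd, bwd):
    """Bi-directional prediction: average the two predictions per pixel.

    Assumption: integer average with half rounded up.
    """
    return [[(f + b + 1) // 2 for f, b in zip(f_row, b_row)]
            for f_row, b_row in zip(fwd, bwd)]
```

A vector that points past the edge of the bounding rectangle simply repeats the nearest border sample, which is why no explicit padding of the reference is needed in this sketch.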

2.2.4 Texture Coder

The texture information of a VOP is present in the luminance component Y and the two chrominance components Cb and Cr of the video signal. For an I-VOP, the texture information resides directly in the luminance and chrominance components. For a P-VOP or a B-VOP, however, the texture information represents the residual values remaining after motion-compensated prediction. The texture coder includes the padding process (for object-based coding, and applied only if needed), the 8 × 8 two-dimensional (2D) discrete cosine transform (DCT), quantization, coefficient prediction, coefficient scan, and variable length coding (VLC).

Padding Process

When the shape of the VOP is arbitrary, two types of MB exist: those that lie inside the VOP and those that lie on the boundary of the VOP. The MBs that lie completely inside the VOP are coded using a technique identical to that used in H.263. The MBs that lie on the boundary of the shape need to be padded before texture coding. For residual error blocks after motion compensation, the region outside the VOP within the blocks is padded with zeros. For intra blocks, the padding is performed in a three-step procedure called low pass extrapolation (LPE). This procedure is as follows:

1. Compute the arithmetic mean value $m$ of the pixels $f(i, j)$ in the block that belong to the VOP as

$$m = \frac{1}{N} \sum_{(i,j) \in \mathrm{VOP}} f(i,j),$$

where $N$ is the number of pixels situated within the VOP.
2. Assign $m$ to each block pixel situated outside of the VOP region.
3. Apply the following filtering operation to each block pixel $f(i, j)$ outside of the VOP region, in raster-scan order:

$$f(i,j) = \frac{f(i, j-1) + f(i-1, j) + f(i, j+1) + f(i+1, j)}{4}.$$
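The three LPE steps can be sketched in Python as follows. This is a minimal illustration rather than the reference implementation: the name `lpe_pad` is hypothetical, the mean is rounded to the nearest integer and the filter uses truncating integer division (both rounding choices are assumptions), and the filter is applied in place so that pixels visited earlier in raster-scan order influence later ones. Neighbors that fall outside the block are dropped and the divisor is reduced accordingly.

```python
def lpe_pad(block, alpha):
    """Low pass extrapolation padding of one intra block.

    `block` and `alpha` are lists of rows of equal size; a nonzero alpha
    marks a pixel that belongs to the VOP.
    """
    n_rows, n_cols = len(block), len(block[0])
    inside = [block[i][j]
              for i in range(n_rows) for j in range(n_cols) if alpha[i][j]]
    if not inside:
        raise ValueError("block contains no VOP pixels")

    # Step 1: mean of the pixels inside the VOP (rounding is an assumption).
    mean = (sum(inside) + len(inside) // 2) // len(inside)

    # Step 2: assign the mean to every pixel outside the VOP.
    out = [[block[i][j] if alpha[i][j] else mean for j in range(n_cols)]
           for i in range(n_rows)]

    # Step 3: filter the outside pixels in raster-scan order, in place.
    for i in range(n_rows):
        for j in range(n_cols):
            if alpha[i][j]:
                continue
            neighbors = [out[y][x]
                         for y, x in ((i, j - 1), (i - 1, j),
                                      (i, j + 1), (i + 1, j))
                         if 0 <= y < n_rows and 0 <= x < n_cols]
            out[i][j] = sum(neighbors) // len(neighbors)
    return out
```

Pixels inside the VOP are never modified; only the transparent region is smoothed toward its opaque neighbors.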

If one or more of the four pixels used for filtering are outside the block, the corresponding pixels are not included in the filtering operation and the divisor 4 is reduced accordingly.

Discrete Cosine Transform (DCT) Coding

Similar to MPEG-1 and MPEG-2, the transform coding in the MPEG-4 standard is based on the 2D 8 × 8 DCT. The encoder performs the forward transform before quantization, and performs the inverse transform after inverse quantization to reconstruct the VOP.

Quantization

MPEG-4 video supports two quantization techniques, one referred to as the H.263 quantization method and the other as the MPEG quantization method. The H.263 quantization method uses a dead zone for intra and inter AC coefficients and no dead zone for intra DC coefficients. The MPEG quantization method uses a uniform quantizer with the default matrices shown in Table 2.3.

Figure 2.12 shows the quantizer characteristics in H.263. It has uniform quantization for intra DC coefficients and nearly uniform midtread quantization for the inter DC and all AC coefficients. All coefficients in an MB go through the same quantizer step size Q, which can be changed in increments of 2 from 2 to 62 as desired.

Furthermore, in order to provide higher coding efficiency, a nonlinear scaler, shown in Table 2.4, is used for the DC coefficient of an 8 × 8 block in MPEG-4 video. Note that the characteristics of the nonlinear scaling differ between the luminance and chrominance blocks and depend on the quantizer used for the block.

Intra Prediction

When coding an intra block, the DC coefficients and many AC coefficients are coded by intra prediction. Intra prediction is an operation used in the MPEG-4 standard to reduce the spatial redundancy between 8 × 8 blocks.

DC prediction is illustrated in Fig. 2.13. The quantized intra coefficients are predicted from three previously decoded DC coefficients. For example, the DC coefficient of block X

Figure 2.12: Quantizers in H.263. (a) For intra DC coefficient only. (b) For inter DC and all AC coefficients.

Table 2.3: Default Quantization Matrix (Q) [5]

Intra:
 8  16  19  22  26  27  29  34
16  16  22  24  27  29  34  37
19  22  26  27  29  34  34  38
22  22  26  27  29  34  37  40
22  26  27  29  32  35  40  48
26  27  29  32  35  40  48  58
26  27  29  34  38  46  56  69
27  29  35  38  46  56  69  83

Inter:
16  16  16  16  16  16  16  16
16  16  16  16  16  16  16  16
16  16  16  16  16  16  16  16
16  16  16  16  16  16  16  16
16  16  16  16  16  16  16  16
16  16  16  16  16  16  16  16
16  16  16  16  16  16  16  16
16  16  16  16  16  16  16  16

Table 2.4: Nonlinear Scaler for DC Coefficients (from [5])

Component      DC scaler for Q range
               1–4    5–8           9–24          25–31
Luminance      8      2Q            Q + 8         2Q − 16
Chrominance    8      (Q + 13)/2    (Q + 13)/2    Q − 6
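The piecewise definition in Table 2.4 can be written as a small lookup function. The Python sketch below is illustrative only: the name `dc_scaler` is hypothetical, the division in (Q + 13)/2 is assumed to be an integer division, the chrominance entry (Q + 13)/2 is assumed to span the whole range 5–24, and the chrominance entry for the range 25–31 is assumed to be Q − 6, as in the MPEG-4 Visual standard, which also keeps the scaler continuous at Q = 25.

```python
def dc_scaler(q, is_luma):
    """Nonlinear DC scaler for quantizer parameter q in 1..31."""
    if not 1 <= q <= 31:
        raise ValueError("q must be in the range 1..31")
    if q <= 4:
        return 8                 # both components use 8 for Q = 1..4
    if is_luma:
        if q <= 8:
            return 2 * q         # Q = 5..8
        if q <= 24:
            return q + 8         # Q = 9..24
        return 2 * q - 16        # Q = 25..31
    if q <= 24:
        return (q + 13) // 2     # chrominance, Q = 5..24 (integer division assumed)
    return q - 6                 # chrominance, Q = 25..31 (assumed, see lead-in)
```

With these entries the scaler never decreases as Q grows, for both luminance and chrominance.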

is predicted from the DC coefficients of blocks A, B and C. Unlike MPEG-2, the method of prediction in MPEG-4 is gradient based. In computing the prediction of block X, if the absolute value of the horizontal gradient is less than the absolute value of the vertical gradient, then the quantized DC (QDC) value of block C is used as the prediction; otherwise the QDC value of block A is used.

Figure 2.13: Prediction of DC coefficients of blocks in an intra MB (from [6]).

The AC prediction depends on the DC prediction, as shown in Fig. 2.14. The AC coefficients in the first row or in the first column are predicted from three previously decoded AC coefficients. The direction of prediction is the same as that of the DC prediction.

Scan and VLC

Figure 2.15 shows three kinds of scan, alternate-horizontal, alternate-vertical and zigzag (the normal scan used in H.263 and MPEG-1), which scan the DC and AC coefficients and convert the 2D block data to 1D data. The actual scan used depends on the coefficient prediction method used. If the prediction direction is vertical, the alternate-horizontal scan is used for the current block. If the direction is horizontal, the alternate-vertical scan is selected for the current block. For all other blocks, the zigzag scan is used.

The coefficients after the scan usually form a sequence with many zeros at the end. This kind of data stream is well suited to run-length coding. In the MPEG-4 standard, differential DC coefficients in intra blocks are encoded by VLC. The AC coefficients, however, are encoded by the variable length codes for EVENTs, where an EVENT consists of a last non-zero coefficient indication (LAST), the number of successive zeros preceding the

coded coefficient (RUN), and the non-zero value of the coded coefficient (LEVEL). Some statistically rare events have no VLC words to represent them; for these, an escape coding method is used.

Figure 2.14: Prediction of AC coefficients of blocks in an intra MB (from [6]).

Figure 2.15: Scans for 8 × 8 blocks (from [5]).

2.2.5 Other Video Coding Tools [6]

In addition to texture video coding, there are some special tools defined in MPEG-4. In this section, we briefly introduce robust video coding and scalable coding.

Robust Video Coding

Error resilience is a particular concern over wireless networks. In the error resilient mode, MPEG-4 video offers a number of tools as follows:

1. Object priorities: The object-based organization of MPEG-4 video facilitates prioritizing the semantic objects based on their relevance. Further, the VOP types lend themselves to a form of automatic prioritization, since B-VOPs are noncausal and do not contribute to error propagation, and thus can be transmitted at a lower priority or discarded in case of severe errors.
2. Resynchronization: The encoder can enhance error resilience by placing resynchronization (resync) markers in the bitstream with approximately constant spacing, such as at the beginning of an MB.
3. Data partitioning: Data partitioning provides a mechanism to increase error resilience by separating the normal motion and texture data of all MBs in a video packet and sending all of the motion data followed by a motion marker, followed by all of the texture data.
4. Reversible VLCs: The reversible VLCs offer a mechanism for a decoder to recover additional texture data in the presence of errors, since their special design enables decoding of codewords in both the forward (normal) and the reverse directions.
5. Intra update and scalable coding: Intra update is a simple method to limit error propagation. However, more intra coding reduces the coding efficiency. Another method is scalable coding, which can prevent error propagation without additional intra coding.

Scalable Coding

The scalability tools in MPEG-4 video are designed to support applications beyond those supported by single-layer video, such as internet video, wireless video, multi-quality video services, and video database browsing. In scalable video coding, it is assumed that, given a coded bitstream, decoders of various complexities can decode and display appropriate reproductions of the coded video.

Several different forms of scalability are provided in MPEG-4 video. Temporal and spatial scalability are the most basic scalability tools among them. Fine Granularity Scalability (FGS), which supports continuous scalability of bit rate and video quality, is also defined.

2.3 Profiles and Levels [5]

Although there are many tools in the MPEG-4 standard, not every MPEG-4 decoder has to implement all of them. Similar to MPEG-2, profiles and levels are defined as subsets of the entire bitstream syntax of all the tools. The purpose of defining conformance points in the form of profiles and levels is to facilitate interchange of bitstreams among different applications.

There are eight profiles defined in MPEG-4: simple, core, main, simple scalable, animated 2D mesh, basic animated texture, still scalable texture, and simple face. The details are given in Table 2.5.

Compared with previous standards, the simple profile of MPEG-4 is similar to the coding method in H.263. The difference is that the simple profile has error resilience but does not have B-frame coding. The simple scalable profile is the same as the simple profile, but with rectangular scalability added. The core profile contains all tools of the simple profile plus temporal scalability, B-VOP coding, and binary shape coding. The main profile contains all tools of the core profile plus gray shape coding, interlace, and sprite coding. The other profiles are for particular purposes, such as 2D dynamic mesh coding and facial animation coding.

Table 2.5: Profiles and Tools in MPEG-4 Video (from [5])

Profiles (columns): Simple; Core; Main; Simple Scalable; Animated 2D Mesh; Basic Animated Texture; Still Scalable Texture; Simple Face.

Tools (rows):
• Basic: 1. I-VOP; 2. P-VOP; 3. AC/DC prediction; 4. 4MV, unrestricted MV
• Error resilience: 1. Slice resynchronization; 2. Data partitioning; 3. Reversible VLC
• Short header
• B-VOP
• Method 1/Method 2 quantization
• P-VOP based temporal scalability: 1. Rectangular; 2. Arbitrary shape
• Binary shape
• Gray shape
• Interlace
• Sprite
• Temporal scalability (rectangular)
• Spatial scalability (rectangular)
• Scalable still texture
• 2D dynamic mesh with uniform topology
• 2D dynamic mesh with Delaunay topology
• Facial animation parameters

Chapter 3 Overview of PACDSP

The contents of this chapter have been taken to a large extent from [3]–[4].

3.1 Introduction

Programmable embedded solutions are attractive for their lower development effort, upgradeability to support new applications, and easier maintenance. These factors reduce time-to-market and extend time-in-market, and thus make good economic sense. Today's media processing demands extremely high computation rates under real-time constraints in audio, image, and video applications. Instruction-level parallelism has been exploited to speed up high-performance microprocessors. VLIW machines rely on low-cost compiler scheduling with deterministic execution time, and have thus become the trend for high-performance DSP processors.

Conventional VLIW processors are notorious for their poor code density, because the unused instruction slots must be filled with NOPs. Variable-length VLIW instruction packets eliminate NOPs through run-time instruction dispatch, in contrast to conventional position-coded VLIW processors, where each functional unit (FU) has a corresponding bit-field in the instruction packet. An indirect VLIW processor has an internal instruction buffer for the VLIW instruction packets. With this instruction buffer and a pre-fetch scheme, the VLIW processor can reduce the instruction memory bandwidth requirement and the power consumption of instruction fetches.

The complexity of the register file (RF) grows exponentially as more and more FUs are integrated on a chip and operate concurrently to achieve the performance requirements. Thus the RF is frequently partitioned into execution clusters, with explicit interconnection networks among the clusters, to significantly reduce the complexity at the cost of a small performance penalty.

For high performance, the PACDSP is a VLIW processor with a single instruction multiple data (SIMD) instruction set architecture (ISA). Software-supported scheduling reduces hardware complexity and power consumption. Variable-length instructions and instruction packets solve the poor code density problem of the conventional VLIW architecture. Another feature of the PACDSP, the cluster architecture, reduces not only the ports and entries of the register files but also the power consumption of read/write operations.

Key features of the PACDSP include the following items:
• Scalable VLIW datapath for easy extension of the computing power.
• Variable instruction word/packet length to compensate for the poor code density of the VLIW architecture.
• Heterogeneous register files for more straightforward operations, fewer ports, and fewer entries in each RF, which improve the performance and reduce power and area.
• Constant register file in each cluster for storing fixed data used in the applications, which reduces the frequency of data movement that may cost significant power.
• Inter-cluster communication through the memory controller, which reuses hardware resources and reduces the number of ports of the ping-pong RF in order to reduce power and area and to increase scalability.
• Optimized interrupt design with fast interrupt response time and hardware-supported context switch to reduce the processing time of the interrupt service routine (ISR).

• Hierarchical encoding scheme that reduces the dependency between instructions and packets, in order to reduce the area and latency of the dispatch unit.
• Dynamic power management for power saving.
• Customized FU interface that can be used to enhance DSP functionalities.

The architecture of the PACDSP v3.0 is shown in Fig. 3.1. The following sections briefly introduce its pipeline stages and its core elements, including the Program Sequence Control Unit (PSCU), the Scalar Unit, the Clusters (VLIW datapath), and the Customized Function Unit (CFU). Accelerators that execute in different threads and synchronize their execution results through the scalar unit can enhance the computation power of the VLIW datapath.

3.2 ISA and Pipeline Stages

There are three major divisions in the PACDSP instruction set architecture (ISA): Program Sequence Control, Scalar, and VLIW Data Path. In each division, the instructions are divided into categories by function unit. Figure 3.2 depicts the ISA of the PACDSP.

Figure 3.3 shows the pipeline stages of the PACDSP. The operation of the program sequence control unit is divided into four stages, namely IF, IMEM, IDP, and ID. The operations of the scalar unit and the VLIW datapath are both divided into five stages, namely RO, EX1, EX2, EX3, and WB. The job of each pipeline stage is shown in Table 3.1.

3.3 Program Sequence Control Unit

The program sequence control unit (PSCU) is a main component of the DSP kernel. Basically, we can regard it as the combination of the control path and the instruction path. The control path handles program counter updating, address fetch, pipeline control, hardware context shadowing, interrupt handling, exception handling, etc., according to the input control signals from elsewhere in the PACDSP. The instruction path is responsible for fetching, dispatching, and decoding the instruction packets.

Figure 3.1: Architecture of the PACDSP [2].

Figure 3.2: PACDSP instruction set architecture [4].

Figure 3.3: Pipeline stages of the PACDSP [4].

Table 3.1: Pipeline Stages and Their Jobs

Stage   Job
IF      Instruction Fetch
IMEM    Instruction Memory Access
IDP     Instruction Dispatch
ID      Instruction Decode
RO      Read Operand
EX1     Execution One
EX2     Execution Two
EX3     Execution Three
WB      Write Back

3.3.1 Branch Instructions

Branch instructions can be grouped into two categories, conditional branches and unconditional branches. There are three addressing modes defined in the PACDSP v3.0 for generating the branch target address:
• PC-relative: Add a signed immediate offset of up to 32 bits to the address in the PC register, and take the result as the branch target address, i.e., TA = PC + OFFSET, where TA is the target address, PC is the address in the program counter, and OFFSET is the immediate value defined in the branch instruction.

• Register: Take the value in the register as the target address, i.e., TA = Rs, where TA is the target address and Rs is the source register defined in the branch instruction.
• Register-relative: Add a signed immediate offset of up to 32 bits to the address saved in the register and take the result as the branch target address, i.e., TA = Rs + OFFSET, where TA is the target address, Rs is the source register defined in the branch instruction, and OFFSET is the immediate value defined in the branch instruction.

In some circumstances, a branch operation may need to save the return address to ensure that the program works correctly when it returns. The branch instructions defined in the PACDSP support saving the return address into an assigned register. The programmer should take care of the return addresses of nested loops.

There are five branch delay slots in the PACDSP, and the programmer can put branch-independent instructions in the delay slots. There are some constraints on the instructions in the delay slots. Reference [4] gives the details of the programming constraints.

3.3.2 Loops

The programmer can use the LBCB or B instruction to describe program loops. LBCB is similar to a branch, but instead of checking a predicate register (P0–P15), LBCB checks a general purpose register (R0–R15) to decide whether to branch or not. Since there are 16 general purpose registers (R0–R15), up to 16 levels of nested loops can be supported with the use of the LBCB instruction.

There is a constraint in using LBCB to control a nested loop: the outer loop should fully contain the inner loop. No exception is generated if the constraint is violated, but the program behavior may differ from what is expected. However, conditional branches can be used inside the nested loop to implement some special branch behaviors of higher-level languages, for example, "break" and "continue" in C.

3.3.3 Customized Function Units (CFUs)

The PACDSP provides a Customized Function Unit Interface for extension. The user can attach co-processors or customized function units to the PACDSP and control them through the scalar instructions. If an error occurs in a customized function unit, the unit can inform the PACDSP, and the PACDSP can process the error based on the particular configuration. If the given work has finished successfully, the PACDSP can use its results and continue to work. It is recommended that communication with a co-processor be made through this interface; otherwise the user will have to spend much more effort to handle it.

3.3.4 Exception Handling

Unpredictable exceptions may occur during program execution. The exceptions need to be handled correctly to obtain correct execution results. Exceptions may be caused by hardware (e.g., overflow), software, internal events (e.g., undefined instruction), or external events (e.g., coprocessor exception).

When an exception happens, whether the PACDSP is running a program or not, the PACDSP checks the mask information. If the exception is masked, the PACDSP ignores the exception and returns to normal execution. If the exception is unmasked, it is taken. The PACDSP freezes its pipeline, finishes the instructions before the PC that introduced the exception, and recovers the states for consistency. After the state is recovered, the PACDSP issues the exception handling ISR to inform the MPU and the Embedded ICE, and waits for commands to resolve the exception.

3.3.5 Interrupt Handling

Two types of interrupt are supported by the PACDSP. One is the fast interrupt request (FIQ), which has the higher priority, and the other is the interrupt request (IRQ). The difference between them is that the FIQ has a fixed ISR address, whereas the IRQ needs the ISR to check the IRQ source to obtain the proper ISR address. In the PACDSP, the minimum latency from an interrupt request to the execution of the first ISR instruction is 4 cycles for both types of interrupt, and it may be longer when the ISR experiences a cache miss.

3.4 Scalar Unit

The scalar unit plays an important role in handling control-based tasks for the PACDSP. It also has a simple capacity for data computing. Thus, the scalar unit is like a RISC machine. Programmers can exploit the computing capacity of the scalar unit to increase the overall instruction-level parallelism (ILP) in compute-based tasks. The scalar unit mainly consists of one adder, one down-counter, one comparator, one shifter, and one logical ALU. The scalar unit has four major functions as follows:
• Program flow control function.
• Data processing function.
• Memory access function.
• Data transfer function.

3.4.1 General Purpose Scalar Register File

In the scalar unit of the PACDSP kernel, there are sixteen 32-bit general purpose registers named R0 to R15. These registers are viewed as the loop boundary counter, the timer, and the address register in the LBCB, WAIT, and Branch/Load/Store instructions, respectively. In other instructions, they are viewed as data registers.

3.4.2 System Register and Predication Register

There are 16 system registers, named SR0 to SR15, in the PACDSP. Table 3.2 shows the names, the widths, and the meanings of all the system registers. Note that each bit in SR0 is used as a predication register; the bits are named P0 to P15, where the value of P0 is always true. Most instructions of the PACDSP can be executed conditionally according to the values of the predication registers.

3.5 VLIW Datapath

As shown in Fig. 3.4, the VLIW datapath of the PACDSP is constructed with distributed register files: ping-pong registers, accumulator registers, address registers, constant registers, and some control flags. If an instruction must write into two consecutive destination registers, as with DLW and FMUL.D, the destination register number has to be even because of the banked structure.

The VLIW datapath of the PACDSP is constructed in two clusters, each containing an arithmetic unit (AU) and a load/store unit (L/S), as shown in Fig. 3.5. Therefore, it can execute four instructions simultaneously, and is thus called a four-way VLIW datapath.

The VLIW datapath supports SIMD (Single Instruction Multiple Data) operation. It executes in three modes: Single (32-bit or 40-bit), Dual (16-bit), and Quad (8-bit). There are also three types of precision in the datapath of the PACDSP: Full, Integer, and Fractional.

Arithmetic Unit (AU)

The arithmetic unit comprises 40-bit modules, which are divided according to function. The function types supported by the AU are shown below:
• Arithmetic and comparison instructions.
• Data transfer instructions.
• Bit manipulation instructions.

• Multiplication and accumulation instructions.
• Special instructions.

All data processing instructions in the AU begin at the same stage but do not finish at the same time, owing to their different computing complexity.

Table 3.2: System Register File [2]

No.    Name           Size (bits)   Note
SR0    PREDN          16            Predication information
SR1    EN INT         1             Interrupt enable flag
SR2    MSK EXC        16            Mask inside exception
SR3    SWI EXC        16            Software exception
SR4    CF0            32            Custom function register 0
SR5    CF1            32            Custom function register 1
SR6    CF2            32            Custom function register 2
SR7    CF3            32            Custom function register 3
SR8    SD Status      8             Mix information 0's shadow register
SR9    SD CPC         32            CPC's shadow register (ISR return address)
SR10   SD BCTG        32            Branch target's shadow register
SR11   SD R0          32            R0's shadow register
SR12   Mode           4             Power mode register
SR13   CFU Info Sel   4             CFU Info select register
SR14   EXC Cause      16            Exception cause
SR15   Reserved       32            N.A.

Load/Store Unit (L/S)

The load/store unit (L/S) comprises 32-bit modules, except for one 16-bit address generation unit (AGU) which is used to support the different addressing modes. The function types supported by the L/S unit are as follows:
• Arithmetic and comparison instructions.

• Data transfer instructions.
• Bit manipulation instructions.
• Load and store instructions.
• Special instructions.

Figure 3.4: The VLIW datapath register organization [2].

As in the AU, all instructions in the L/S unit begin at the same stage but do not finish at the same time, owing to their different computing complexity. The L/S unit supports powerful double load/store instructions, which can load or store two operands in one instruction. It also supports instructions that load and store by bytes or half-words. These instructions make memory access easier and more convenient.

3.5.1 Ping-Pong Register File

The ping-pong register file contains sixteen 32-bit registers, which are divided into two groups: D0–D7 and D8–D15. The AU and the L/S unit can access the ping-pong register file at the same time, but they have to access different groups. In other words, both units

(54) cannot read or write the same group simultaneously. All possible access conditions are as follows:
• L/S reads D0–D7 and writes D0–D7, and AU reads D8–D15 and writes D8–D15.
• L/S reads D0–D7 and writes D8–D15, and AU reads D8–D15 and writes D0–D7.
• L/S reads D8–D15 and writes D0–D7, and AU reads D0–D7 and writes D8–D15.
• L/S reads D8–D15 and writes D8–D15, and AU reads D0–D7 and writes D0–D7.

Figure 3.5: The four-way VLIW datapath of PACDSP [2].

3.5.2 Address/Accumulator Registers
As shown in Fig. 3.4, the address registers (A0–A7) are all 32 bits wide and are dedicated to the load/store (L/S) unit for memory accesses. PACDSP supports several addressing modes. In modulo addressing mode, A0 and A2 serve as pointers, A1 and A3 hold base addresses, A4 and A6 hold the end address plus one, and A5 and A7 hold displacements. PACDSP can therefore support two groups of modulo addressing: (A0, A1, A4, A5) and (A2, A3, A6, A7). In the other addressing modes, these registers can be used for address storage or for data processing storage, as the user's design requires.
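The ping-pong access rule of Section 3.5.1 reduces to a simple check: the two units must read from different groups and must write to different groups. The following C sketch illustrates this under the assumption that each unit performs one read and one write per cycle. The type and function names (pp_group, pp_group_of, pp_access_is_legal) are hypothetical and are not part of any PACDSP tool.

```c
#include <stdbool.h>

/* The two ping-pong groups named in Section 3.5.1. */
typedef enum { GROUP_D0_D7 = 0, GROUP_D8_D15 = 1 } pp_group;

/* Map a ping-pong register index (assumed to be 0..15) to its group. */
pp_group pp_group_of(unsigned reg_index)
{
    return reg_index < 8u ? GROUP_D0_D7 : GROUP_D8_D15;
}

/* Hypothetical checker: true when the L/S unit and the AU never
 * read the same group and never write the same group in one cycle. */
bool pp_access_is_legal(pp_group ls_read, pp_group ls_write,
                        pp_group au_read, pp_group au_write)
{
    return ls_read != au_read && ls_write != au_write;
}
```

Of the sixteen possible combinations of read and write groups, this check accepts exactly the four access conditions listed in Section 3.5.1.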

(55) The accumulator registers (AC0–AC7) are 40-bit registers dedicated to the arithmetic unit (AU) for data manipulation. The most significant eight bits are guard bits for accumulation operations.

3.5.3 Constant Registers
To avoid frequent data movement in the register file, PACDSP provides a small constant register file to hold fixed data. The constant register file has eight 32-bit registers (C0–C7). A constant register can be read as either the first or the second source operand of an instruction, but a single instruction cannot take both of its source operands from the constant register file. The constant register file can be read by both the AU and the L/S unit but can be written only by the L/S unit. All accesses to the constant register file must go through the control flags CF0 and CF1, which are pointers to the constant registers. These pointers are calculated from the values contained in CF2 and CF3, which are the contents of the pointers.

3.5.4 Status and Control Registers
The status and control registers are used to monitor the DSP kernel status and to control the operation mode of the DSP kernel.

Program Status Register
The program status register records the operation status of each cluster and of the scalar unit. It includes the Overflow, Negative, and Carry bits. Instructions can read the status register but cannot set it.

Addressing Mode Control Register (AMCR)
PACDSP supports several addressing modes. The addressing mode control register (AMCR) is a 16-bit register used to set the addressing mode for each address register. The addressing modes are related to where the operands are to be

(56) found and how the address calculations are to be made. The definitions are shown in Table 3.3.

3.5.5 Addressing Modes
The addressing modes determine where the operands are found and how the address calculations are made. PACDSP supports Linear Addressing Mode, Bit-Reverse Addressing Mode, and Modulo Addressing Mode for memory access. The mode is selected by setting the AMCR. Table 3.4 shows the syntax of the addressing modes that can be used and the units that support them. Fig. 3.6 shows that A0–A7 form the address register file and that they are classified into even and odd banks in linear and bit-reverse addressing modes. Some addressing modes use two address registers, RsA and RsB, at the same time. They must be consecutive registers, with RsA in the even bank and RsB in the odd bank.

Linear Addressing Mode
• Offset by immediate (RsA, displacement)
  The operand address is the sum of the contents of the address register (RsA) and the displacement (up to a 24-bit signed integer, but the value range depends on the implementation of the data memory).
• Offset by register (RsA, RsB)
  The operand address is the sum of the contents of the address register (RsA) and the contents of the address register (RsB).

Table 3.3: Definitions of AMCR (from [2])

AM[1]   AM[0]   Addressing Mode
0       0       Linear
0       1       Bit-reversed
1       0       Modulo
1       1       Reserved
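As an illustration of Table 3.3, the following C sketch reads and writes the two-bit mode field of one address register in a 16-bit AMCR value. The source states only that the AMCR is 16 bits wide and sets the mode of each address register. The bit layout used here (the field of An in bits 2n+1 and 2n) is an assumption made for this example, and the names addr_mode, amcr_get_mode, and amcr_set_mode are hypothetical.

```c
#include <stdint.h>

/* AM[1:0] encodings from Table 3.3. */
typedef enum {
    AM_LINEAR      = 0, /* AM[1:0] = 00 */
    AM_BIT_REVERSE = 1, /* AM[1:0] = 01 */
    AM_MODULO      = 2, /* AM[1:0] = 10 */
    AM_ENCODING_11 = 3  /* AM[1:0] = 11, last row of Table 3.3 */
} addr_mode;

/* ASSUMPTION: the field of address register An sits in bits [2n+1:2n]. */
addr_mode amcr_get_mode(uint16_t amcr, unsigned n)
{
    unsigned shift = 2u * (n & 7u);
    return (addr_mode)((amcr >> shift) & 0x3u);
}

/* Return a copy of amcr with the field of An replaced by mode. */
uint16_t amcr_set_mode(uint16_t amcr, unsigned n, addr_mode mode)
{
    unsigned shift = 2u * (n & 7u);
    uint16_t cleared = (uint16_t)(amcr & ~(0x3u << shift));
    return (uint16_t)(cleared | (((unsigned)mode & 0x3u) << shift));
}
```

Under this assumed layout, eight two-bit fields fill the 16-bit register exactly, which is consistent with the eight address registers A0–A7.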

(57) Table 3.4: Syntax of Address Modes and Supporting Units [3]

                                                          Support Unit
Addressing Mode                   Syntax                  Scalar   Cluster
1. Linear
   Offset by Immediate            RsA, displacement       V        V
   Offset by Register             RsA, RsB                V        V
   Post-increment by Immediate    RsA, displacement+      V        V
   Post-increment by Register     RsA, RsB+               V        V
2. Modulo
   Post-increment by Register     RsA, RsB+               -        V
   Post-increment by Immediate    RsA, displacement+      -        V
3. Bit-Reverse
   Post-increment by Immediate    RsA, displacement+      -        V
   Post-increment by Register     RsA, RsB+               -        V

Figure 3.6: Address register file [2].

• Post-increment by immediate (RsA, displacement+)
  The operand address is in the address register RsA. After the operand address is used, it is incremented by the displacement (up to a 24-bit signed integer, but the value range depends on the implementation of the data memory) and stored in the same address register.
• Post-increment by register (RsA, RsB+)
  The operand address is in the address register RsA. After the operand address is used, it is incremented by the contents of the register RsB and stored in the same address register.
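The four linear addressing forms can be summarized by how each one produces the operand address and whether it updates RsA. The C sketch below models the address register file as a plain array. The names addr_regs, ea_offset_imm, ea_offset_reg, ea_postinc_imm, and ea_postinc_reg are hypothetical, and the sketch does not model the 24-bit displacement limit or the even/odd bank constraint.

```c
#include <stdint.h>

/* Hypothetical model of the address register file A0..A7. */
typedef struct { uint32_t a[8]; } addr_regs;

/* (RsA, displacement): address = RsA + displacement, RsA unchanged. */
uint32_t ea_offset_imm(const addr_regs *r, unsigned rsa, int32_t disp)
{
    return r->a[rsa] + (uint32_t)disp;
}

/* (RsA, RsB): address = RsA + RsB, both registers unchanged. */
uint32_t ea_offset_reg(const addr_regs *r, unsigned rsa, unsigned rsb)
{
    return r->a[rsa] + r->a[rsb];
}

/* (RsA, displacement+): address = RsA, then RsA += displacement. */
uint32_t ea_postinc_imm(addr_regs *r, unsigned rsa, int32_t disp)
{
    uint32_t ea = r->a[rsa];
    r->a[rsa] = ea + (uint32_t)disp;
    return ea;
}

/* (RsA, RsB+): address = RsA, then RsA += RsB. */
uint32_t ea_postinc_reg(addr_regs *r, unsigned rsa, unsigned rsb)
{
    uint32_t ea = r->a[rsa];
    r->a[rsa] = ea + r->a[rsb];
    return ea;
}
```

The offset forms leave the address register untouched, while the post-increment forms return the old value as the operand address and store the updated value back, which suits sequential accesses in a loop.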

(58) Bit-Reverse Addressing Mode
Bit-reverse addressing mode is also called reverse-carry addressing mode. It is useful for 2^k-point FFT addressing. This mode is selected by setting the corresponding bits in the AMCR, and address modification is performed in hardware by propagating the carry from each pair of added bits in the reverse direction (from the MSB end toward the LSB end). It supports only post-increment by immediate and post-increment by register. This address modification is useful for addressing the twiddle factors in a 2^k-point FFT as well as for unscrambling 2^k-point FFT data.

Modulo Addressing Mode
Modulo address modification is useful for creating circular buffers for FIFO queues, delay lines, and sample buffers. This addressing mode supports only post-increment by immediate and post-increment by register. The definition of modulo addressing, using a base register (Bn) and an end register (En), enables the programmer to locate the modulo buffer at any address. The current address register, An, can initially point anywhere (aligned to its access width) within the defined modulo address range, Bn ≤ An < En. Modulo addressing is selected by configuring the corresponding bits in the AMCR. The range of modulo registers is from 1 to 2^16 − 1.

3.5.6 Data Communication
PACDSP provides a fast data communication mechanism among the scalar unit and the two clusters. As shown in Fig. 3.7, it provides a data exchange mechanism between any two of the scalar unit and the two clusters. Fig. 3.8 shows that it also provides data broadcast, which lets any one of them send its data to the others. This is accomplished through the ports of the memory interface unit (MIU), because the MIU is connected to all register files of the scalar unit and the two clusters. The transfer needs a latency of only one instruction.
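The address updates of the bit-reverse and modulo addressing modes in Section 3.5.5 can be sketched in C as follows. This is a behavioral illustration only, not the PACDSP hardware definition. The function names are hypothetical. The reverse-carry add is assumed to act on a k-bit index within a 2^k-point buffer, and the modulo update is assumed to wrap by the buffer length En − Bn, with En > Bn.

```c
#include <stdint.h>

/* Reverse the low k bits of x (k <= 32). */
uint32_t bit_reverse(uint32_t x, unsigned k)
{
    uint32_t r = 0;
    for (unsigned i = 0; i < k; ++i) {
        r = (r << 1) | (x & 1u);
        x >>= 1;
    }
    return r;
}

/* Reverse-carry add on a k-bit index: the carry travels from the
 * MSB end toward the LSB end. Modeled as reverse, add, reverse. */
uint32_t reverse_carry_add(uint32_t index, uint32_t step, unsigned k)
{
    uint32_t mask = (k >= 32u) ? 0xFFFFFFFFu : ((1u << k) - 1u);
    uint32_t sum = (bit_reverse(index & mask, k) +
                    bit_reverse(step & mask, k)) & mask;
    return bit_reverse(sum, k);
}

/* Modulo post-increment: returns the address used for the access and
 * updates *an so that bn <= *an < en still holds afterwards.
 * ASSUMPTION: en > bn, and wrapping is by the length en - bn. */
uint32_t modulo_postinc(uint32_t *an, uint32_t bn, uint32_t en, int32_t step)
{
    uint32_t ea = *an;
    int64_t len = (int64_t)en - (int64_t)bn;
    int64_t off = ((int64_t)ea - (int64_t)bn + (int64_t)step) % len;
    if (off < 0) {
        off += len; /* C's % keeps the sign of the dividend */
    }
    *an = bn + (uint32_t)off;
    return ea;
}
```

For an 8-point buffer (k = 3), stepping by N/2 = 4 from index 0 visits 0, 4, 2, 6, 1, 5, 3, 7, which is the bit-reversed order used to unscramble FFT data. In the modulo case, a pointer at the last element of a 16-byte buffer returns to the base address after a step of 4.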