Real-Time Implementation of MPEG-4 Video Encoder Using SIMD-Enhanced Intel Processor


National Chiao Tung University
College of Electrical Engineering and Computer Science
Degree Program of Electronics and Electro-Optical Engineering
Master's Thesis

Real-Time Implementation of MPEG-4 Video Encoder Using SIMD-Enhanced Intel Processor

Student: Meng-Yuan Liu (劉夢遠)
Advisor: Dr. David W. Lin (林大衛)

A Thesis Submitted to Degree Program of Electrical Engineering Computer Science, College of Electrical Engineering and Computer Science, National Chiao Tung University, in Partial Fulfillment of the Requirements for the Degree of Master of Science in Electronics and Electro-Optical Engineering

July 2004
Hsinchu, Taiwan, Republic of China

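Abstract (Chinese)

MPEG-4 provides a number of new architectures and tools for achieving video coding with high compression ratios. In this thesis, we use a SIMD-enhanced Intel processor to implement real-time MPEG-4 video encoding, relying mainly on Intel MMX technology, including SSE and SSE2. MMX is an extension that Intel added to Intel Architecture (IA) microprocessors to strengthen multimedia and communications processing; it adopts a single-instruction, multiple-data (SIMD) architecture to process data operations in parallel. For the implementation, we modify a publicly available program, the Microsoft MPEG-4 Visual Reference Software, to achieve real-time MPEG-4 video encoding.

Because MPEG-4 compression requires a very large amount of computation, real-time compression and decompression require fast hardware together with efficient software. Parallel processing is an effective way to handle this computational load. Simply put, parallel processing turns tasks that would otherwise queue up and be processed sequentially into several tasks that are computed simultaneously and independently, which speeds up the work. This thesis rewrites part of the core code of the Microsoft MPEG-4 Visual Reference Software with Intel's MMX instruction set to increase the degree of parallelism and thereby speed up execution. In tests on an Intel Pentium 4 2.66 GHz CPU with 480 MB of RAM running Microsoft Windows XP Professional, using MMX technology together with other algorithms to compress the CIF foreman test sequence with shape information reaches 30 frames per second, about 6 times the speed of the original program.

In this thesis, we first briefly introduce the MPEG-4 system architecture and Intel's MMX technology and instructions. We then describe in detail the improvements and accelerations made in the MPEG-4 implementation, compare the accelerated program with the original, and discuss its advantages and disadvantages. Finally, we give a conclusion and propose topics for future development.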

Abstract

The MPEG-4 standard is a very efficient coding standard for multimedia data defined by ISO/IEC MPEG. In this thesis, we use a SIMD-enhanced Intel processor for MPEG-4 video encoding, with the goal of real-time coding. The main technology is Intel's MMX, including SSE and SSE2. MMX technology is an extension to Intel Architecture (IA) processors that supports a single-instruction, multiple-data (SIMD) execution model designed to accelerate advanced media and communications applications. We use the public-domain Microsoft MPEG-4 Visual Reference Software to establish an MPEG-4 coding and decoding system.

Real-time MPEG-4 video compression and decompression require high-speed hardware and efficient software, and parallel processing is a practical way to handle the heavy computation in MPEG-4 encoding and decoding. Parallel processing means letting several independent operations or tasks run simultaneously, which speeds up the overall processing. In this thesis, we modify some kernels of the Microsoft MPEG-4 Visual Reference Software using Intel's MMX technology to increase parallelism and speed up encoding. After optimization, we can encode the CIF foreman test sequence with shape information at up to 30 frames per second on our test system, which is based on an Intel Pentium 4 2.66 GHz CPU with 480 MB of RAM and Microsoft Windows XP Professional Version 2002. This is approximately 6 times as fast as the original reference software.

In the thesis, we first introduce MPEG-4 and Intel's MMX technology. We then discuss the optimization of the MPEG-4 video encoder using Intel's MMX technology. We also present experimental results on the speed and the rate-distortion performance of the optimized code. Finally, we give a conclusion and point out some subjects for potential future work.

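Acknowledgments

I am most grateful to my advisor, Dr. David W. Lin (林大衛), for making this thesis possible. For a part-time graduate student with a job, balancing research, study, and work is not easy, but he gave me help and encouragement in many ways, so that my research went smoothly and I was able to complete this thesis while working. Besides thorough training in professional knowledge, he cultivated in me an earnest and down-to-earth attitude toward research during the implementation work, from which I benefited greatly.

The laboratory's well-equipped resources allowed me to overcome many difficulties in the implementation. I thank my senior labmate 詹益鎬 for generously sharing his research experience and suggestions, from which I benefited a great deal. I also thank 岳賢, 沛昀, 彥福, 建興, 宗書, 崑健, 建統, and all my other labmates; their encouragement and help let me keep a happy graduate student life while working.

Finally, I especially thank my parents, my younger brother, and my beloved wife 惠貞. My mother and my wife in particular took such attentive care of the family that I could devote myself to research, study, and work without worry. They also gave me moral support and encouragement and accompanied me through every difficulty. I dedicate this thesis to everyone who has cared for and helped me.

Meng-Yuan Liu (劉夢遠)
July 2004, Hsinchu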

Contents

1 Introduction
2 Overview of MPEG-4
    2.1 Organization of the MPEG-4 Standard
    2.2 MPEG-4 Video Coding Overview (from [3])
        2.2.1 Structure of MPEG-4 Video Data
    2.3 MPEG-4 Video Texture Coding (from [5], [6] and [7])
        2.3.1 VOP Formation
        2.3.2 Shape Coding
        2.3.3 Motion Coder
        2.3.4 Texture Coder
    2.4 Other Video Coding Tools (from [5]) and Profiles and Levels (from [3])
        2.4.1 Other Video Coding Tools
        2.4.2 Profiles and Levels
3 Intel's MMX Technology and Tools for Software Optimization
    3.1 Intel's MMX Technology (from [8], [9] and [10])
        3.1.1 MMX Technology Overview
        3.1.2 MMX Instruction Sets Introduction
        3.1.3 SSE and SSE2, Later Extensions of MMX Technology
    3.2 Software Tools for Implementation
        3.2.1 Intel C++ Compiler (from [11])
        3.2.2 Intel VTune (from [12])
4 MPEG-4 Video Encoder Optimization by Intel MMX Technology
    4.1 Introduction to Microsoft MPEG-4 Visual Reference Software
    4.2 Code Acceleration
        4.2.1 Motion Estimation Optimization
        4.2.2 Motion Estimation Optimization Using Fast Motion Search
        4.2.3 VOP Formation Optimization
        4.2.4 DCT and IDCT Optimization
        4.2.5 Motion Compensation Optimization
        4.2.6 Quantization Optimization
    4.3 Conclusion in Optimization
5 Experimental Results
    5.1 Encoding Speed Performance
        5.1.1 Frame Based Coding
        5.1.2 Shape Based Coding
    5.2 Rate-Distortion (R-D) Performance
        5.2.1 Frame Based Coding
        5.2.2 Shape Based Coding
6 Conclusion and Future Work

List of Figures

2.1 A high level view of an MPEG-4 terminal (from [5]).
2.2 Segmentation of a picture into VOPs (from [5]).
2.3 Logical structure of coded video data (from [7]).
2.4 Types of VOP.
2.5 Positions of luminance and chrominance samples in 4:2:0 data (from [6]).
2.6 High level structure of VO based encoder (from [5]).
2.7 Detailed structure of VO encoder (from [5]).
2.8 CR determination algorithm (from [6]).
2.9 Pixel templates used for (a) INTRA and (b) INTER context determination of BAB. The pixel to be coded is marked with "?".
2.10 Gray shape coding (from [6]).
2.11 Padding process (from [6]).
2.12 Priority of boundary MBs surrounding an exterior MB (from [6]).
2.13 Polygon matching for an arbitrary shape VOP (from [6]).
2.14 Interpolation scheme for half sample search.
2.15 Motion vector prediction (from [6]).
2.16 Quantizers in H.263. (a) For intra DC coefficient only. (b) For inter DC and all AC coefficients.
2.17 Prediction of DC coefficients of blocks in an intra MB (from [5]).
2.18 Prediction of AC coefficients of blocks in an intra MB (from [5]).
2.19 Scans for 8 × 8 blocks (from [3]).
3.1 MMX execution environment.
3.2 MMX packed data types (from [8]).
3.3 MMX register set.
3.4 SIMD execution model (from [9]).
3.5 PACKSSDW instruction operation using 64-bit operands (from [10]).
3.6 PUNPCKLBW instruction operation using 64-bit operands (from [10]).
3.7 SSE execution environment (from [9]).
3.8 Performance tuning methodology (from [12]).
4.1 Breakdown of execution time in Microsoft MPEG-4 Visual Reference Software.
4.2 Code segment of hotspots of blkmatch16.
4.3 Revised code segment of SAD kernel of integer pixel motion search.
4.4 PSADBW instruction operation using 64-bit operands (from [10]).
4.5 Original code segment of the SAD kernel of half pixel motion search.
4.6 Revised code segment of the SAD kernel of half pixel motion search.
4.7 Code segment of hotspots of blkmatch16WithShape function.
4.8 Revised code segment of integer pixel SAD kernel of blkmatch16WithShape function.
4.9 The method of 2D logarithmic search (from [13]).
4.10 The method of diamond search (from [16]).
4.11 The method of new diamond search (from [15]).
4.12 VOP formation (from [6]).
4.13 Code segment of hotspots of findBestBoundingBox function.
4.14 Revised code segment of hotspots of findBestBoundingBox function.
4.15 Revised code segment of DCT.
4.16 Code segment of hotspots of motionCompEncY function.
4.17 Revised code segment of hotspots of motionCompEncY function.
4.18 Code segment of abs using SSE2.
4.19 Comparison between original reference software and optimized code in execution time for motion estimation.
4.20 Comparison between original reference software and optimized code in execution time for other encoder blocks.
5.1 R-D performance in coding akiyo cif without shape.
5.2 R-D performance in coding foreman cif without shape.
5.3 R-D performance in coding stefan cif without shape.
5.4 R-D performance in coding akiyo cif with shape.
5.5 R-D performance in coding foreman cif with shape.
5.6 R-D performance in coding stefan cif with shape.

List of Tables

2.1 Default Quantization Matrix Q (from [3])
2.2 Nonlinear Scaler for DC Coefficients of DCT Blocks (from [3])
2.3 Profiles and Tools (from [3])
3.1 MMX Instruction Set Summary
3.2 Features and Benefits of Intel C++ Compiler (from [11])
3.3 Functional Units and Operations Performed (from [12])
4.1 Source Files and Directories Arrangement of MPEG-4 Video Reference Software
4.2 Functionalities of Microsoft MPEG-4 Video Reference Software
4.3 Major Functions of Motion Estimation
4.4 Execution Result of Optimized blkmatch16 Function Using MMX
4.5 Execution Result of Optimized blkmatch16WithShape Function Using MMX
4.6 Execution Result of Optimization of Motion Estimation Using MMX
4.7 Execution Results of Optimization of blkmatch16 and blkmatch16WithShape Using Fast Motion Search Method
4.8 Execution Result of Optimization of Motion Estimation
4.9 Execution Result of Optimization of findBestBoundingBox Function
4.10 Execution Result of Optimization of DCT and IDCT
4.11 Execution Result of Optimization of Motion Compensation
4.12 Execution Result of Quantization Optimization
5.1 Overall Coding Speed Without Shape in Average CIF Frame per Second Using Debug Compilation Mode
5.2 Overall Coding Speed Without Shape in Average CIF Frame per Second Using Release Compilation Mode
5.3 Overall Coding Speed With Shape in Average CIF Frame per Second Using Debug Compilation Mode
5.4 Overall Coding Speed With Shape in Average CIF Frame per Second Using Release Compilation Mode

Chapter 1
Introduction

The MPEG-4 standard was originally intended for very high compression coding of audio-visual information at very low bit rates. Later, its scope was extended to address not only compression but also new audio-video coding techniques for content-based interactivity and universal access. In addition to the conventional "frame" based functionalities of the MPEG-1 and MPEG-2 standards, MPEG-4 video coding also supports access to and manipulation of "objects" within video scenes.

Because MPEG-4 video encoding is computationally very demanding, real-time MPEG-4 video compression and decompression require high-speed hardware and efficient software. Parallel processing is a practical technique for handling this computational load: several independent operations or tasks run simultaneously, which increases the processing speed.

We consider a software implementation of the MPEG-4 video encoder on an Intel processor. The implementation is based on the code of the Microsoft MPEG-4 Visual Reference Software, a publicly available source for MPEG-4 encoding and decoding. To achieve real-time performance, we use Intel's MMX instructions to modify some kernels of the reference software. MMX technology is an extension to the Intel Architecture (IA) processors [8]. It supports a single-instruction, multiple-data (SIMD) execution model that is designed to accelerate advanced media and communications applications [8].

This thesis is organized as follows. Chapter 2 is an overview of MPEG-4. Chapter 3 describes Intel's MMX technology and the software tools that we use. Chapter 4 discusses the optimization methods based on Intel's MMX technology in detail. Chapter 5 presents the overall experimental results of the optimized MPEG-4 encoder. Finally, Chapter 6 contains the conclusion.

Chapter 2
Overview of MPEG-4

MPEG-4 is an ISO/IEC standard developed by MPEG (Moving Picture Experts Group), the committee that also developed the well known MPEG-1 and MPEG-2 standards. These standards made interactive video on CD-ROM, DVD and Digital Television possible. MPEG-4 is a newer standard, started in 1994 with the mandate to standardize algorithms for audio-visual coding in multimedia applications. MPEG-4, formally designated "ISO/IEC 14496", was finalized in October 1998 and became an International Standard in the first months of 1999. The fully backward compatible extensions under the title of MPEG-4 Version 2 were frozen at the end of 1999 and acquired formal International Standard status early in 2000. Several extensions have been added since, and work on some specific items is still in progress [2].

MPEG-4 builds on the proven success of three fields:

• digital television,
• interactive graphics applications (synthetic content), and
• interactive multimedia (World Wide Web, distribution of and access to content).

In this chapter, we introduce the overall organization of the MPEG-4 standard, its video texture coding scheme, and some special video coding tools.

2.1 Organization of the MPEG-4 Standard

The MPEG-4 standard addresses the generic coding of audio-visual objects, as illustrated in Figure 2.1. The standard (ISO/IEC 14496) consists of the following basic parts. The descriptions of the parts are mainly taken from [1] and [2].

Figure 2.1: A high level view of an MPEG-4 terminal (from [5]).

1. ISO/IEC 14496-1: Systems
The MPEG-4 Systems specification defines the architecture and tools to create audio-visual scenes from individual objects. A major tool for MPEG-4 Systems is scene description. The MPEG-4 scene description, a totally new component in the MPEG specifications, is based on VRML (Virtual Reality Modeling Language) and specifies the spatial-temporal composition of objects in a scene. The scene description is at the core of the Systems specification and allows easy creation of compelling audio-visual content.

2. ISO/IEC 14496-2: Visual
The MPEG-4 Visual specification defines the main video codec. It covers natural, arbitrary shape, and synthetic video coding. For natural video coding, the main tools are still texture coding tools, similar to those of MPEG-1 and MPEG-2. For intra coding, the specification uses DCT, IDCT, intra prediction, quantization and de-quantization to reduce spatial redundancy. For inter coding, it uses motion estimation and motion compensation to reduce temporal redundancy. The major difference from MPEG-1 and MPEG-2 is object coding. In MPEG-4, each picture is considered to consist of objects, since some MPEG-4 functionalities require access not only to entire pictures but also to objects. For synthetic video coding, mesh-based representation is useful, and MPEG-4 includes a tool for triangular mesh-based representation of general objects.

3. ISO/IEC 14496-3: Audio
ISO/IEC 14496-3 (MPEG-4 Audio) is a new kind of audio standard that integrates many different types of audio coding: natural sound with synthetic sound, low bit-rate delivery with high-quality delivery, speech with music, complex sound tracks with simple ones, and traditional content with interactive and virtual-reality content. Unlike previous audio standards created by ISO/IEC and other groups, MPEG-4 does not target a single application such as real-time telephony or high-quality audio compression. MPEG-4 Audio is a rather generic standard for applications that require advanced sound compression, synthesis, manipulation, or playback. The subparts specify state-of-the-art coding tools in several domains. However, MPEG-4 Audio is more than just the sum of its parts. Because the tools are integrated with the rest of the MPEG-4 standard, they enable new possibilities for object-based audio coding, interactive presentation, dynamic sound tracks, and other sorts of new media.

4. ISO/IEC 14496-4: Conformance Testing
This part of ISO/IEC 14496 specifies how tests can be designed to verify whether bitstreams and decoders meet the requirements specified in parts 1, 2, and 3 of ISO/IEC 14496. Encoders are not addressed specifically in this part. An encoder may be said to be an ISO/IEC 14496 encoder if it generates bitstreams compliant with the syntactic and semantic bitstream requirements specified in parts 1, 2 and 3 of ISO/IEC 14496.

5. ISO/IEC 14496-5: Reference Software
Reference software is normative in the sense that any conforming implementation of the software, taking the same conforming bitstreams and using the same output file format, will output the same file. Complying ISO/IEC 14496 implementations are not expected to follow the algorithms or the programming techniques used by the reference software. Although the decoding software is considered normative, it cannot add anything to the technical description included in parts 1, 2, 3 and 6 of ISO/IEC 14496.

6. ISO/IEC 14496-6: DMIF
DMIF, the Delivery Multimedia Integration Framework, is an interface between the application and the transport. It allows the MPEG-4 application developer to stop worrying about the transport: a single application can run on different transport layers when supported by the right DMIF instantiation. MPEG-4 DMIF supports the following functionalities:

• A transparent MPEG-4 DMIF-application interface irrespective of whether the peer is a remote interactive peer, broadcast or local storage media.
• Control of the establishment of FlexMux channels.
• Use of homogeneous networks between interactive peers: IP, ATM, mobile, PSTN, Narrowband ISDN.
• Support for mobile networks, developed together with ITU-T.
• User commands with acknowledgment messages.
• Management of MPEG-4 Sync Layer information.

2.2 MPEG-4 Video Coding Overview (from [3])

The goal of MPEG-4 video is to provide standardized core technologies that allow efficient storage, transmission and manipulation of video data in multimedia environments. It provides technologies to view, access and manipulate objects rather than pixels, with great error robustness over a large range of bit rates. To achieve this broad goal, video activities in MPEG-4 aim at providing solutions in the form of tools and algorithms that enable functionalities such as efficient compression, object scalability, spatial and temporal scalability, error resilience, and fine granularity scalability. The standardized MPEG-4 video provides a toolbox of tools and algorithms that address the above functionalities and more.

2.2.1 Structure of MPEG-4 Video Data

An input video sequence can be defined as a sequence of related snapshots or pictures separated in time. Many MPEG-4 functionalities require access not only to an entire sequence of pictures but also to an entire object, and not only to individual pictures but also to temporal instances of these objects within a picture. The concept of Video Objects (VOs) and their temporal instances, Video Object Planes (VOPs), is central to MPEG-4 video. A VOP can be fully described by a set of luminance and chrominance values and a shape representation. Figure 2.2 shows the decomposition of a picture into a number of separate VOPs. Each VO is encoded separately and multiplexed to form a bitstream that users can access and manipulate. Together with the VOs, the encoder sends information about scene composition to indicate where and when the VOPs of a VO are to be displayed.

Figure 2.3 shows the organization of coded MPEG-4 video in a top-down hierarchical structure.

• VideoSession (VS): A video session is the highest syntactic structure of the coded visual bitstream and simply consists of an ordered collection of video objects. It represents the complete MPEG-4 scene, which may contain any 2-D or 3-D natural or synthetic objects.

Figure 2.2: Segmentation of a picture into VOPs (from [5]).

Figure 2.3: Logical structure of coded video data (from [7]).

Figure 2.4: Types of VOP (frames labeled in the figure: I-frame, P-frame, B-frame, P-frame, I-frame).

• VideoObject (VO): A video object (2D + time) represents a complete scene or a semantically meaningful portion of a scene. In the simplest case it is a rectangular frame. It can also be an arbitrarily shaped object corresponding to a physical object or to the background of the scene.

• VideoObjectLayer (VOL): Depending on the application, each video object can be encoded in scalable (multi-layer) or non-scalable (single-layer) form, represented by the VOL. The VOL provides support for scalable coding. A video object can be encoded using spatial or temporal scalability, going from coarse to fine resolution.

• GroupOfVideoObjectPlanes (GOV): Groups of video object planes are optional entities. A GOV groups together video object planes. GOVs can provide points in the bitstream where video object planes are encoded independently of each other, and can thus provide random access points into the bitstream.

• VideoObjectPlane (VOP): A VOP is a time sample of a video object. Figure 2.4 shows three of the four types of VOP, which use different coding methods:

1. An Intra-coded (I) VOP is coded using information only from itself.
2. A Predictive-coded (P) VOP is coded using motion compensated prediction from a past reference VOP.
3. A Bidirectionally predictive-coded (B) VOP is coded using motion compensated prediction from past and/or future reference VOP(s).

4. A sprite (S) VOP is a VOP for a sprite object or a VOP which is coded using prediction based on global motion compensation from a past reference VOP.

The macroblock (MB) is the basic coding unit of a VOP. In the MPEG-4 standard, a macroblock contains a section of the luminance component and the subsampled chrominance components in 4:2:0 format. In this format, a macroblock consists of 4 luminance blocks and 2 chrominance blocks. The luminance and chrominance samples are positioned as shown in Figure 2.5.

Figure 2.5: Positions of luminance and chrominance samples in 4:2:0 data (from [6]).

2.3 MPEG-4 Video Texture Coding (from [5], [6] and [7])

Figure 2.6 shows a high level logical structure of a VO based encoder. The main components are the VO segmenter/formatter, VO encoders, system multiplexer/demultiplexer, VO decoders and VO compositor. We describe the VO encoders in more detail in this section. Figure 2.7 presents the internal structure of the VO encoder. The same encoding scheme is applied in coding all the VOPs of a given session. Compared to previous video coding standards, the encoder has one entirely new component: arbitrary shape coding.

Figure 2.6: High level structure of VO based encoder (from [5]).

2.3.1 VOP Formation

After segmentation, the shape information of the video object is available. The shape information is hereafter referred to as the alpha plane. There are two kinds of alpha plane. The binary alpha plane contains only two values: 255 is assigned to pixels belonging to the object and 0 is assigned to pixels outside the object. The gray scale alpha plane is used for hybrid (natural and synthetic) scenes generated by blue screen composition and is represented by an 8-bit component. The alpha plane is used to form a VOP. For the binary alpha plane, a rectangular bounding box enclosing the shape to be coded is formed such that its horizontal and vertical dimensions are extended to multiples of 16 pixels (the MB size). For efficient coding, it is important to minimize the number of macroblocks contained in the bounding box.
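The bounding box step can be illustrated with a minimal Python sketch. It assumes the binary alpha plane is given as a list of rows with non-zero samples inside the object. The function name `vop_bounding_box` is hypothetical and is not taken from the reference software. The sketch finds the tightest box and rounds its dimensions up to multiples of 16, but it does not search for the box position that minimizes the macroblock count.

```python
def vop_bounding_box(alpha, mb_size=16):
    """Return (left, top, width, height) of the tightest box around the
    non-zero samples of a binary alpha plane, with width and height
    rounded up to multiples of mb_size. Returns None for an empty plane.

    Illustrative only: the extended box may reach past the plane border,
    and no macroblock-count minimization is attempted.
    """
    rows = [y for y, row in enumerate(alpha) if any(row)]
    if not rows:
        return None  # fully transparent plane: nothing to code
    cols = [x for x in range(len(alpha[0])) if any(row[x] for row in alpha)]
    top, left = rows[0], cols[0]
    height = rows[-1] - top + 1
    width = cols[-1] - left + 1
    # Round each dimension up to the next multiple of the macroblock size.
    width = -(-width // mb_size) * mb_size
    height = -(-height // mb_size) * mb_size
    return left, top, width, height
```

For example, an object that occupies 17 columns and 20 rows yields a 32 × 32 box, that is, four macroblocks.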

Figure 2.7: Detailed structure of VO encoder (from [5]).

2.3.2 Shape Coding

After VOP formation, the alpha plane of the VOP is coded, based on the VOP image bounding box, before the motion vectors and texture are coded. Binary alpha planes are encoded by modified context-based arithmetic encoding (CAE), while gray scale alpha planes are encoded by motion compensated DCT similar to texture coding. An alpha plane is also bounded by an extended rectangular bounding box. The bounded alpha plane is partitioned into blocks of 16×16 samples (hereafter referred to as alpha blocks), and the encoding/decoding process is done per alpha block.

Binary Shape Coding

The basic tools for encoding binary alpha blocks (BABs) are CAE and motion compensation. InterCAE and IntraCAE are the variants of the CAE algorithm used with and without motion compensation, respectively. Motion vectors can be computed by searching for a best match position. The motion vectors themselves are differentially coded. Every BAB can be coded in one of the following modes:

1. The block is all transparent. In this case no coding is necessary. Texture information is not coded for such blocks either.
2. The block is all opaque. Again, shape coding is not necessary for such blocks, but texture information needs to be coded (since they belong to the VOP).
3. The block is coded using IntraCAE without use of past information.
4. Motion vector difference (MVD) is zero but the block is not updated.
5. MVD is non-zero, but the block is not updated.
6. MVD is zero and the block is updated. InterCAE is used for coding the block update.
7. MVD is non-zero, and the block is coded by InterCAE.

If the encoder needs rate control and rate reduction, it realizes them through size conversion of the binary alpha information. The estimation of the conversion ratio (CR) is iterative: the same factor is used in both dimensions, and the acceptability of the resulting shape quality is determined. Specifically, a 4:1 downsampled binary alpha block is tried first. If the shape errors are higher than acceptable, a 2:1 downsampled binary alpha block is tried next. If that is also unacceptable, an unsubsampled binary alpha block is used. Figure 2.8 shows the block diagram of CR determination. The selection is based on the conversion error between the original BAB and the BAB that has been down-sampled once and then reconstructed by up-sampling. The conversion error is computed for each 4×4 sub-block by taking the sum of absolute differences. If the sum is greater than a designated threshold value, the sub-block is called an "Error-PB (Pixel Block)".

CAE encoding is used to code each binary pixel of the BAB. Prior to coding the first pixel, the arithmetic encoder is initialized. Each binary pixel is then encoded in raster order. The process for encoding a given pixel is the following:

1. Compute a context number.
2. Index a probability table using the context number.
3. Use the indexed probability to drive an arithmetic encoder.
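Step 1 of this process can be sketched in Python as follows. The sketch rests on two assumptions that are not stated in the text above: the INTRA template is taken to consist of ten previously coded neighbors c0 to c9 at the offsets listed in `INTRA_TEMPLATE`, and the context number is formed as the sum of c_k·2^k. Samples outside the block are treated as 0 for simplicity, whereas an actual encoder would use a bordered BAB. Steps 2 and 3 (probability table lookup and arithmetic coding) are not shown, and all names are hypothetical.

```python
# Assumed INTRA template: (dx, dy) offsets of c0..c9 relative to the pixel
# being coded. This layout is an assumption made for illustration.
INTRA_TEMPLATE = [
    (-1, 0), (-2, 0),                               # c0, c1: current row
    (2, -1), (1, -1), (0, -1), (-1, -1), (-2, -1),  # c2..c6: row above
    (1, -2), (0, -2), (-1, -2),                     # c7..c9: two rows above
]


def intra_context(bab, x, y):
    """Return the context number of pixel (x, y) as sum(c_k << k).

    bab is a list of rows of binary samples (non-zero means opaque).
    Neighbors that fall outside the block count as 0 (a simplification).
    """
    ctx = 0
    for k, (dx, dy) in enumerate(INTRA_TEMPLATE):
        xx, yy = x + dx, y + dy
        inside = 0 <= yy < len(bab) and 0 <= xx < len(bab[0])
        bit = 1 if inside and bab[yy][xx] else 0
        ctx |= bit << k
    return ctx
```

With ten template pixels, the context number ranges from 0 to 1023, so under these assumptions the probability table in step 2 would hold 1024 entries.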

Figure 2.8: CR determination algorithm (from [6]). Flowchart: START; set CR = 1/4; if there is at least one Error-PB in the MB, set CR = 1/2; if there is again at least one Error-PB in the MB, set CR = 1; END.
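The CR determination loop can be sketched in Python as follows. Several details are assumptions made for illustration: samples are taken as 0/1 rather than 0/255, down-sampling uses a majority vote over each cell, up-sampling uses pixel replication, and `threshold` is a free parameter. The filters actually used by the standard or by the reference software may differ, and all function names are hypothetical.

```python
def downsample(bab, f):
    """Majority vote over each f x f cell (a tie counts as opaque)."""
    n = len(bab) // f
    out = []
    for y in range(n):
        row = []
        for x in range(n):
            ones = sum(bab[y * f + j][x * f + i]
                       for j in range(f) for i in range(f))
            row.append(1 if 2 * ones >= f * f else 0)
        out.append(row)
    return out


def upsample(small, f):
    """Pixel replication back to the original size."""
    n = len(small) * f
    return [[small[y // f][x // f] for x in range(n)] for y in range(n)]


def has_error_pb(bab, rec, threshold):
    """True if the SAD of any 4x4 sub-block exceeds the threshold."""
    n = len(bab)
    for by in range(0, n, 4):
        for bx in range(0, n, 4):
            sad = sum(abs(bab[by + j][bx + i] - rec[by + j][bx + i])
                      for j in range(4) for i in range(4))
            if sad > threshold:
                return True
    return False


def choose_cr(bab, threshold=0):
    """Return the down-sampling factor f (CR = 1/f): 4, then 2, else 1."""
    for f in (4, 2):
        rec = upsample(downsample(bab, f), f)
        if not has_error_pb(bab, rec, threshold):
            return f
    return 1
```

A fully opaque block is reproduced exactly at CR = 1/4, so the sketch returns 4 for it. A block with single-pixel detail produces an Error-PB at both reduced sizes when the threshold is 0 and is therefore kept at CR = 1.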

Figure 2.9: Pixel templates used for (a) INTRA and (b) INTER context determination of BAB. The pixel to be coded is marked with "?". The INTER template combines pixels of the current BAB with aligned pixels of the bordered motion compensated (MC) BAB.

When the final pixel has been processed, the arithmetic code is terminated. Figure 2.9 shows the computation of the contexts for INTRA and INTER modes.

Gray Scale Shape Coding

The gray scale shape information has a structure similar to that of the binary shape, with the difference that every pixel can take on a range of values (usually 0 to 255) representing the degree of transparency of that pixel. The gray scale shape corresponds to the notion of alpha plane used in computer graphics, in which 0 corresponds to a completely transparent pixel and 255 to a completely opaque pixel. Intermediate values correspond to intermediate degrees of transparency. A gray level alpha plane is encoded as its support function and the alpha values on the support. The support is obtained by thresholding the gray level alpha plane at 0. The support function is encoded by binary shape coding as described previously, and the alpha values are encoded using a block based motion compensated DCT similar to that of texture coding. Figure 2.10 shows the block diagram of gray shape coding.

Figure 2.10: Gray shape coding (from [6]). [Block diagram: the gray-level alpha plane is split into its support, which goes to the binary shape coder, and its texture, which goes to the texture coder.]

2.3.3 Motion Coder

There are four types of VOPs (see Figure 2.4 and associated discussion) that use different coding methods. Motion coding is necessary only for P-VOPs and B-VOPs, to reduce temporal redundancy. The motion coder consists of a motion estimator, a motion compensator, a previous/next VOP store, and a motion vector (MV) predictor and coder. In order to perform motion prediction on a per-VOP basis, the motion estimation of the blocks on the VOP borders has to be modified from block matching to polygon matching. Furthermore, a special padding technique is required for the reference VOP.

Padding Process

The padding process defines the values of luminance and chrominance samples outside the VOP for prediction of arbitrarily shaped objects. Figure 2.11 shows a simplified diagram of this process. A decoded MB d[y][x] is padded by referring to the corresponding decoded shape block s[y][x]. An MB that lies on the VOP boundary is padded by replicating the boundary samples of the VOP towards the exterior. This process is divided into horizontal repetitive padding and vertical repetitive padding. The remaining MBs that are completely outside the VOP are filled by extended padding.

• Horizontal repetitive padding: Each sample at the boundary of a VOP is replicated

horizontally to the left and/or right direction in order to fill the transparent region outside the VOP of a boundary macroblock. If there are two boundary sample values for filling a sample outside of a VOP, the two boundary samples are averaged.
• Vertical repetitive padding: The remaining unfilled transparent samples from the above procedure are padded by a process similar to horizontal repetitive padding but in the vertical direction. The samples already filled by horizontal repetitive padding are treated as if they were inside the VOP for the purpose of this vertical pass.
• Extended padding: Exterior MBs immediately next to boundary macroblocks are filled by replicating the samples at the border of the boundary macroblocks. Note that the boundary macroblocks have been completely padded by horizontal and vertical repetitive padding. If an exterior macroblock is next to more than one boundary macroblock, one of the macroblocks is chosen according to the priority shown in Figure 2.12. The exterior macroblock is then padded by replicating upwards, downwards, leftwards, or rightwards the row of samples from the horizontal or vertical border of the boundary macroblock having the largest priority number. The remaining exterior macroblocks (not located next to any boundary macroblock) are filled with 128.

Motion Estimation

Motion estimation (ME) is a method of prediction between adjacent frames/pictures. ME techniques fall into two categories: pixel-based algorithms and block-based algorithms (BMA). The motion estimation method used in the MPEG-4 encoder is block-based. In general, the ME techniques used in MPEG-4 can be seen as an extension of standard MPEG-1/2 or H.263 block matching techniques with modified block (polygon) matching. Figure 2.13 illustrates an example of polygon matching. The bounding rectangle of the VOP is first extended on the right-bottom side to multiples of the macroblock size. Zero stuffing is used for these extended pixels. The alpha value of the extended pixels is set to zero. The MBs are formed by dividing the extended bounding rectangle into 16×16 blocks. SAD is used as the error measure. The original alpha plane for the VOP is used to

Figure 2.11: Padding process (from [6]). [Block diagram with signals f[y][x], d'[y][x], d[y][x], s[y][x], s'[y][x], hor_pad[y][x], and hv_pad[y][x], and with stages: predictions, summation, saturation, horizontal repetitive padding, vertical repetitive padding, extended padding, and framestores.]

Figure 2.12: Priority of boundary MBs surrounding an exterior MB (from [6]). [Diagram: an exterior macroblock surrounded by boundary macroblocks 0, 1, 2, and 3.]

exclude the pixels of the MB that are outside the VOP. SAD is computed only for the pixels with nonzero alpha value. This forms a polygon for the MB that includes the VOP boundary. The reference VOP is padded based on its own shape information. For example, when the reference VOP is smaller than the current VOP, the reference is not padded up to the size of the current VOP.

Figure 2.13: Polygon matching for an arbitrary shape VOP (from [6]). [Diagram: a VOP overlaid on the macroblock grid, showing the transparent pixels of a boundary macroblock and the pixels used for polygon matching.]

The basic motion estimation is performed on 16 × 16 luminance MBs. The motion vector is specified to half-pixel accuracy. In many software implementations, motion estimation is performed by a full search for an integer-pixel vector, which is then used as the initial estimate for a half-pixel search around it. In the MPEG-4 standard, besides the motion vector for a 16 × 16 MB, motion vectors can be sent for individual 8 × 8 blocks to further reduce prediction errors. Both the 8 × 8 block motion compensation and overlapped motion compensated prediction are referred to as advanced prediction in H.263 and are adapted in MPEG-4 to work with arbitrarily shaped VOPs. Because the motion vector may be non-integer, sample interpolation is necessary. The interpolation of half-sample values is carried out only in half-sample mode, where the half-sample values are calculated by bilinear interpolation as depicted in Figure 2.14. Using interpolation, the half-pixel motion vector can be calculated.

Figure 2.14: Interpolation scheme for half sample search. [Diagram: integer pixel positions A, B, C, D and half pixel positions a, b, c, d.]
a = A,
b = (A + B + 1 − rounding_control) / 2,
c = (A + C + 1 − rounding_control) / 2,
d = (A + B + C + D + 2 − rounding_control) / 4.

Motion Vector Encoder

When using INTER mode coding, the motion vector must be coded. The horizontal and vertical motion vector components are coded differentially by using a spatial neighborhood of three motion vectors already coded (see Figure 2.15). These three motion vectors are candidate predictors for the differential coding. The differential coding of motion vectors is performed with reference to the reconstructed shape. In the special cases at the borders of the current VOP, the following decision rules are applied:
1. If the MB of one and only one candidate predictor is outside the VOP, it is set to zero.
2. If the MBs of two and only two candidate predictors are outside the VOP, they are set to the third candidate predictor.
3. If the MBs of all three candidate predictors are outside the VOP, they are set to zero.

The motion vector coding is performed separately on the horizontal and vertical components. For each component, the median value of the three candidates for the same component is used as the predictor, denoted Px and Py, respectively:
Px = Median(MV1x, MV2x, MV3x),
Py = Median(MV1y, MV2y, MV3y).
After finding the predictors, the vector differences MVDx = MVx − Px and MVDy = MVy − Py are coded by variable length coding.
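The border rules and the median prediction above can be sketched as follows. This is a minimal illustration, not the reference software: the function names and the representation of a candidate as a `(vector, inside_vop)` pair are hypothetical.

```python
def median3(a, b, c):
    """Median of three numbers."""
    return sorted((a, b, c))[1]


def predict_mv(candidates):
    """Return (Px, Py) from three candidates [(mv, inside_vop), ...].

    Each mv is an (x, y) tuple; inside_vop says whether the candidate's
    MB lies inside the VOP.
    """
    inside = [mv for mv, ok in candidates if ok]
    if len(inside) == 3:
        mvs = inside
    elif len(inside) == 2:
        # Rule 1: the single outside candidate is set to zero.
        mvs = [mv if ok else (0, 0) for mv, ok in candidates]
    elif len(inside) == 1:
        # Rule 2: both outside candidates take the value of the third.
        mvs = [inside[0]] * 3
    else:
        # Rule 3: all candidates are set to zero.
        mvs = [(0, 0)] * 3
    px = median3(mvs[0][0], mvs[1][0], mvs[2][0])
    py = median3(mvs[0][1], mvs[1][1], mvs[2][1])
    return px, py


def mv_difference(mv, pred):
    """Return (MVDx, MVDy) = (MVx - Px, MVy - Py)."""
    return mv[0] - pred[0], mv[1] - pred[1]
```

For candidates (1, 5), (3, −2), and (2, 0), all inside the VOP, the predictor is (2, 0), so a current vector (4, 1) yields the difference (2, 1), which is then variable-length coded.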

Figure 2.15: Motion vector prediction (from [6]). [Diagram: MV is the current motion vector, MV1 the previous motion vector, MV2 the above motion vector, and MV3 the above-right motion vector; further panels show the candidates used at the VOP border, where unavailable candidates are replaced by (0,0) or by MV1.]

Motion Compensation

The motion compensator uses motion vectors to compute the motion compensated prediction block, pred[i][j], from the reference VOP. In addition to basic motion compensation processing, three alternatives are supported, namely, unrestricted motion compensation, four-MV motion compensation, and overlapped motion compensation. For unrestricted motion compensation, the motion vectors are allowed to point outside the decoded area of a reference VOP. For an arbitrary shape VOP, the decoded area refers to the area within the bounding box, padded as described above. When a sample referenced by a motion vector is outside the decoded VOP area, an edge sample is used. The pred[i][j] is defined as follows:
xref = min(max(xcurr + dx, vhmcsr), xdim + vhmcsr − 1),
yref = min(max(ycurr + dy, vvmcsr), ydim + vvmcsr − 1),
where vhmcsr = vop_horizontal_mc_spatial_ref, vvmcsr = vop_vertical_mc_spatial_ref, (ycurr, xcurr) are the coordinates of a sample in the current VOP, (yref, xref) are the coordinates of a sample in the reference VOP, (dy, dx) is the motion vector, and (ydim, xdim) are the dimensions of the bounding rectangle of the reference VOP.
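The clamping of reference coordinates can be illustrated with a short sketch. It assumes integer-sample motion vectors and a reference array `ref` whose element [0][0] sits at the absolute position (vvmcsr, vhmcsr); the function names are hypothetical.

```python
def clamp_ref(curr, d, spatial_ref, dim):
    """min(max(curr + d, spatial_ref), dim + spatial_ref - 1)."""
    return min(max(curr + d, spatial_ref), dim + spatial_ref - 1)


def fetch_pred_sample(ref, xcurr, ycurr, dx, dy, vhmcsr, vvmcsr):
    """Return the reference sample for an integer vector (dx, dy).

    ref holds the padded bounding rectangle of the reference VOP, so
    ydim = len(ref) and xdim = len(ref[0]). Coordinates are absolute;
    the spatial reference is subtracted only when indexing the array.
    """
    ydim, xdim = len(ref), len(ref[0])
    xref = clamp_ref(xcurr, dx, vhmcsr, xdim)
    yref = clamp_ref(ycurr, dy, vvmcsr, ydim)
    return ref[yref - vvmcsr][xref - vhmcsr]
```

A vector pointing far to the upper left of the bounding rectangle therefore returns the top-left edge sample, and one pointing far to the lower right returns the bottom-right edge sample.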

The choice among one, two, or four vectors is indicated by the MCBPC codeword and the field prediction flag for each macroblock. If one motion vector is transmitted for a certain macroblock, this is defined as four vectors with the same value as the MV. When two field motion vectors are transmitted, each of the four block prediction motion vectors has a value equal to the average of the field motion vectors (rounded such that all fractional pixel offsets become half pixel offsets). If MCBPC indicates that four motion vectors are transmitted for the current macroblock, the information for the first motion vector is transmitted as the codeword MVD and the information for the three additional motion vectors is transmitted as the codewords MVD2–4. If four vectors are used, each of the motion vectors is used for all pixels in one of the four luminance blocks in the macroblock.

Overlapped motion compensation is performed when the flag obmc_disable = 0. Each pixel in an 8 × 8 luminance prediction block is a weighted sum of three prediction values, divided by 8. The creation of each pixel P(i, j) in an 8 × 8 luminance prediction block is governed by the following equation:

$$P(i,j) = \frac{p(i+MV_{x0},\, j+MV_{y0})\,H_0(i,j) + p(i+MV_{x1},\, j+MV_{y1})\,H_1(i,j) + p(i+MV_{x2},\, j+MV_{y2})\,H_2(i,j) + 4}{8},$$

where (MVx0, MVy0) denotes the motion vector for the current block, (MVx1, MVy1) denotes the motion vector of the block either above or below, (MVx2, MVy2) denotes the motion vector of the block either to the left or right of the current block, and H0(i, j), H1(i, j), and H2(i, j) denote the weighting of each pixel in the current block and neighbor blocks.

Since the VOP may be coded in P or B mode, there are three prediction modes: forward, backward, and bi-directional. The different modes make different predictions P(i, j).
1. Forward mode: Only the forward vector (MVFx, MVFy) is applied in this mode. The prediction blocks Py(i, j), Pu(i, j), Pv(i, j) are generated from the forward reference VOP.
2. Backward mode: Only the backward vector (MVBx, MVBy) is applied in this mode. The prediction blocks Py(i, j), Pu(i, j), Pv(i, j) are generated from the backward reference VOP.

3. Bi-directional mode: Both the forward vector (MVFx, MVFy) and the backward vector (MVBx, MVBy) are applied in this mode. The prediction blocks Py(i, j), Pu(i, j), Pv(i, j) are generated from the forward and backward reference VOPs by doing the forward prediction and the backward prediction, and then averaging both predictions pixel by pixel.

2.3.4 Texture Coder

The texture information of a video object plane is present in the luminance Y and the two chrominance components Cb and Cr of the video signal. In the case of an I-VOP, the texture information resides directly in the luminance and chrominance components. In the case of motion compensated VOPs, the texture information represents the residual error remaining after motion-compensated prediction. The texture coder includes a padding process (if needed), 8 × 8 block-based DCT, quantization, coefficient prediction, coefficient scan, and variable length coding.

Padding Process

When the shape of the VOP is arbitrary, there are two types of MBs that belong to the VOP:
1. Those that lie completely inside the VOP shape.
2. Those that lie on the boundary of the shape.

The macroblocks that lie completely inside the VOP are coded using a technique identical to that used in H.263. The macroblocks that lie on the boundary of the shape need to be padded before texture coding. For residual error blocks after motion compensation, the region outside the VOP within the blocks is padded with zeros. For intra blocks, the padding is performed in a three-step procedure called low pass extrapolation (LPE). This procedure is as follows:

1. Compute the arithmetic mean value m of the pixels f(i, j) in the block that belong to the VOP as

$$m = \frac{1}{N} \sum_{(i,j) \in \mathrm{VOP}} f(i,j),$$

where N is the number of pixels situated within the VOP. Division by N is done by rounding to the nearest integer.
2. Assign m to each block pixel situated outside of the VOP region, that is, f(i, j) = m for all (i, j) ∉ VOP.
3. Apply the following filtering operation to each block pixel f(i, j) outside of the VOP region, in raster-scan order:

$$f(i,j) = \frac{f(i,j-1) + f(i-1,j) + f(i,j+1) + f(i+1,j)}{4}.$$

Division is done by rounding to the nearest integer. If one or more of the four pixels used for filtering are outside the block, the corresponding pixels are not included in the filtering operation and the divisor 4 is reduced accordingly. For example, for i = 0 and j = 0, we have f(i, j) = [f(i, j + 1) + f(i + 1, j)]/2.

After this padding operation the resulting block is ready for DCT coding.

Discrete Cosine Transform Coding

Similar to MPEG-1 and MPEG-2, the 2D (8×8) DCT is used for spatial data compression in MPEG-4 inter and intra coding. The encoder performs the forward transform before quantization, and the inverse transform after inverse quantization in the loop. The reason for the inverse quantization and inverse transform is to obtain the reconstructed image for coding the next temporal frame.
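The three LPE steps described above can be sketched as follows. This is an illustrative version under stated assumptions: the filter is applied in place, so pixels earlier in raster order are already filtered when they are used as neighbors, pixel values are non-negative, and ties in the rounding go upward. The names `rounded_div` and `lpe_pad` are hypothetical.

```python
def rounded_div(total, n):
    """Divide non-negative total by n, rounding to nearest (ties round up)."""
    return (total + n // 2) // n


def lpe_pad(block, mask):
    """Pad a boundary block by low pass extrapolation.

    block: 2D list of non-negative pixel values.
    mask:  2D list, 1 where the pixel belongs to the VOP, 0 otherwise.
    """
    rows, cols = len(block), len(block[0])
    inside = [block[i][j] for i in range(rows) for j in range(cols) if mask[i][j]]
    if not inside:
        raise ValueError("a boundary block needs at least one VOP pixel")

    # Step 1: mean of the pixels inside the VOP.
    m = rounded_div(sum(inside), len(inside))

    # Step 2: assign the mean to every pixel outside the VOP.
    f = [[block[i][j] if mask[i][j] else m for j in range(cols)]
         for i in range(rows)]

    # Step 3: filter the outside pixels in raster-scan order, skipping
    # neighbors that fall outside the block and reducing the divisor.
    for i in range(rows):
        for j in range(cols):
            if mask[i][j]:
                continue
            neighbors = []
            for ni, nj in ((i, j - 1), (i - 1, j), (i, j + 1), (i + 1, j)):
                if 0 <= ni < rows and 0 <= nj < cols:
                    neighbors.append(f[ni][nj])
            f[i][j] = rounded_div(sum(neighbors), len(neighbors))
    return f
```

Pixels inside the VOP are never modified; only the exterior region is smoothed, which keeps the block free of artificial sharp edges before the DCT.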

Figure 2.16: Quantizers in H.263. (a) For intra DC coefficient only. (b) For inter DC and all AC coefficients. [Plots of the quantizer input-output characteristics; axis ticks are at multiples of Q/2 in (a) and at ±Th and ±(Th + Q) in (b).]

Table 2.1: Default Quantization Matrix Q (from [3])

Intra                          Non-intra
 8 16 19 22 26 27 29 34        16 16 16 16 16 16 16 16
16 16 22 24 27 29 34 37        16 16 16 16 16 16 16 16
19 22 26 27 29 34 34 38        16 16 16 16 16 16 16 16
22 22 26 27 29 34 37 40        16 16 16 16 16 16 16 16
22 26 27 29 32 35 40 48        16 16 16 16 16 16 16 16
26 27 29 32 35 40 48 58        16 16 16 16 16 16 16 16
26 27 29 34 38 46 56 69        16 16 16 16 16 16 16 16
27 29 35 38 46 56 69 83        16 16 16 16 16 16 16 16

Table 2.2: Nonlinear Scaler for DC Coefficients of DCT Blocks (from [3])

Component      DC scaler for quantizer (Q) range
               1–4      5–8         9–24        25–31
Luminance      8        2Q          Q+8         2Q−16
Chrominance    8        (Q+13)/2    (Q+13)/2    Q−6
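The nonlinear DC scaler can be written as a small piecewise function. The sketch below uses the values of the MPEG-4 Visual specification (2Q − 16 for luminance and Q − 6 for chrominance in the range 25–31, with integer division in (Q + 13)/2); the function name `dc_scaler` and its arguments are hypothetical.

```python
def dc_scaler(q, luminance=True):
    """Return the DC scaler for quantizer value q in 1..31."""
    if not 1 <= q <= 31:
        raise ValueError("quantizer value must be in 1..31")
    if q <= 4:
        return 8
    if luminance:
        if q <= 8:
            return 2 * q
        if q <= 24:
            return q + 8
        return 2 * q - 16
    # Chrominance: a single expression covers the range 5..24.
    if q <= 24:
        return (q + 13) // 2
    return q - 6
```

The pieces join smoothly at the range boundaries, for example 32 at Q = 24 and 34 at Q = 25 for luminance, so the scaler grows steadily with Q.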

Quantization

MPEG-4 video supports two techniques of quantization (Q), one referred to as the H.263 quantization method and the other as the MPEG quantization method. The H.263 quantization method uses a dead zone for intra and inter AC coefficients and no dead zone for intra DC coefficients. The MPEG quantization method uses a uniform quantizer with the default matrix. Figure 2.16 shows the quantizer characteristics in H.263. It has uniform quantization for intra DC coefficients and nearly uniform midtread quantization for the inter DC and all AC coefficients. For AC data, input between −Th and +Th is quantized to zero. All coefficients in a macroblock go through the same quantizer. The step size Q can be changed in increments of 2 from 2 to 62 depending on the rate controller.

In the MPEG quantizer, each coefficient produced by the 2D DCT is quantized with a uniform quantizer. The default quantizer matrix is defined as shown in Table 2.1. The default quantizer matrix can be changed by the rate controller if the required channel bandwidth is unavailable.

Typically, the DC coefficients of the DCT of blocks belonging to an intra macroblock are scaled by a constant scaling factor of 8. However, in MPEG-4 video, a nonlinear scaler as shown in Table 2.2 is used to provide a higher coding efficiency. The characteristics of nonlinear scaling differ between the luminance and chrominance blocks and further depend on the quantizer used for the block.

Intra Prediction

After quantization, the DC coefficients and many AC coefficients of an intra block are coded by intra prediction. Intra prediction is a new operation used in the MPEG-4 standard to reduce the spatial redundancy between 8 × 8 blocks. There are two types of prediction, DC prediction and AC prediction. Figure 2.17 shows the prediction of DC coefficients in intra 8 × 8 blocks. The quantized intra coefficients are predicted from three previously decoded DC coefficients. For example, the DC coefficient of block X is predicted from the DC coefficients of blocks A, B and C. Unlike MPEG-2, the method of prediction in the MPEG-4 standard is gradient

Figure 2.17: Prediction of DC coefficients of blocks in an intra MB (from [5]). [Diagram: neighboring blocks A, B, C, and D and the current blocks X and Y within the macroblock.]

Figure 2.18: Prediction of AC coefficients of blocks in an intra MB (from [5]). [Diagram: neighboring blocks A, B, C, and D and the current blocks X and Y within the macroblock.]

based. In computing the prediction of block X, if the absolute value of the horizontal gradient is less than the absolute value of the vertical gradient, then the QDC of block C is used as the prediction; otherwise the QDC value of block A is used. The AC prediction depends on the DC prediction, as shown in Figure 2.18. The AC coefficients in the first row or in the first column are predicted from previously decoded AC coefficients. The direction of prediction is the same as for DC prediction.

Figure 2.19: Scans for 8 × 8 blocks (from [3]).

Scan and VLC

The predicted DC and AC coefficients (as well as the un-predicted AC coefficients) of DCT blocks are scanned by one of three scans, alternate-horizontal, alternate-vertical, and zigzag (the normal scan used in H.263 and MPEG-1), to convert the 2D coefficient block into one-dimensional data; see Figure 2.19. The actual scan used depends on the coefficient predictions used. For instance, if the DC prediction refers to the horizontally adjacent block, the alternate-vertical scan is selected for the current block. If the DC prediction refers to the vertically adjacent block, the alternate-horizontal scan is used for the current block. For all

other blocks, the 8 × 8 blocks of transform coefficients are zigzag scanned. After the scan, the coefficients usually form a sequence with many zeros at the end. This kind of data stream is well suited to run-length coding. In the MPEG-4 standard, differential DC coefficients in intra blocks are encoded with variable length codes. The AC coefficients, however, are encoded by the variable length codes for EVENTs. An EVENT is a combination of a last non-zero coefficient indication, the number of successive zeros preceding the coded coefficient (RUN), and the non-zero value of the coded coefficient (LEVEL). Some statistically rare events have no variable length codes to represent them. For them an escape coding method is used.

2.4 Other Video Coding Tools (from [5]) and Profiles and Levels (from [3])

2.4.1 Other Video Coding Tools

In addition to texture video coding, there are some special tools defined in MPEG-4. In this section, we briefly introduce robust video coding and scalable coding.

Robust Video Coding

Since the MPEG-4 standard supports access to audio or video data over a diverse range of channels, especially wireless networks, error resilience is necessary. In the error resilient mode, MPEG-4 video offers a number of tools as follows:
1. Object priorities: The object-based organization of MPEG-4 video potentially makes it easier to achieve a higher degree of error robustness due to the possibility of prioritizing each semantic object based on its relevance. Further, VOP types lend themselves to a form of automatic prioritization, since B-VOPs are noncausal and do not contribute to error propagation and thus can be assigned a lower priority and perhaps even be discarded in case of severe errors.

2. Resynchronization: It is possible for an encoder to offer increased error resilience by placing resynchronization (resync) markers in the bitstream with approximately constant spacing, such as at the beginning of each MB.
3. Data partitioning: Data partitioning provides a mechanism to increase error resilience by separating the normal motion and texture data of all macroblocks in a video packet and sending all of the motion data followed by a motion marker, followed by all of the texture data.
4. Reversible VLCs: The reversible VLCs offer a mechanism for a decoder to recover additional texture data in the presence of errors, since the special design of reversible VLCs enables decoding of codewords in both the forward (normal) and the reverse direction.
5. Intra update and scalable coding: Intra update is a simple method to limit error propagation. However, more intra coding reduces coding efficiency. Another method is scalable coding, which can prevent error propagation without more intra coding.

Scalable Coding

The scalability tools in MPEG-4 Video are designed to support applications beyond those supported by single layer video. The applications of scalability include internet video, wireless video, multi-quality video services, video database browsing, etc. In scalable video coding, it is assumed that given a coded bitstream, decoders of various complexities can decode and display appropriate reproductions of the coded video. MPEG-4 Video provides several different forms of scalability that address non-overlapping applications with corresponding complexities. The basic scalability tools offered are temporal scalability and spatial scalability. The Fine Granularity Scalability (FGS), which supports continuous scalability of bit rate and

video quality, is also defined.

2.4.2 Profiles and Levels

Although there are many tools in the MPEG-4 standard, not every MPEG-4 decoder has to implement all of them. Similar to MPEG-2, profiles and levels are defined as subsets of the entire bitstream syntax of all the tools. The purpose of defining conformance points in the form of profiles and levels is to facilitate interchange of bitstreams among different applications. There are eight profiles defined by MPEG-4: simple, core, main, simple scalable, animated 2D mesh, basic animated texture, still scalable texture, and simple face. The detailed definitions are given in Table 2.3. Compared with the previous standards, the simple profile of MPEG-4 is similar to the coding method in H.263. The difference is that the simple profile has error resilience but does not have B-frame coding. The simple scalable profile is the same as the simple profile, but with rectangular scalability added. The core profile has all tools of the simple profile plus temporal scalability, B-VOP coding, and binary shape coding. The main profile has all tools of the core profile plus gray shape coding, interlace, and sprite coding. The other profiles are for particular purposes, such as 2D dynamic mesh coding and facial animation coding.

Table 2.3: Profiles and Tools (from [3]). [Table marking, for each of the profiles Simple, Core, Main, Simple Scalable, Animated 2D Mesh, Basic Animated Texture, Still Scalable Texture, and Simple Face, which of the following visual tools it supports: Basic (1. I-VOP, 2. P-VOP, 3. AC/DC prediction, 4. 4MV and unrestricted MV); Error resilience (1. slice resynchronization, 2. data partitioning, 3. reversible VLC); Short header; B-VOP; Method 1/Method 2 quantization; P-VOP based temporal scalability (1. rectangular, 2. arbitrary shape); Binary shape; Grey shape; Interlace; Sprite; Temporal scalability (rectangular); Spatial scalability (rectangular); Scalable still texture; 2D dynamic mesh with uniform topology; 2D dynamic mesh with Delaunay topology; Facial animation parameters.]

Chapter 3

Intel's MMX Technology and Tools for Software Optimization

As discussed previously, our goal is to achieve a real-time implementation of an MPEG-4 video encoder using Intel's MMX technology. In this chapter, we introduce Intel's MMX technology, including the features of MMX, the MMX instruction set, and the extensions of MMX termed SSE and SSE2. We also introduce the software tools we use to help development.

3.1 Intel's MMX Technology (from [8], [9] and [10])

The multimedia extensions (MMX) for the Intel Architecture (IA) were designed to enhance the performance of advanced media and communication applications. The MMX technology introduces new general-purpose instructions. These instructions operate in parallel on multiple data elements packed into 64-bit quantities. They accelerate the performance of applications with compute-intensive algorithms that perform localized, recurring operations on small native data. This includes applications such as motion video, combined graphics with video, image processing, audio synthesis, speech synthesis and compression, telephony, video conferencing, 2D graphics, and 3D graphics.

The MMX technology uses the single instruction, multiple data (SIMD) technique. This technique speeds up software performance by processing multiple data elements in

parallel, using a single instruction. The MMX technology supports parallel operations on byte, word, and doubleword data elements, and the new quadword (64-bit) integer data type.

3.1.1 MMX Technology Overview

The MMX technology defines a simple and flexible SIMD execution model to handle 64-bit packed integer data. This model adds the following new features to the IA: new data types, MMX registers, and an enhanced instruction set. All MMX instructions operate on MMX registers, the general-purpose registers, and/or memory, as shown in Figure 3.1.

Figure 3.1: MMX execution environment. [Diagram: eight 64-bit MMX registers, eight 32-bit general-purpose registers, and an address space from 0 to 2^32 − 1.]

• MMX registers: These MMX registers are used to perform operations on 64-bit packed integer data.
• General-purpose registers: The eight general-purpose registers are used along with the existing IA-32 addressing modes to address operands in memory.

MMX Data Types

The MMX technology introduced the following four new 64-bit data types, as illustrated in Figure 3.2:
• Packed byte: 8 bytes packed into one 64-bit quantity.
• Packed word: 4 words packed into one 64-bit quantity.
• Packed doubleword: 2 doublewords packed into one 64-bit quantity.
• Packed quadword: One 64-bit quantity.

Figure 3.2: MMX packed data types (from [8]).

The 64 bits are numbered 0 through 63. Bit 0 is the least significant bit (LSB), and bit 63 is the most significant bit (MSB). The low-order bits are the lower part of the data element and the high-order bits are the upper part of the data element. Bytes in a multibyte format have consecutive memory addresses. The ordering is little endian. That is, the bytes with lower addresses are less significant than the bytes with higher addresses.
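The four packed views of one 64-bit quantity, and the little-endian byte ordering, can be illustrated with Python's standard `struct` module. This sketch only models the data layout in memory and does not execute any MMX instruction; the function name `packed_views` is hypothetical.

```python
import struct


def packed_views(raw):
    """Interpret 8 bytes (lowest memory address first) as MMX packed types."""
    if len(raw) != 8:
        raise ValueError("an MMX quantity is exactly 8 bytes")
    return {
        "bytes": struct.unpack("<8B", raw),        # 8 packed bytes
        "words": struct.unpack("<4H", raw),        # 4 packed words
        "doublewords": struct.unpack("<2I", raw),  # 2 packed doublewords
        "quadword": struct.unpack("<Q", raw)[0],   # 1 quadword
    }


views = packed_views(bytes([0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08]))
```

The byte at the lowest address (0x01) ends up in bits 0 to 7 of the quadword 0x0807060504030201, and the lowest word is 0x0201, as the little-endian rule requires.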

Figure 3.3: MMX register set. [Diagram: registers MM0 to MM7, each 64 bits wide.]

MMX Registers

The MMX register set consists of eight 64-bit registers, as shown in Figure 3.3, which are used to perform calculations on the MMX packed data but cannot be used to address memory. Values in MMX registers have the same format as a 64-bit quantity in memory. These registers are aliased to the floating-point registers. The MMX instructions access the MMX registers directly using the register names MM0 to MM7.

Enhanced Instruction Set

The MMX instruction set supplies a set of instructions that operate in parallel on all data elements of a packed data type. The MMX instructions implement two principles: operation on packed data and saturation arithmetic.
• Operations on packed data: MMX uses the SIMD technique for performing arithmetic and logic operations on bytes, words, or doublewords packed into MMX registers, as shown in Figure 3.4.
• Saturation arithmetic: When performing integer arithmetic, an operation may result in an out-of-range condition, where the true result cannot be represented in the destination format. The MMX technology provides three ways to handle out-of-range conditions.

Figure 3.4: SIMD execution model (from [9]).

1. Wraparound arithmetic. With wraparound arithmetic, an out-of-range value is truncated. That is, the carry or overflow bit is ignored and only the least significant bits of the result are returned to the destination. Wraparound arithmetic is suitable for applications that control the range of operands to prevent out-of-range results. If the range of operands is not controlled, wraparound arithmetic can lead to large errors.
2. Signed saturation arithmetic. With signed saturation arithmetic, out-of-range values are limited to the representable range of signed integers for the integer size being operated on.
3. Unsigned saturation arithmetic. With unsigned saturation arithmetic, out-of-range values are limited to the representable range of unsigned integers for the integer size being operated on.

3.1.2 MMX Instruction Sets Introduction

This section provides an overview of the MMX instruction groups. Detailed information on the instructions can be found in [10]. The MMX instructions are grouped into the following categories:
• Data transfer
• Arithmetic

• Comparison
• Conversion
• Unpacking
• Logical
• Shift
• Empty MMX state instruction (EMMS)

Table 3.1 gives a summary of the instructions in the MMX instruction set.

Data Transfer Instructions

We can transfer 32-bit or 64-bit data from memory to MMX registers and vice versa, or from integer registers to MMX registers and vice versa, with a single instruction. We can transfer 32-bit data with MOVD and 64-bit data with MOVQ.

Arithmetic Instructions

The arithmetic instructions perform addition, subtraction, multiplication, and multiply-add operations on packed data types. For example, the PADDB, PADDSB and PADDUSB instructions add signed or unsigned packed byte integers in wraparound mode, signed packed byte integers in signed saturation mode, and unsigned packed byte integers in unsigned saturation mode, respectively.

Comparison Instructions

The comparison instructions compare the packed data in the source and destination operands for equality or for greater-than. These instructions generate a mask of ones or zeros, which is written to the destination operand.
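The difference between the three addition modes, and the masks produced by the comparison instructions, can be illustrated with a scalar emulation on lists of eight byte values. This is a sketch of the arithmetic only, written in Python for readability; the function names mirror the mnemonics but are otherwise hypothetical, and no SIMD hardware is involved.

```python
def to_signed(v):
    """Reinterpret a byte value 0..255 as a signed 8-bit integer."""
    return v - 256 if v >= 128 else v


def paddb(a, b):
    """Wraparound add: keep only the low 8 bits of each sum."""
    return [(x + y) & 0xFF for x, y in zip(a, b)]


def paddsb(a, b):
    """Signed saturating add: clamp each sum to -128..127."""
    return [max(-128, min(127, to_signed(x) + to_signed(y))) & 0xFF
            for x, y in zip(a, b)]


def paddusb(a, b):
    """Unsigned saturating add: clamp each sum to 0..255."""
    return [min(x + y, 0xFF) for x, y in zip(a, b)]


def pcmpeqb(a, b):
    """Compare for equal: all ones where equal, all zeros elsewhere."""
    return [0xFF if x == y else 0x00 for x, y in zip(a, b)]


def pcmpgtb(a, b):
    """Signed compare for greater than, producing the same kind of mask."""
    return [0xFF if to_signed(x) > to_signed(y) else 0x00 for x, y in zip(a, b)]


a = [200, 100, 127, 128, 0, 255, 1, 50]
b = [100, 100, 1, 255, 0, 1, 255, 50]
```

With these inputs, the first element gives 44 under wraparound (300 mod 256) but 255 under unsigned saturation, which is why saturating instructions are preferred for pixel data that must not wrap from white to black.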

Table 3.1: MMX Instruction Set Summary

Data Transfer               32-bit Transfers    64-bit Transfers
  Register to Register      MOVD                MOVQ
  Load from Memory          MOVD                MOVQ
  Store to Memory           MOVD                MOVQ

Category                    Wraparound                   Signed Saturation     Unsigned Saturation
Arithmetic
  Addition                  PADDB, PADDW, PADDD          PADDSB, PADDSW        PADDUSB, PADDUSW
  Subtraction               PSUBB, PSUBW, PSUBD          PSUBSB, PSUBSW        PSUBUSB, PSUBUSW
  Multiplication            PMULLW, PMULHW
  Multiply and Add          PMADDWD
Comparison
  Compare for Equal         PCMPEQB, PCMPEQW, PCMPEQD
  Compare for Greater Than  PCMPGTB, PCMPGTW, PCMPGTD
Conversion
  Pack                                                   PACKSSWB, PACKSSDW    PACKUSWB
Unpack
  Unpack High               PUNPCKHBW, PUNPCKHWD, PUNPCKHDQ
  Unpack Low                PUNPCKLBW, PUNPCKLWD, PUNPCKLDQ

Category                    Packed              Full 64-bit
Logical
  And                                           PAND
  And Not                                       PANDN
  Or                                            POR
  Exclusive OR                                  PXOR
Shift
  Shift Left Logical        PSLLW, PSLLD        PSLLQ
  Shift Right Logical       PSRLW, PSRLD        PSRLQ
  Shift Right Arithmetic    PSRAW, PSRAD

Empty MMX State             EMMS

(54)
Figure 3.5: PACKSSDW instruction operation using 64-bit operands (from [10]).

Conversion Instructions
The conversion instructions convert between the packed data types. For example, the PACKSSDW instruction converts packed signed doubleword integers into packed signed word integers, using saturation to handle overflow conditions. Figure 3.5 shows an example of this packing operation.

Unpack Instructions
The unpack instructions unpack bytes, words, or doublewords from the high-order or low-order elements of the source and destination operands and interleave them in the destination operand. By placing all zeros in the source operand, these instructions can be used to convert byte integers to word integers, word integers to doubleword integers, or doubleword integers to quadword integers. For example, the PUNPCKLBW instruction interleaves the low-order bytes of the source and destination operands, as shown in Figure 3.6.

Figure 3.6: PUNPCKLBW instruction operation using 64-bit operands (from [10]).
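The two operations in Figures 3.5 and 3.6 can be sketched as scalar models in portable C. The code below follows the element ordering of the 64-bit forms (destination elements in the low or even positions, source elements in the high or odd positions) and shows the zero-source trick for widening bytes to words. The function names are hypothetical, and the word assembly in bytes_to_words assumes the little-endian layout of IA-32.

```c
#include <stdint.h>

/* Saturate a signed doubleword to the signed word range. */
static int16_t sat_s32_to_s16(int32_t v) {
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return (int16_t)v;
}

/* Model of PACKSSDW (64-bit operands): the two doublewords of dst fill
   the low two result words, the two doublewords of src the high two. */
static void packssdw_model(const int32_t dst[2], const int32_t src[2],
                           int16_t out[4]) {
    out[0] = sat_s32_to_s16(dst[0]);
    out[1] = sat_s32_to_s16(dst[1]);
    out[2] = sat_s32_to_s16(src[0]);
    out[3] = sat_s32_to_s16(src[1]);
}

/* Model of PUNPCKLBW (64-bit operands): interleave the low four bytes
   of dst and src; dst bytes land in the even positions. */
static void punpcklbw_model(const uint8_t dst[8], const uint8_t src[8],
                            uint8_t out[8]) {
    for (int i = 0; i < 4; ++i) {
        out[2 * i]     = dst[i];
        out[2 * i + 1] = src[i];
    }
}

/* With an all-zero source, unpacking zero-extends bytes to words
   (little-endian: low byte first, zero high byte second). */
static void bytes_to_words(const uint8_t pixels[8], uint16_t words[4]) {
    static const uint8_t zero[8] = {0};
    uint8_t tmp[8];
    punpcklbw_model(pixels, zero, tmp);
    for (int i = 0; i < 4; ++i) {
        words[i] = (uint16_t)(tmp[2 * i] | (tmp[2 * i + 1] << 8));
    }
}
```

Widening 8-bit pixels to 16-bit words before arithmetic and packing the results back with saturation afterwards is a common pattern in pixel-processing code, since intermediate sums and products can exceed the byte range.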

(55)
Logical Instructions
The logical instructions perform bitwise logical operations on 64-bit quantities. For example, we can clear register MM0 to zero with "PXOR mm0, mm0".

Shift Instructions
There are two types of shift instructions: logical shifts and arithmetic shifts. The logical shift instructions shift the data elements left or right and fill the vacated low-order or high-order bit positions with zeros. The arithmetic shift instructions shift the data elements right and copy the sign bit of each data element into the vacated bit positions at its upper end.

EMMS Instruction
The EMMS instruction empties the MMX state. It must be executed at the end of an MMX routine, before calling other routines that may execute floating-point instructions.

3.1.3 SSE and SSE2, Later Extensions of MMX Technology
The streaming SIMD extensions (SSE) were introduced into the IA-32 architecture in the Pentium III processor family, and the streaming SIMD extensions 2 (SSE2) were introduced into the IA-32 architecture in the Pentium 4 and Intel Xeon processors.

Overview of SSE Extensions
The SSE extensions extend the SIMD execution model by adding facilities for handling packed and scalar single-precision floating-point values contained in 128-bit registers. The SSE extensions add the following features to the IA-32 architecture:
• Eight 128-bit data registers, called the XMM registers and named XMM0 through XMM7.
• The 32-bit MXCSR register, which provides control and status bits for operations performed on the XMM registers.

(56)
• The 128-bit packed single-precision floating-point data type (four IEEE single-precision floating-point values packed into a double quadword).
• Instructions that perform SIMD operations on single-precision floating-point values and that extend the SIMD operations that can be performed on integers:
  – 128-bit packed and scalar single-precision floating-point instructions that operate on operands located in the XMM registers.
  – 64-bit SIMD integer instructions that support additional operations on packed integer operands located in the MMX registers.
• Instructions that save and restore the state of the MXCSR register.
• Instructions that support explicit prefetching of data, control of the cacheability of data, and control of the ordering of store operations.
• Extensions to the CPUID instruction.

SSE Programming Environment
Figure 3.7 shows the execution environment for the SSE extensions. All SSE instructions operate on the XMM registers and/or memory as follows:
• XMM registers: These eight registers are used to operate on packed or scalar single-precision floating-point data. Scalar operations are performed on the individual single-precision floating-point value stored in the low doubleword of an XMM register.
• MXCSR register: This 32-bit register provides status and control bits used in SIMD floating-point operations.
• MMX registers: These are the same as in MMX technology.
• General-purpose registers: These are the same as in MMX technology.
• EFLAGS register: This 32-bit register is used to record the results of some compare operations.
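The distinction between packed and scalar operations on an XMM register can be sketched with a small model in portable C. It mimics the behavior of the SSE instructions ADDPS (packed add) and ADDSS (scalar add), which are not named in the text above and are used here only as an assumed illustration. The type and function names are hypothetical.

```c
/* Model of a 128-bit XMM register holding four single-precision values;
   f[0] is the low doubleword. Names are illustrative only. */
typedef struct {
    float f[4];
} xmm_model;

/* Packed add (as ADDPS does): all four lanes are added independently. */
static xmm_model addps_model(xmm_model a, xmm_model b) {
    xmm_model r;
    for (int i = 0; i < 4; ++i) {
        r.f[i] = a.f[i] + b.f[i];
    }
    return r;
}

/* Scalar add (as ADDSS does): only the low lane is added; the upper
   three lanes of the destination operand a pass through unchanged. */
static xmm_model addss_model(xmm_model a, xmm_model b) {
    xmm_model r = a;
    r.f[0] = a.f[0] + b.f[0];
    return r;
}
```

With a = (1, 2, 3, 4) and b = (10, 20, 30, 40), the packed form yields (11, 22, 33, 44) while the scalar form yields (11, 2, 3, 4).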

(57)
Figure 3.7: SSE execution environment (from [9]).

SSE Instruction Set
The SSE instructions are divided into four functional groups:
• Packed and scalar single-precision floating-point instructions
• 64-bit SIMD integer instructions
• State management instructions
• Cacheability control, prefetch, and memory ordering instructions
The instructions we use are the 64-bit SIMD integer instructions, for example PSADBW. Detailed information on the SSE instructions can be found in [9].

Overview of SSE2 Extensions
The SSE2 extensions use the same SIMD execution model that is used with the MMX technology and the SSE extensions. The SSE2 extensions add the following features to the IA-32 architecture:
• Five data types:

(58)
  – 128-bit packed double-precision floating-point (two IEEE Standard 754 double-precision floating-point values packed into a double quadword).
  – 128-bit packed byte integers.
  – 128-bit packed word integers.
  – 128-bit packed doubleword integers.
  – 128-bit packed quadword integers.
• Instructions that support explicit prefetching of data, control of the cacheability of data, and control of the ordering of store operations.
• Instructions to support the additional data types and to extend existing SIMD integer operations:
  – Packed and scalar double-precision floating-point instructions.
  – Additional 64-bit and 128-bit SIMD integer instructions.
  – 128-bit versions of SIMD integer instructions introduced with the MMX technology and the SSE extensions.
  – Additional cacheability-control and instruction-ordering instructions.
The SSE2 programming environment is the same as that of SSE; no new registers are defined with the SSE2 extensions.

SSE2 Instruction Set
The SSE2 instructions are divided into four functional groups:
• Packed and scalar double-precision floating-point instructions
• 64-bit and 128-bit SIMD integer instructions
• 128-bit extensions of SIMD integer instructions introduced with the MMX technology and the SSE extensions
• Cacheability-control and instruction-ordering instructions
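PSADBW, the SSE 64-bit SIMD integer instruction named earlier, is one of the instructions that SSE2 widens to 128 bits. The portable C sketch below models both forms: the 64-bit form sums the absolute differences of eight unsigned bytes, and the 128-bit form produces two such sums, one per 64-bit half. This is a scalar model with hypothetical function names, not the SIMD code of this thesis. The row_sad16 helper is an assumed usage example for a 16-pixel row.

```c
#include <stdint.h>
#include <stdlib.h>

/* Model of 64-bit PSADBW: sum of absolute differences of eight
   unsigned bytes. The maximum is 8 * 255 = 2040, so it fits a word. */
static uint16_t psadbw64_model(const uint8_t a[8], const uint8_t b[8]) {
    unsigned sum = 0;
    for (int i = 0; i < 8; ++i) {
        sum += (unsigned)abs((int)a[i] - (int)b[i]);
    }
    return (uint16_t)sum;
}

/* Model of the 128-bit SSE2 form: two independent sums, one for the
   low 64 bits and one for the high 64 bits of the operands. */
static void psadbw128_model(const uint8_t a[16], const uint8_t b[16],
                            uint16_t sums[2]) {
    sums[0] = psadbw64_model(a, b);
    sums[1] = psadbw64_model(a + 8, b + 8);
}

/* Hypothetical helper: SAD of one 16-pixel row, obtained by adding
   the two partial sums of the 128-bit form. */
static unsigned row_sad16(const uint8_t cur[16], const uint8_t ref[16]) {
    uint16_t s[2];
    psadbw128_model(cur, ref, s);
    return (unsigned)s[0] + (unsigned)s[1];
}
```

In this model one 128-bit operation covers a full 16-pixel row, where the 64-bit form would need two operations; the final addition of the two partial sums remains in either case.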

(59)
The instructions we use are the 128-bit SIMD integer instructions. All of the 64-bit SIMD integer instructions introduced with the MMX technology and the SSE extensions have been extended by the SSE2 extensions to operate on 128-bit packed integer operands located in the XMM registers. For example, where the 64-bit version of the PADDB instruction operates on 8 packed bytes, the 128-bit version operates on 16 packed bytes. Detailed information on the SSE2 instructions can be found in [9].

3.2 Software Tools for Implementation
In this section, we introduce two tools that help software development. The first is the Intel C++ compiler, which compiles C and C++ code for Intel IA-32 and Itanium-based systems running Microsoft operating systems. The second is the VTune Performance Analyzer, which helps to analyze the performance of an application by locating its hotspots, that is, the areas of code that take a long time to execute.

3.2.1 Intel C++ Compiler (from [11])
The Intel C++ compiler optimizes the performance of applications running on Intel architecture-based computers. The features and benefits of the Intel C++ compiler are summarized in Table 3.2. The compiler has minimum hardware and software system requirements. To use the compiler correctly, we suggest reading the release notes first.

3.2.2 Intel VTune (from [12])
The Intel VTune Analyzer helps to locate and remove software performance bottlenecks by collecting, analyzing, and displaying performance data from the system-wide level down to the source level. The VTune Analyzer provides multiple profiling technologies that enable optimization across multiple operating system platforms and development environments, and it supports the latest Intel processors.