Design of H.264/MPEG-4 AVC Intra Codec for High Definition Size Still Image and Video Applications

National Chiao Tung University

Department of Electronics Engineering and Institute of Electronics

Master's Thesis

Student: Chun-Wei Ku

Advisor: Dr. Tian-Sheuan Chang

July 2006

Design of H.264/MPEG-4 AVC Intra Codec for High Definition Size Still Image and Video Applications

Student: Chun-Wei Ku

Advisor: Dr. Tian-Sheuan Chang

A Thesis Submitted to Institute of Electronics, College of Electrical Engineering and Computer Science, National Chiao Tung University, in Partial Fulfillment of the Requirements for Degree of Master of Science in Electronic Engineering

July 2006

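Acknowledgements

First, I would like to thank my advisor, Dr. Tian-Sheuan Chang (張添烜). Throughout my graduate studies he gave me a great deal of support and encouragement, often discussed ideas with me, and resolved my difficulties and questions. His support freed me from other worries so that I could devote myself to research, and this thesis would not exist without it. I am deeply grateful to him. I also thank my oral defense committee members, Director 李鎮宜 of the NCTU Department of Electronics Engineering and Professor 陳永昌 of the NTHU Department of Electrical Engineering, for taking time out of their busy schedules to guide me; their valuable comments made this thesis more complete. Next, I thank my lab partners. Thanks to my senior 鄭朝鐘, who led me into the field of video processing, taught me many research skills, shared valuable experience, and strengthened my design ability, laying the foundation for later contest awards and hardware designs. Thanks to my seniors 張彥中 and 林佑昆 for their guidance in coursework and research, which let my research proceed smoothly. Thanks to my classmate 王裕仁, who took part in two IC design contests with me, both with good results, and with whom I often exchanged ideas. Thanks also to my classmates 蔡旻奇 and 余國亘, with whom I completed the encoder design. In addition, I thank my seniors 海珊 and 史彥芪, my classmate 吳錦木, and my juniors 子筠, 嘉俊, 得瑋, 秈璟, and 英澤; your help made my life in the lab go smoothly. All of this is a treasured memory of my time at NCTU. Finally, I thank my family, my father, mother, and younger brother; your quiet support has been my greatest motivation to complete my studies. I dedicate this thesis to all those who love me and whom I love.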

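Design of H.264/MPEG-4 AVC Intra Codec for High Definition Size Still Image and Video Applications

Student: Chun-Wei Ku

Advisor: Dr. Tian-Sheuan Chang

Institute of Electronics, National Chiao Tung University

Abstract (Chinese)

In recent decades, digital video technology has been widely used and has become an indispensable part of daily life. With the development of digital signal processing and the demand for better coding performance, H.264/AVC is regarded as the next-generation international video coding standard. Compared with earlier standards, its powerful coding techniques significantly reduce the amount of data while maintaining video quality. Among these techniques, spatial intra coding is a new tool with high coding efficiency. This efficiency makes intra coding suitable not only for single-picture video coding but also for still image compression, where it is even comparable to the latest image coding standard, JPEG2000. However, because of the complicated coding techniques, the computational complexity of intra coding is also much higher than in previous standards. Reducing the complexity and designing a high-performance intra encoder or decoder without much performance degradation is therefore an important issue. In this thesis, we provide two hardware implementations, an intra frame codec and a fast intra frame encoder, to address this problem.

First, we propose a baseline profile intra frame codec architecture optimized at both the algorithm level and the system level. To reduce hardware cost and increase processing speed at nearly the same video quality, the hardware-oriented algorithm removes the area-costly plane prediction and strengthens the mode decision process with a more accurate cost function. In the architecture design, besides fast module implementations, the coding process is arranged by macroblock-level pipelining and three scheduling techniques to avoid idle cycles and improve data throughput. The whole codec supports real-time encoding of high definition 1280x720 video at 30fps with a 117MHz clock, and decoding of high definition 1920x1080 video at 58MHz.

The other result is a baseline profile intra frame encoder designed for low power, with a fast mode decision algorithm and variable-pixel parallelism. The proposed modified three-step algorithm shortens the mode decision process. In addition, the variable-pixel parallel datapath saves about half of the processing cycles and thus lowers the frequency requirement. With interlaced scheduling and three low-power strategies, the new design has a smaller chip area than the previous design and needs only 61MHz to support real-time encoding of high definition 1280x720 video at 30fps.

In short, our contributions to H.264/AVC intra coding fall into two parts. One is the intra frame codec, which integrates the encoding and decoding processes with minimal hardware cost and improved processing speed. The other is the fast intra frame encoder, whose features include reduced computational complexity, a suppressed frequency requirement, and strategies for low power.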

Design of H.264/MPEG-4 AVC Intra Codec for High Definition Size Still Image and Video Applications

Student: Chun-Wei Ku

Advisor: Dr. Tian-Sheuan Chang

Institute of Electronics

National Chiao Tung University

Abstract

In recent decades, digital video technology has come into wide use and become a necessary part of our daily life. With the development of digital signal processing and the demand for better coding performance, H.264/AVC is regarded as the international video coding standard for the next generation. With its powerful coding techniques, the new standard achieves a significant bitrate reduction compared to earlier standards while still maintaining the video quality. Among these techniques, spatial intra coding is a newly proposed coding tool with high coding efficiency. This efficiency makes intra coding suitable not only for single-picture video coding but also for still image compression, where it is even competitive with the latest image coding standard, JPEG2000. However, due to the complicated coding techniques, the computational complexity of intra coding is also much higher than in previous standards. Thus, how to reduce the complexity and design a highly efficient intra encoder or decoder without much performance degradation is an important issue. In this thesis, we contribute two hardware implementations, an intra frame codec and a fast intra frame encoder, to address this issue.

First, we propose a baseline profile intra frame codec architecture with both algorithm-level and system-level optimization. To reduce hardware cost and increase processing speed while providing nearly the same video quality, the hardware-oriented algorithm removes the area-costly plane prediction and enhances the mode decision process with a more accurate cost function. In the architecture design, in addition to fast module implementation, the process is arranged by macroblock-level pipelining together with three scheduling techniques to avoid idle cycles and improve data throughput. The whole codec design can support real-time encoding of high definition 1280x720 video at 30fps when clocked at 117MHz, and decoding of high definition 1920x1080 video at 58MHz.

The other work is a baseline intra frame encoder targeted at low power, with techniques such as a fast mode decision algorithm and variable-pixel parallelism. The mode decision process is shortened by the proposed modified three-step algorithm. In addition, the variable-pixel parallel datapath saves almost half of the processing cycles and leads to a lower frequency requirement. With interlaced scheduling and three strategies for low-power consideration, the new design has a smaller chip area than previous designs and can support real-time encoding of high definition 1280x720 video at 30fps at only 61MHz.

In brief, our contributions to H.264/AVC intra coding can be divided into two parts. One contribution is the intra frame codec, which integrates both encoding and decoding processes with minor hardware cost and improved processing speed. The other contribution is the fast intra frame encoder, which features reduced computational complexity, a suppressed frequency requirement, and strategies for low power.

Contents

Chapter 1 Introduction

1.1. Motivation

1.2. Thesis Organization

Chapter 2 Overview of H.264/AVC Standard

2.1. Fundamental of H.264/AVC

2.1.1. Coding Structure

2.1.2. Features of Standard

2.1.3. Profiles

2.2. Components of Baseline Intra Coding

2.2.1. Intra Prediction

2.2.2. Cost Generation and Mode Decision

2.2.3. Transform

2.2.4. Quantization

2.2.5. Entropy Coding

Chapter 3 H.264/AVC Intra Frame Codec

3.1. Hardware Oriented Algorithm

3.1.1. Enhanced SATD Function for Mode Decision

3.1.2. Intra Plane Mode Removal

3.1.3. Simulation Results

3.2. System Level Scheme

3.2.2. Macroblock Level Pipelining

3.3. Architecture Design of Intra Codec

3.3.1. Overall Architecture

3.3.2. Schedule of Codec

3.3.3. Intra Prediction Generation Unit

3.3.4. Transform Unit

3.3.5. Quantization and De-quantization

3.3.6. Cost Generation and Mode Decision Unit

3.3.7. Reconstruction Path

3.3.8. Memory Organization

3.3.9. CAVLC Codec

3.4. Implementation Results

3.4.1. Gate-count and Layout

3.4.2. Comparison

Chapter 4 Fast H.264/AVC Intra Frame Encoder

4.1. Fast Algorithm for Intra Prediction

4.1.1. Survey of Fast Algorithm

4.1.2. Modified Fast Algorithm for Intra Prediction

4.1.3. Simulation Results

4.2. Architecture Design of Fast Intra Encoder

4.2.1. Overall Architecture

4.2.2. Scheduling of Encoder

4.2.3. Eight-pixel Parallel Datapath

4.2.5. Strategies for Low Power Design

4.3. Implementation Results

4.3.1. Gate-count and Layout

4.3.2. Comparison

List of Figures

Fig. 1 Hierarchy of video data components

Fig. 2 Basic structure diagram of H.264/AVC encoder

Fig. 3 Basic structure diagram of H.264/AVC decoder

Fig. 4 Three profiles of H.264/AVC

Fig. 5 Nine modes for intra 4x4 prediction

Fig. 6 Four modes for Intra 16x16 or 8x8 prediction

Fig. 7 Flow diagram of most probable mode selection

Fig. 8 (a) Prefix and suffix bitstrings, (b) exp-Golomb bitstrings, (c) mapping for signed bitstrings

Fig. 9 Example of CAVLC Coding

Fig. 10 Four categorized types of intra prediction modes

Fig. 11 Intra plane mode for (a) 16x16 (b) 8x8 predictions

Fig. 12 RD curves of [10] and proposed algorithm for sequence “Stefan”

Fig. 13 RD curves of [10] and proposed algorithm for sequence “Mobile”

Fig. 14 RD curves of [10] and proposed algorithm for sequence “Paris”

Fig. 15 RD curves of [10] and proposed algorithm for sequence “Akiyo”

Fig. 16 RD curves of [10] and proposed algorithm for sequence “Foreman”

Fig. 17 RD curves of [10] and proposed algorithm for sequence “Coastguard”

Fig. 18 Ping-pong architecture with macroblock level pipelining for encoder

Fig. 19 Proposed architecture of baseline intra frame codec

Fig. 20 Encoder dataflow of the design

Fig. 22 Pipelined schedule for codec design

Fig. 23 Reconfigurable datapath of intra prediction generation unit

Fig. 24 Examples of operations for four intra prediction modes

Fig. 25 Hardware architecture of transform unit [20]

Fig. 26 Quantization and de-quantization unit

Fig. 27 Cost generation and mode decision unit

Fig. 28 Memory organization in intra codec

Fig. 29 Architecture for CAVLC encoder

Fig. 30 Architecture for CAVLC decoder

Fig. 31 Layout of the codec chip

Fig. 32 Decision flow of fast three-step algorithm for intra prediction

Fig. 33 Fast three-step algorithm in pipeline structure

Fig. 34 Decision flow of modified three-step algorithm for intra prediction

Fig. 35 Modified fast three-step algorithm in pipeline structure

Fig. 36 RD curves of [10] and proposed fast algorithm for sequence “Stefan”

Fig. 37 RD curves of [10] and proposed fast algorithm for sequence “Mobile”

Fig. 38 RD curves of [10] and proposed fast algorithm for sequence “Paris”

Fig. 39 RD curves of [10] and proposed fast algorithm for sequence “Akiyo”

Fig. 40 RD curves of [10] and proposed fast algorithm for sequence “Foreman”

Fig. 41 RD curves of [10] and proposed fast algorithm for sequence “Coastguard”

Fig. 42 Proposed architecture of encoder with fast algorithm

Fig. 43 Pipelined schedule for fast encoder when best luma mode is selected to 16x16

Fig. 44 Pipelined schedule for fast encoder when best luma mode is selected to 4x4

Fig. 45 Eight-pixel parallel intra prediction generator

Fig. 46 Eight-input eight-output 4x4 transform unit

Fig. 47 Memory organization in fast encoder

List of Tables

Table 1 Average bitrate saving for video streaming

Table 2 Quantization factors in H.264/AVC

Table 3 De-quantization factors in H.264/AVC

Table 4 Probability distribution of 16x16 modes in different sequences with 300 I-frames at QP=28

Table 5 Comparison among original code [10], SAITD algorithm in [14], and the two proposed algorithms for coding of 300 Intra frames

Table 6 Data throughput for different video size

Table 7 Frequency for N-pixel parallel encoder

Table 8 Frequency for N-pixel parallel decoder

Table 9 List of gate count for proposed design

Table 10 Information for the chip

Table 11 Comparison among [13], [21], and this work

Table 12 Comparison among original [10], modified algorithm, and proposed fast algorithm combined of three techniques for 300 Intra frames

Table 13 List of gate count for fast encoder

Table 14 Information for the encoder chip


Chapter 1 Introduction

With the demand for higher video quality and lower bitrate, and the feasibility offered by fast-growing semiconductor processing, a new video coding standard was developed by the Joint Video Team (JVT) of ISO/IEC MPEG and ITU-T VCEG as the next generation video compression standard, known as H.264 or MPEG-4 Part 10 Advanced Video Coding (AVC) [1]. In comparison with existing video standards such as MPEG-2 [2] and MPEG-4 [3], the latest standard can improve the coding efficiency by up to 50% while still keeping the video quality [4], thanks to various newly introduced coding tools.

1.1. Motivation

The earlier video coding standard MPEG-2, also known as H.262, was developed in the last decade as the extension of the prior MPEG-1 [5]. This standard is still widely used in many applications of our daily life, such as transmission of TV signals over satellite and cable, and storage of high-quality video signals on DVDs. However, with the popularity of high definition TV and the flexibility of multimedia services, the present standard cannot meet the requirements of video quality and real-life applications. In addition, the transmission capacity of some transmission media such as cable modem and DSL is much lower than that of broadcast channels, which also limits the use of MPEG-2.

For videoconferencing and multimedia streaming services, H.263 [6] and its later enhancement H.263+ [7] were developed by ITU-T to deal with low-bitrate video coding in telecommunication applications. The other standard, MPEG-4, was also launched in recent years to address future multimedia applications like interactive TV and internet video. The MPEG-4 standard consists of more parts besides the traditional video and audio systems. Its video standard allows coding, scalability, and access to an individual object, and can achieve better coding efficiency relative to prior standards. Since the MPEG-4 video standard is built on the same coding structure as MPEG-2 with new coding tools added, it offers a modest coding gain at the expense of a modest increase in complexity [4]. As a result, the increase in complexity is justified only for object-based video coding but not for natural rectangular video applications.

To formulate a new standard for next generation video coding, the joint team of ITU-T VCEG and ISO MPEG was established to co-develop it for natural video. The newest international video coding standard, well known as H.264/AVC, was approved by ITU-T as Recommendation H.264 and by ISO/IEC as International Standard 14496-10 MPEG-4 Part 10 Advanced Video Coding. H.264/AVC is designed as a technical solution for various application areas, for example, broadcast systems over cable or satellite, internet video, interactive storage on optical devices, wireless and mobile networks, and multimedia streaming services. To satisfy the flexibility of multiple applications, H.264/AVC can adjust the coding complexity through the different profiles defined in the standard.

The coding architecture of the H.264/AVC standard is different from that of the existing MPEG-2 standard and brings noticeable improvement in both bitrate reduction and preservation of decoded video quality [8]. Table 1 presents the average bitrate saving of four popular video standards, where the bitrate reductions of H.264/AVC relative to MPEG-4, H.263, and MPEG-2 are 39%, 49%, and 64% respectively.


Table 1 Average bitrate saving for video streaming

The significant improvement in H.264/AVC comes from various enhanced and new coding techniques, which achieve higher video quality and better compression than the earlier standards. These techniques include variable block size and multiple reference picture motion estimation/compensation, simplified small block size integer transform, quarter-sample accuracy in motion vectors, in-loop deblocking filter, directional spatial-domain intra prediction, context adaptive entropy coding, and arithmetic entropy coding.

Among the above-mentioned techniques, spatial-domain intra prediction is a new feature in H.264/AVC. The previous standard, MPEG-4, uses the Intra-DC and Intra-AC technique to encode an I-frame by encoding only the difference between neighboring transformed blocks. The newly introduced intra coding method in H.264/AVC exploits the correlation among adjacent blocks in the spatial domain to reduce the redundancy of the picture being encoded. It predicts the currently coded block from pixel values of neighboring blocks along various directions and encodes only the residues. Spatial-domain prediction achieves higher coding efficiency in single frame coding than the previous standards, and is even competitive with the latest still image coding standard, JPEG2000 [9].

As a result, intra-frame-only coding and decoding is very suitable for applications that do not need or cannot afford the inter prediction capability. The hardware of an intra frame codec can be used in portable consumer products like digital video recorders or digital still cameras. For these applications, a design with low power consumption and low memory cost is also necessary to be practical.

1.2. Thesis Organization

This thesis is organized into five chapters. Chapter 1 gives the introduction and motivation of this work. Chapter 2 gives a brief overview of the H.264/AVC standard and its intra coding. Chapter 3 presents the architecture design of the H.264/AVC intra frame codec and its hardware-oriented algorithm. Chapter 4 presents an intra frame encoder with a fast prediction technique and a lower working frequency. Finally, concluding remarks are given in Chapter 5.


Chapter 2 Overview of H.264/AVC Standard

Earlier standards like MPEG-1 and MPEG-2 have enabled many popular consumer products such as video CDs and DVDs. As their successor, H.264/AVC is designed to be more powerful in coding performance and more flexible across applications. With highly developed signal processing and semiconductor technology, many complicated and computationally intensive coding tools can be supported efficiently in the H.264/AVC standard. These advanced tools markedly improve the coding efficiency while maintaining the decoded video quality.

2.1. Fundamental of H.264/AVC

2.1.1. Coding Structure

The coding structure of H.264/AVC is similar to those of previous standards and is built on the commonly used motion estimation and transform coding structure. This standard supports the 4:2:0 format with 8-bit sample precision for pixel values and encodes the video in picture order. A picture, which may be an interlaced field or a non-interlaced frame, can be partitioned into several slices, and each slice consists of a series of macroblocks. A macroblock is the primary unit for video coding and includes one 16x16 luminance (luma) and two 8x8 chrominance (chroma) components. The macroblock can further be separated into 16x8, 8x16, 8x8, 8x4, 4x8, and 4x4 sub-macroblocks for motion estimation/compensation, where the 4x4 one is called a block. The hierarchy of data organization in H.264/AVC is shown in Fig. 1.


Fig. 1 Hierarchy of video data components

Fig. 2 shows the basic structure diagram of the H.264/AVC encoder, and Fig. 3 shows the decoder. The encoding flow first performs intra prediction in an I-slice or motion estimation from some reference pictures in a P-slice. After inter or intra prediction, the residual values, which are the differences between the predicted values and the original ones, are sent to the forward transform unit and then quantized. The quantized coefficients, syntax, motion vectors, and other coding information are further coded by entropy coding. In addition, the quantized coefficients are also reconstructed through de-quantization, scaling, inverse transform, and motion compensation as the reference for the next slice. To decode a slice, residual values are recovered through entropy decoding, de-quantization, and inverse transform in that order. Decoded slices are reconstructed by adding the residual values to the data from motion compensation for a P-slice or from intra prediction for an I-slice. For the detailed coding flow, refer to [1] or [4].

[Fig. 2 block labels: Input Video Signal, Split into Macroblocks (16x16 pixels), Coder Control, Transform/Scal./Quant., Scaling & Inv. Transform, Intra-frame Prediction, Motion Compensation, Motion Estimation, De-blocking Filter, Entropy Coding; signals: Control Data, Quant. Transf. coeffs, Motion Data, Intra/Inter]

Fig. 2 Basic structure diagram of H.264/AVC encoder

[Fig. 3 block labels: Entropy Decoding, Scaling & Inv. Transform, Intra-frame Prediction, Motion-Compensation, De-blocking Filter, Output Video Signal; signals: Decoded coeffs, Motion Data, Intra/Inter]

Fig. 3 Basic structure diagram of H.264/AVC decoder

2.1.2. Features of Standard

The H.264/AVC standard adopts many new coding tools that were not present in the earlier standards, and it also enhances previous techniques to obtain better coding efficiency. These features significantly improve the coding performance and make the standard suitable for a wide range of applications such as broadcast systems and internet video. Some of the important features of H.264/AVC are introduced in the following.

1. Variable block-size motion estimation and compensation

The standard has more flexibility in the selection of block sizes and shapes for motion estimation and compensation than any previous standard. Seven block sizes are supported: 16x16, 16x8, 8x16, 8x8, 8x4, 4x8, and 4x4. This helps to enhance the coding efficiency for irregularly shaped objects or background behind moving objects.

2. Quarter-sample-accurate motion vector

Most of the previous standards enable half-sample motion vector accuracy, but H.264/AVC improves on this with quarter-sample motion vector accuracy, which was first found in the advanced profile of the MPEG-4 Visual (Part 2) standard. In addition, H.264/AVC reduces the complexity of the interpolation processing to simplify the computation.

3. Multiple reference picture motion estimation and compensation

In MPEG-2 and its following standards, only one previous picture can be used to predict the values in the incoming picture. The H.264/AVC standard enlarges the selection of reference pictures from one to several for better coding efficiency. Thus, motion vectors across multiple reference pictures are allowed. In addition, the standard also allows bi-directional prediction coding, which uses both previous and next pictures as references.

4. Spatial-based directional intra prediction coding

In previous standards such as MPEG-1 and MPEG-2, I-pictures are directly coded. The MPEG-4 Visual standard [3] adopts Intra-AC and Intra-DC prediction for coding of I-pictures, which uses neighboring transformed blocks to perform the prediction and residual coding. However, these coding methods do not exploit the spatial-domain correlation among adjacent blocks. Thus, H.264/AVC introduces a spatial-based prediction technique with directional pixel mapping for I-picture coding before the transform, which uses the reconstructed neighboring pixels to perform the prediction with modes from different directions. With this technique, the coding efficiency for I-pictures is improved effectively.

5. Small block size integer transform

Prior video standards generally use a transform block size of 8x8, while the new H.264/AVC uses a smaller transform size of 4x4. This allows the encoder to represent signals in a more locally-adaptive fashion and reduces the artifacts that appear around edges in the picture.

6. In-loop deblocking filter

Block-based video coding may produce blocking artifacts, which originate from both the prediction and the residual difference coding stages. The solution to this problem is an adaptive deblocking filter, which improves the resulting video quality. Instead of being an optional feature as in H.263+, the deblocking filter in H.264/AVC is positioned in the motion compensation loop as an in-loop filter, so that the quality improvement in a single picture extends to inter-picture prediction as well.

7. Context-adaptive entropy coding

For compression of quantized transform coefficients, an efficient variable-length coding (VLC) method is used in H.264/AVC. VLC coding was already used in the existing standards but is enhanced in this standard with context adaptivity to increase the coding performance.

8. Arithmetic entropy coding

Another coding method known as context-adaptive binary arithmetic coding (CABAC) is also included in H.264/AVC as the advanced entropy coding. This arithmetic coding can achieve higher efficiency than VLC coding due to the effective probability model of symbol occurrence.

2.1.3. Profiles

Fig. 4 Three profiles of H.264/AVC

There are three profiles defined in H.264/AVC, as shown in Fig. 4: the baseline, main, and extended profiles. The baseline profile includes basic coding tools and features, such as I-slices, P-slices, quarter-sample accurate motion vectors, the deblocking filter, and CAVLC, and is primarily used for lower-cost applications that demand fewer computing resources. Thus, this profile is widely used in videoconferencing, internet multimedia, and mobile applications. The main profile is used as the mainstream consumer profile for broadcast systems and storage devices. It contains most of the features of the baseline profile and other advanced techniques, like adaptive frame/field coding, interlaced coding, weighted prediction, B-slices, and CABAC. The main profile achieves better performance in both bitrate saving and video quality but needs much more computation. Finally, the extended profile, which includes all the features of the baseline and main profiles except CABAC, is intended as the streaming video profile and has relatively high compression capability with extra tools for robustness to data losses and server stream switching.

2.2. Components of Baseline Intra Coding

2.2.1. Intra Prediction

Spatial-domain prediction is the main feature of H.264/AVC intra coding. There are two kinds of intra prediction for luma components, nine 4x4 prediction modes or four 16x16 prediction modes, which are shown in Fig. 5 and Fig. 6 respectively. The 4x4 prediction modes use the neighboring thirteen reconstructed samples denoted from A to M in Fig. 5 to predict the block pixels with eight different directions and one average value. For 16x16 prediction modes, the values are predicted from the 32 adjacent boundary pixels of upper and left macroblocks. Similar procedures are also applied to the chroma components where four 8x8 prediction modes are used with 16 neighboring pixels.
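As a rough illustration of the 4x4 prediction described above, the following Python sketch covers only three of the nine modes (vertical, horizontal, and DC). The function names and the `top`/`left` arguments are hypothetical, and the sketch assumes all eight neighboring samples are available; it is not the thesis hardware design.

```python
def predict_4x4(mode, top, left):
    """Return a 4x4 predicted block as a list of rows.

    top  : four reconstructed samples directly above the block
    left : four reconstructed samples directly to the left of the block
    Both are assumed to be available (no picture-boundary handling).
    """
    if mode == "vertical":      # every row repeats the samples above
        return [list(top) for _ in range(4)]
    if mode == "horizontal":    # every row is filled with its left sample
        return [[left[r]] * 4 for r in range(4)]
    if mode == "dc":            # every pixel is the rounded mean of 8 neighbors
        dc = (sum(top) + sum(left) + 4) >> 3
        return [[dc] * 4 for _ in range(4)]
    raise ValueError("mode not covered by this sketch")


def residual(src, pred):
    """Residual block (source minus prediction), which is what gets coded."""
    return [[src[r][c] - pred[r][c] for c in range(4)] for r in range(4)]
```

The remaining six directional modes interpolate along diagonal directions and are omitted here for brevity.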


Fig. 5 Nine modes for intra 4x4 prediction

Fig. 6 Four modes for Intra 16x16 or 8x8 prediction

2.2.2. Cost Generation and Mode Decision

In [10], the best intra prediction mode can be decided either by the time-consuming rate distortion optimization (RDO) or by a much simpler cost accumulation. RDO evaluates a weighted sum of the actual encoded bitrate and the distortion measured on the reconstructed samples. Though it achieves better performance, it is computationally intensive.

An alternative way is using cost accumulation. Two generally used mode decision methods for cost generation are available in [10], sum of absolute difference (SAD) and sum of absolute transform difference (SATD). The formulas are defined as follows.

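$$C_1 = \mathrm{SAD} + 4\,m\,\lambda(Qp) \tag{1}$$

$$C_2 = \mathrm{SATD} + 4\,m\,\lambda(Qp) \tag{2}$$

$$\mathrm{SAD} = \sum_{i=1}^{4}\sum_{j=1}^{4} \left| s_{ij} - p_{ij} \right| \tag{3}$$

$$\mathrm{SATD} = \sum_{i=1}^{4}\sum_{j=1}^{4} \left| T(s_{ij} - p_{ij}) \right| \tag{4}$$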

In (1) and (2), λ(Qp) stands for the lambda value, which comes from an approximated exponential function of the quantization parameter, and the variable m indicates whether the current mode is the most probable mode: m is 0 if the current mode is the most probable mode and 1 otherwise. The decision flow of the most probable mode is illustrated in Fig. 7. In (3) and (4), sij and pij are the (i, j)th elements of the source block and the predicted block respectively. The function T(x) in (4) represents the 4x4 discrete Hadamard transform (DHT), as shown in (5), where X indicates the residual block.

[Flowchart: find the upper and left block modes. If the block is a boundary block, most probable = DC mode. Otherwise, if up mode < left mode, most probable = up mode; else most probable = left mode.]

Fig. 7 Flow diagram of most probable mode selection

$$T(X) = \left( \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & 1 & -1 & -1 \\ 1 & -1 & -1 & 1 \\ 1 & -1 & 1 & -1 \end{bmatrix} X \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & 1 & -1 & -1 \\ 1 & -1 & -1 & 1 \\ 1 & -1 & 1 & -1 \end{bmatrix} \right) / 2 \tag{5}$$

For every four blocks in Z-scan order of a macroblock, another lambda term, 6λ(Qp)+0.5, is added to the cost value to adjust the weight. The best mode is finally decided by comparing the accumulated cost of the sixteen blocks in the 4x4 prediction with the cost of the best mode in the 16x16 prediction.
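The cost formulas (1) to (5) and the most probable mode rule of Fig. 7 can be sketched in Python as follows. This is a minimal sketch, not the reference software: the function names are hypothetical, `lam` stands for λ(Qp) and is passed in rather than computed, the Hadamard row order is an assumption (the SATD value does not depend on it), and `DC_MODE = 2` follows the mode numbering of the standard.

```python
DC_MODE = 2  # mode number of DC prediction in the standard's 4x4 numbering

# 4x4 Hadamard matrix used by the DHT in (5); the row order is an assumption
H4 = [[1, 1, 1, 1],
      [1, 1, -1, -1],
      [1, -1, -1, 1],
      [1, -1, 1, -1]]


def matmul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def sad(src, pred):
    """Sum of absolute differences, as in (3)."""
    return sum(abs(src[i][j] - pred[i][j]) for i in range(4) for j in range(4))


def satd(src, pred):
    """Sum of absolute Hadamard-transformed differences, as in (4) and (5)."""
    x = [[src[i][j] - pred[i][j] for j in range(4)] for i in range(4)]
    t = matmul(matmul(H4, x), H4)
    # All 16 coefficients share the parity of the residual sum, so the
    # total is even and the division by 2 is exact.
    return sum(abs(v) for row in t for v in row) // 2


def most_probable_mode(up_mode, left_mode, boundary):
    """Fig. 7: DC mode for boundary blocks, else the smaller mode number."""
    if boundary:
        return DC_MODE
    return up_mode if up_mode < left_mode else left_mode


def mode_cost(distortion, mode, mpm, lam):
    """Cost as in (1)/(2): the 4*lambda penalty is skipped for the MPM."""
    m = 0 if mode == mpm else 1
    return distortion + 4 * m * lam
```

A mode decision loop would call `mode_cost(satd(src, pred), mode, mpm, lam)` for each candidate mode and keep the minimum.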

2.2.3. Transform

In H.264/AVC, the traditional 8x8 floating-point discrete cosine transform (DCT) of previous standards is replaced by an approximated 4x4 DCT. The transform can be divided into two parts: a 4x4 integer transform and fractional scalar multiplication factors that are merged into the quantization stage. The forward 4x4 integer transform with quantization is illustrated in (6), and the inverse one with de-quantization in (7), where the matrices with factors a and b denote the scalar multiplication and ⊗ denotes element-by-element multiplication. Notice that the transform result in (6) is post-scaled by quantization and matrix Y in (7) is pre-scaled by de-quantization. For a macroblock predicted by the 16x16 or 8x8 modes, the DC value of each transformed block is further processed by the 4x4 DHT or the 2x2 DHT respectively, as shown in (5) or (8).

$$Y = \left( \begin{bmatrix} 1 & 1 & 1 & 1 \\ 2 & 1 & -1 & -2 \\ 1 & -1 & -1 & 1 \\ 1 & -2 & 2 & -1 \end{bmatrix} X \begin{bmatrix} 1 & 2 & 1 & 1 \\ 1 & 1 & -1 & -2 \\ 1 & -1 & -1 & 2 \\ 1 & -2 & 1 & -1 \end{bmatrix} \right) \otimes \begin{bmatrix} a^2 & ab/2 & a^2 & ab/2 \\ ab/2 & b^2/4 & ab/2 & b^2/4 \\ a^2 & ab/2 & a^2 & ab/2 \\ ab/2 & b^2/4 & ab/2 & b^2/4 \end{bmatrix} \tag{6}$$

$$X_R = \begin{bmatrix} 1 & 1 & 1 & 1/2 \\ 1 & 1/2 & -1 & -1 \\ 1 & -1/2 & -1 & 1 \\ 1 & -1 & 1 & -1/2 \end{bmatrix} \left( Y \otimes \begin{bmatrix} a^2 & ab & a^2 & ab \\ ab & b^2 & ab & b^2 \\ a^2 & ab & a^2 & ab \\ ab & b^2 & ab & b^2 \end{bmatrix} \right) \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & 1/2 & -1/2 & -1 \\ 1 & -1 & -1 & 1 \\ 1/2 & -1 & 1 & -1/2 \end{bmatrix} \tag{7}$$

$$T_C(X) = \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix} X \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix} \tag{8}$$
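The forward integer transform of (6) and the 2x2 DHT of (8) can be sketched in Python as follows. The function names are hypothetical. The default values a = 1/2 and b = sqrt(2/5) are those commonly quoted for H.264/AVC in the literature, not values stated in this text, so they are an assumption of the sketch.

```python
# Integer core matrix of the forward 4x4 transform in (6)
CF = [[1, 1, 1, 1],
      [2, 1, -1, -2],
      [1, -1, -1, 1],
      [1, -2, 2, -1]]


def matmul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def transpose(m):
    return [list(row) for row in zip(*m)]


def forward_core(x):
    """Integer-only part of (6): Cf * X * Cf^T, using only adds and shifts."""
    return matmul(matmul(CF, x), transpose(CF))


def forward_scaling(a=0.5, b=(2 / 5) ** 0.5):
    """Scaling matrix of (6); a and b defaults are assumed, see lead-in."""
    a2, ab2, b24 = a * a, a * b / 2, b * b / 4
    even = [a2, ab2, a2, ab2]
    odd = [ab2, b24, ab2, b24]
    return [even, odd, list(even), list(odd)]


def hadamard_2x2(dc):
    """2x2 DHT of (8), applied to the four chroma DC values."""
    h = [[1, 1], [1, -1]]
    return matmul(matmul(h, dc), h)
```

In a real codec the scaling matrix is not applied as a separate floating-point step; as the text notes, it is folded into the quantization factors.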

2.2.4. Quantization

In the quantization stage, the H.264/AVC standard provides 52 values of the quantization parameter (QP), each with a corresponding quantization step. The step doubles for every increase of six in QP. The scaling factors mentioned in Subsection 2.2.3 are incorporated into the quantization factors to avoid computational complexity in the 4x4 transform stage. These factors for quantization and de-quantization are implemented in [10] as multiplication factors and are shown in Table 2 and Table 3 respectively.
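The doubling rule for the quantization step can be sketched in Python as follows. The six base step sizes for QP 0 to 5 are the values commonly tabulated for H.264/AVC in the literature. They are not taken from Table 2 or Table 3, so they should be read as an assumption of this sketch, and `qstep` is a hypothetical helper name.

```python
# Commonly tabulated step sizes for QP = 0..5 (assumed, see lead-in)
BASE_QSTEP = [0.625, 0.6875, 0.8125, 0.875, 1.0, 1.125]


def qstep(qp):
    """Quantization step for QP in 0..51: doubles for every increase of 6."""
    if not 0 <= qp <= 51:
        raise ValueError("QP must be in the range 0..51")
    return BASE_QSTEP[qp % 6] * (1 << (qp // 6))
```

Because of this periodicity, a hardware quantizer only needs factors for six QP values plus a shift by QP // 6.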

Table 2 Quantization factors in H.264/AVC

Table 3 De-quantization factors in H.264/AVC

2.2.5. Entropy Coding

There are two entropy coding methods supported in H.264/AVC baseline profile.

1. Exponential Golomb coding (Exp-Golomb)

The standard uses exponential Golomb coding (Exp-Golomb) to encode the syntax elements of coding information, such as mode and type. Exp-Golomb codes consist of a prefix part and a suffix part, each a string of bits, as shown in Fig. 8. Signed values of syntax elements are first mapped to unsigned code numbers.

(a)
Bitstring Form                    Range
1                                 0
0 1 x0                            1 - 2
0 0 1 x1 x0                       3 - 6
0 0 0 1 x2 x1 x0                  7 - 14
0 0 0 0 1 x3 x2 x1 x0             15 - 30
0 0 0 0 0 1 x4 x3 x2 x1 x0        31 - 62
…                                 …

(b)
Bitstring         Code Num
1                 0
0 1 0             1
0 1 1             2
0 0 1 0 0         3
0 0 1 0 1         4
0 0 1 1 0         5
0 0 1 1 1         6
0 0 0 1 0 0 0     7
…                 …

(c)
Code Num     Syntax Value
0            0
1            1
2            -1
3            2
4            -2
5            3
6            -3
7            4
…            …

Fig. 8 (a) Prefix and suffix bitstrings, (b) exp-Golomb bitstrings, (c) mapping for signed bitstrings
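The mappings of Fig. 8 can be reproduced with a short Python sketch. The function names `ue`, `se`, `signed_to_code_num`, and `decode_ue` are chosen here for illustration only, and bitstrings are represented as Python strings of '0' and '1' for readability.

```python
def ue(code_num):
    """Unsigned Exp-Golomb bitstring for code_num >= 0 (Fig. 8 (b))."""
    if code_num < 0:
        raise ValueError("code_num must be non-negative")
    info = bin(code_num + 1)[2:]          # binary of code_num + 1, starts with 1
    return "0" * (len(info) - 1) + info   # prefix zeros, then the info bits


def signed_to_code_num(value):
    """Map a signed syntax value to a code number (Fig. 8 (c))."""
    return 2 * value - 1 if value > 0 else -2 * value


def se(value):
    """Signed Exp-Golomb bitstring."""
    return ue(signed_to_code_num(value))


def decode_ue(bits):
    """Decode one unsigned codeword from the front of a bitstring.

    Returns (code_num, number_of_bits_consumed).
    """
    zeros = 0
    while bits[zeros] == "0":
        zeros += 1
    length = 2 * zeros + 1
    return int(bits[zeros:length], 2) - 1, length
```

For example, `ue(3)` gives "00100" and `se(-1)` gives "011", matching the rows of Fig. 8.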

2. Context adaptive variable length coding (CAVLC)

In the baseline profile, the standard uses CAVLC to code the quantized coefficients into the bitstream. Fig. 9 illustrates an example of CAVLC coding as a flow diagram. The following items are coded in order: the number of nonzero coefficients, the signs of the trailing ones, the levels of the remaining nonzero coefficients, the total number of zeros, and the runs of zeros between nonzero coefficients. The coefficients are scanned in reverse zigzag order for CAVLC encoding, and decoding is the reverse of the encoding process. If all coefficients within an 8x8 block are zero, the coding process skips them and signals this case with a special flag, the coded block pattern (CBP).
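The items listed above can be extracted from a zigzag-ordered coefficient list as in the following Python sketch. It only derives the symbols and does not perform the table lookups that produce the actual bits. The function name and the dictionary keys are hypothetical, and the run list is simplified: a real coder stops emitting runs once all zeros are accounted for.

```python
def cavlc_symbols(zigzag):
    """Derive CAVLC symbols from coefficients given in zigzag order."""
    nonzero = [(i, c) for i, c in enumerate(zigzag) if c != 0]
    total_coeff = len(nonzero)
    if total_coeff == 0:
        return {"total_coeff": 0, "trailing_ones": 0, "trailing_signs": [],
                "levels": [], "total_zeros": 0, "runs_before": []}

    positions = [i for i, _ in nonzero]
    # Zeros located before the last nonzero coefficient
    total_zeros = positions[-1] + 1 - total_coeff

    # Process coefficients in reverse order (highest frequency first)
    reverse = [c for _, c in reversed(nonzero)]
    trailing_ones = 0
    for c in reverse:
        if abs(c) == 1 and trailing_ones < 3:   # at most three trailing ones
            trailing_ones += 1
        else:
            break

    trailing_signs = ["+" if c > 0 else "-" for c in reverse[:trailing_ones]]
    levels = reverse[trailing_ones:]            # remaining nonzero levels

    # Zeros immediately before each coefficient, reverse order; the run of
    # the lowest-frequency coefficient is implied and therefore omitted.
    runs_before = [positions[k] - positions[k - 1] - 1
                   for k in range(total_coeff - 1, 0, -1)]

    return {"total_coeff": total_coeff, "trailing_ones": trailing_ones,
            "trailing_signs": trailing_signs, "levels": levels,
            "total_zeros": total_zeros, "runs_before": runs_before}
```

For the block `[0, 3, 0, 1, -1, -1, 0, 1]` followed by eight zeros, this yields five nonzero coefficients, three trailing ones with signs +, -, -, remaining levels 1 and 3, and three total zeros.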

Chapter 3 H.264/AVC Intra Frame Codec

H.264/AVC is widely regarded as the next-generation video standard that will replace the existing MPEG-2 standards. Spatial-domain intra coding is a newly supported technique that is suitable not only for moving video coding but also for still image compression. However, common digital signal processors can hardly cope with its high computational complexity and large data throughput.

In this chapter, a parallel H.264/MPEG-4 AVC baseline profile intra frame codec supporting both encoding and decoding is proposed for digital camera and video applications. This work is mainly based on the previous architecture [11], modified to fit the decoding procedure. The proposed chip supports real-time encoding of high definition (HD) 720p video (1280x720, 4:2:0) at 30 fps at 117MHz, and decoding of 1080p video (1920x1080, 4:2:0) at 30 fps at 58MHz. When clocked at its highest frequency of 125MHz, the design can encode still images at 29.62M pixels per second or decode them at 135.60M pixels per second. The results of this work are also published in [12].

3.1. Hardware Oriented Algorithm

3.1.1. Enhanced SATD Function for Mode Decision

The cost function for mode decision is the most important factor in the coding performance of intra-only H.264/AVC. One way to find the best-matched prediction mode is to use RDO. Although RDO provides the best performance, its complexity hinders its use in hardware design. Thus, the SATD method is adopted to calculate the costs.

The main issue then becomes how to choose the transform for the SATD computation. The transform used in SATD should be computationally simple yet effective in estimating the energy of the signals. In [10], a pure 4x4 DHT is adopted for mode decision, but it is far from the real transform used in the encoding process. A better choice for SATD should approximate the effect of the transform and quantization used in H.264/AVC encoding so as to estimate the real bitrate. Therefore, previous works [13][14] use the 4x4 integer transform. Although this approach performs better than the DHT, it is still not good enough, because the fractional multiplication factors are not taken into consideration. A complete transform function in H.264/AVC includes both the integer transform and the multiplication factors in the quantization formulas, as shown in (6) and (7). However, incorporating these factors into the cost function directly would cost a lot of computation because they are not simple integers. Besides, the factors cannot be derived directly from the formulas, since the quantization parameters must also be included.

To solve these problems, this work adopts the cost function proposed in [11], which combines the integer transform with simplified multiplication factors. The simplified factors are derived from the quantization coefficients shown in Table 2 and Table 3, so that the effects of both transform and quantization are considered. From these tables, we obtain the required scalar factors by approximating the ratios among the reciprocals of the de-quantization coefficients and simplifying them to integers to reduce computational complexity, as shown in (9) and (10).

$$\text{1/dequant\_coef:}\quad p(0,0)^{-1} : p(0,1)^{-1} : p(1,1)^{-1} \approx 32 : 25 : 20 \tag{10}$$

In (9) and (10), the symbol p(x,y) represents the quantization and de-quantization coefficients at different positions in Table 2 and Table 3 respectively. The scaling factors derived from the de-quantization table are adopted in consideration of the final performance and implementation cost, as shown in (11). In this formula, a division by 32 is added to keep the cost values from growing; in hardware it is carried out by simple low-cost wiring. As a result, the cost function estimates the energy of the residuals after transform and quantization more accurately than other methods, while keeping the computation simple and suitable for hardware implementation. It provides better quality than that in [10] and can be used to compensate for the quality loss caused by the plane mode removal discussed in Subsection 3.1.2.

$$
C(X) = \left(
\begin{bmatrix} 1 & 1 & 1 & 1 \\ 2 & 1 & -1 & -2 \\ 1 & -1 & -1 & 1 \\ 1 & -2 & 2 & -1 \end{bmatrix}
X
\begin{bmatrix} 1 & 2 & 1 & 1 \\ 1 & 1 & -1 & -2 \\ 1 & -1 & -1 & 2 \\ 1 & -2 & 1 & -1 \end{bmatrix}
\right) \otimes
\begin{bmatrix} 32 & 25 & 32 & 25 \\ 25 & 20 & 25 & 20 \\ 32 & 25 & 32 & 25 \\ 25 & 20 & 25 & 20 \end{bmatrix} / 32 \tag{11}
$$
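As an illustration of (11), the following Python sketch computes a cost for one 4x4 residual block. It is a minimal sketch under stated assumptions, not the hardware datapath: the function name `enhanced_satd` is hypothetical, and the choice to sum the absolute weighted coefficients first and divide by 32 once at the end (a right shift by 5) is an assumption about the order of operations.

```python
# Forward 4x4 integer transform matrix of H.264/AVC, as used in (11).
CF = [[1, 1, 1, 1],
      [2, 1, -1, -2],
      [1, -1, -1, 1],
      [1, -2, 2, -1]]

# Position-dependent integer scaling factors from (11).
SCALE = [[32, 25, 32, 25],
         [25, 20, 25, 20],
         [32, 25, 32, 25],
         [25, 20, 25, 20]]


def matmul(a, b):
    """Plain 4x4 integer matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def transpose(m):
    return [list(row) for row in zip(*m)]


def enhanced_satd(residual):
    """Cost of a 4x4 residual block: weighted sum of |CF * X * CF^T|, divided by 32.

    Assumption: the absolute values are accumulated before the single shift.
    """
    coeffs = matmul(matmul(CF, residual), transpose(CF))
    total = sum(abs(coeffs[i][j]) * SCALE[i][j]
                for i in range(4) for j in range(4))
    return total >> 5   # division by 32


if __name__ == "__main__":
    flat = [[1] * 4 for _ in range(4)]
    print(enhanced_satd(flat))
```

For a flat residual only the DC coefficient is nonzero, and its factor 32/32 leaves it unscaled, which is why the factor 32 is convenient for the most common positions.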

3.1.2. Intra Plane Mode Removal

The intra prediction modes can be organized systematically into four types according to their prediction properties and computational complexity, as illustrated in Fig. 10. The bypass type is easy to implement since the prediction samples are the same as the boundary pixels. In the average type, the eight neighboring pixels (for 4x4 prediction) or 32 neighboring pixels (for 16x16 prediction) are summed and divided to give one average value for all prediction samples. The linear type contains most of the directional 4x4 prediction modes, whose samples are linearly interpolated from the boundary pixels. Finally, in the bilinear type, also known as plane prediction, the samples are derived by an approximation of the bilinear transform. Though simplified to integer arithmetic only, the plane mode is still much more computationally complex than the other modes. Besides, its results are hard to reuse for other predictions, and it occupies almost half of the area of the intra prediction unit. The detailed computation of plane prediction is given in Fig. 11.

[Figure: four panels labeled Bypass, Linear, Average, and Bilinear]

Fig. 10 Four categorized types of intra prediction modes

A solution to this problem is to eliminate the plane mode from intra prediction and replace it with other modes, which raises the issue of performance loss. Table 4 shows the probability distribution of the 16x16 prediction modes in different sequences. On average, only 4.2% of macroblocks are predicted in plane mode, and the share is no larger than 5.7% except for the sequence “Akiyo”, which contains much smoother texture. Moreover, our simulations show that prediction with the plane mode reduces the bitrate by only about 1% compared with prediction without it for these video sequences. This 1% bitrate loss can easily be compensated by the enhanced cost function proposed in Subsection 3.1.1. With this modification, we achieve almost the same result as [10] while saving much computation and hardware cost.

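(a) Luma 16x16

$$
\begin{aligned}
h &= \sum_{i=1}^{8} i\,[\,p(7+i,-1) - p(7-i,-1)\,] \\
v &= \sum_{i=1}^{8} i\,[\,p(-1,7+i) - p(-1,7-i)\,] \\
a &= 16\,[\,p(-1,15) + p(15,-1)\,] \\
b &= (5h + 32) \gg 6, \qquad c = (5v + 32) \gg 6 \\
\mathrm{Pred}(x,y) &= [\,a + b(x-7) + c(y-7) + 16\,] \gg 5
\end{aligned}
$$

(b) Chroma 8x8

$$
\begin{aligned}
h &= \sum_{i=1}^{4} i\,[\,p(3+i,-1) - p(3-i,-1)\,] \\
v &= \sum_{i=1}^{4} i\,[\,p(-1,3+i) - p(-1,3-i)\,] \\
a &= 16\,[\,p(-1,7) + p(7,-1)\,] \\
b &= (17h + 16) \gg 5, \qquad c = (17v + 16) \gg 5 \\
\mathrm{Pred}(x,y) &= [\,a + b(x-3) + c(y-3) + 16\,] \gg 5
\end{aligned}
$$

Fig. 11 Intra plane mode for (a) 16x16 (b) 8x8 predictions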
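To make the cost of the plane mode concrete, the luma 16x16 case of Fig. 11 can be sketched in Python as follows. The function name `plane_pred_16x16` and its argument layout are hypothetical; the final clipping to the 8-bit range follows the standard rather than Fig. 11, which omits it.

```python
def plane_pred_16x16(top, left, corner):
    """Luma 16x16 plane prediction.

    top[x]  = p(x, -1) for x = 0..15 (row above the macroblock)
    left[y] = p(-1, y) for y = 0..15 (column left of the macroblock)
    corner  = p(-1, -1)
    Returns pred[y][x] as a 16x16 list of lists.
    """
    def p_top(x):
        return corner if x < 0 else top[x]

    def p_left(y):
        return corner if y < 0 else left[y]

    # Weighted horizontal and vertical gradients of the boundary pixels.
    h = sum(i * (p_top(7 + i) - p_top(7 - i)) for i in range(1, 9))
    v = sum(i * (p_left(7 + i) - p_left(7 - i)) for i in range(1, 9))
    a = 16 * (left[15] + top[15])
    b = (5 * h + 32) >> 6
    c = (5 * v + 32) >> 6

    def clip(value):
        # Clipping to 0..255 is taken from the standard (8-bit samples).
        return max(0, min(255, value))

    return [[clip((a + b * (x - 7) + c * (y - 7) + 16) >> 5)
             for x in range(16)]
            for y in range(16)]


if __name__ == "__main__":
    pred = plane_pred_16x16([100] * 16, [100] * 16, 100)
    print(pred[0][0], pred[15][15])
```

Every output sample needs two multiplications by position-dependent offsets in addition to the two weighted boundary sums, which is the extra work the other prediction types avoid.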

Table 4 Probability distribution of 16x16 modes in different sequences with 300 I-frames at QP=28

                              16x16 modes
Sequence      Total ratio     Vertical   Horizontal   DC       Plane
Mobile         3.3%           0.8%        1.0%         1.3%     0.2%
Coastguard    10.7%           0.8%        3.8%         4.6%     1.6%
Stefan        20.7%           3.4%       12.8%         2.0%     2.5%
Paris         15.5%           3.2%        4.6%         4.4%     3.4%
Foreman       23.1%           5.1%        4.3%         8.0%     5.7%
Akiyo         47.5%           5.3%        4.8%        25.7%    11.8%

3.1.3. Simulation Results

Table 5 compares the encoding results for six CIF-size sequences with all intra frames at different QPs among four algorithms: the original SATD function in [10], the SAITD algorithm in [14], the enhanced SATD cost function, and the proposed method combining the enhanced function with plane mode removal. In most cases, the proposed SATD cost function achieves better coding efficiency than [10] and [14], with almost the same or even better PSNR quality. The enhanced algorithm reduces the bitrate by 0.08% on average over all sequences. After it is combined with plane mode removal, the bitrate increase is compensated and is no larger than 0.06% on average.

Table 5 Comparison among original code [10], SAITD algorithm in [14], and the two proposed algorithms for coding of 300 intra frames

                 JM 8.6 [9]                          SAITD Algorithm                  Enhanced SATD Cost Function      Enhanced SATD Cost Function + Plane Mode Removal
Sequence    QP   SNR Y  SNR U  SNR V  Bit-rate       SNR Y  SNR U  SNR V  Bit-rate    SNR Y  SNR U  SNR V  Bit-rate    SNR Y  SNR U  SNR V  Bit-rate
Stefan      16   46.38  47.27  47.43  10537.41       +0.10  +0.00  +0.00  +0.11%      +0.06  +0.00  -0.01  -0.16%      +0.06  +0.00  +0.00  -0.11%
            20   42.96  44.43  44.51   8143.18       +0.05  -0.03  -0.03  +0.07%      +0.02  -0.03  -0.03  -0.18%      +0.02  -0.03  -0.03  -0.12%
            24   39.63  41.63  41.65   6189.86       -0.02  -0.14  -0.13  +0.15%      -0.04  -0.14  -0.13  -0.22%      -0.04  -0.14  -0.12  -0.14%
            28   36.41  38.95  38.96   4585.29       -0.02  -0.23  -0.24  +0.15%      -0.05  -0.23  -0.24  -0.24%      -0.05  -0.23  -0.25  -0.16%
            32   33.05  37.12  37.08   3246.41       -0.05  -0.41  -0.43  +0.34%      -0.08  -0.41  -0.43  -0.21%      -0.08  -0.42  -0.45  -0.15%
            36   29.96  35.15  35.07   2218.46       -0.06  -0.48  -0.52  +0.95%      -0.10  -0.49  -0.52  -0.09%      -0.11  -0.53  -0.55  -0.06%
Mobile      16   45.93  46.25  46.29  15361.27       +0.04  +0.01  +0.01  +0.05%      +0.04  +0.01  +0.01  -0.10%      +0.04  +0.01  +0.01  -0.08%
            20   42.14  42.87  42.89  12233.46       +0.05  -0.03  -0.04  +0.07%      +0.03  -0.03  -0.04  -0.12%      +0.03  -0.03  -0.03  -0.09%
            24   38.49  39.72  39.68   9487.00       +0.01  -0.11  -0.10  +0.17%      -0.01  -0.11  -0.10  -0.14%      -0.01  -0.11  -0.10  -0.10%
            28   35.04  36.88  36.76   7179.38       +0.03  -0.17  -0.18  +0.26%      +0.01  -0.17  -0.18  -0.16%      +0.01  -0.16  -0.18  -0.12%
            32   31.50  34.89  34.67   5200.07       +0.02  -0.24  -0.24  +0.52%      -0.01  -0.24  -0.24  -0.14%      -0.01  -0.24  -0.23  -0.09%
            36   28.28  32.87  32.60   3572.40       +0.01  -0.31  -0.32  +1.08%      -0.05  -0.31  -0.22  -0.05%      -0.05  -0.31  -0.22  +0.01%
Paris       16   46.16  47.35  47.63  10114.93       +0.08  -0.05  -0.05  +0.12%      +0.06  -0.05  -0.05  -0.09%      +0.06  -0.05  -0.04  -0.00%
            20   42.81  44.69  44.88   7662.99       +0.02  -0.21  -0.16  +0.25%      -0.01  -0.21  -0.16  -0.08%      -0.01  -0.20  -0.15  -0.01%
            24   39.59  41.97  42.12   5742.80       -0.05  -0.35  -0.28  +0.49%      -0.09  -0.36  -0.28  -0.06%      -0.09  -0.36  -0.28  +0.05%
            28   36.49  39.40  39.54   4235.43       -0.05  -0.50  -0.34  +0.70%      -0.20  -0.50  -0.34  -0.06%      -0.10  -0.52  -0.36  +0.09%
            32   33.33  37.51  37.73   3005.78       -0.14  -0.61  -0.52  +0.82%      -0.20  -0.61  -0.52  -0.11%      -0.20  -0.65  -0.52  +0.08%
            36   30.36  35.58  35.78   2055.30       -0.18  -0.64  -0.54  +1.00%      -0.25  -0.64  -0.55  -0.13%      -0.25  -0.67  -0.56  +0.08%
Akiyo       16   47.34  48.46  49.40   4159.54       +0.16  -0.06  -0.11  +0.95%      +0.09  -0.05  -0.12  -0.10%      +0.10  -0.06  -0.12  +0.42%
            20   44.89  46.88  47.95   2777.52       +0.04  -0.26  -0.33  +0.83%      -0.05  -0.26  -0.34  -0.16%      -0.05  -0.30  -0.38  +0.18%
            24   42.65  44.62  46.03   1941.72       -0.14  -0.35  -0.58  +1.03%      -0.23  -0.35  -0.58  +0.18%      -0.25  -0.37  -0.64  +0.56%
            28   40.33  42.52  43.93   1370.30       -0.18  -0.53  -0.77  +1.49%      -0.33  -0.54  -0.76  +0.27%      -0.35  -0.63  -0.85  +0.62%
            32   37.77  40.82  42.54    963.42       -0.32  -0.53  -0.89  +1.71%      -0.53  -0.52  -0.86  -0.66%      -0.58  -0.69  -1.09  +0.30%
            36   35.28  38.80  40.69    672.96       -0.38  -0.45  -0.78  +2.80%      -0.65  -0.50  -0.75  +0.17%      -0.70  -0.64  -0.98  +0.28%
Foreman     16   46.26  47.56  48.57   7665.68       +0.10  +0.03  -0.01  +0.28%      +0.08  +0.03  -0.01  -0.14%      +0.08  +0.03  +0.00  -0.04%
            20   42.90  45.17  46.83   5394.72       +0.13  -0.08  -0.21  +0.59%      +0.09  -0.08  -0.21  -0.13%      +0.09  -0.08  -0.21  -0.00%
            24   39.95  42.87  44.78   3678.28       +0.04  -0.25  -0.33  +1.20%      -0.03  -0.24  -0.33  -0.08%      -0.03  -0.25  -0.35  +0.07%
            28   37.26  40.91  42.79   2467.35       +0.03  -0.32  -0.42  +1.80%      -0.06  -0.32  -0.42  -0.08%      -0.06  -0.34  -0.46  +0.10%
            32   34.61  39.80  41.32   1598.24       -0.05  -0.41  -0.54  +2.62%      -0.19  -0.41  -0.54  +0.03%      -0.20  -0.46  -0.60  +0.30%
            36   32.24  38.61  39.81   1022.62       -0.08  -0.39  -0.49  +4.19%      -0.31  -0.39  -0.49  -0.00%      -0.32  -0.45  -0.57  +0.53%
Coastguard  16   45.88  48.20  49.05   9454.85       +0.04  +0.06  +0.05  +0.15%      +0.04  +0.06  +0.05  -0.20%      +0.04  +0.06  +0.05  -0.18%
            20   42.13  46.38  47.64   6969.04       +0.10  -0.04  -0.10  +0.30%      +0.09  -0.04  -0.10  -0.21%      +0.09  -0.05  -0.11  -0.20%
            24   38.74  44.64  46.13   4959.03       +0.08  -0.21  -0.12  +0.74%      +0.06  -0.21  -0.11  -0.16%      +0.06  -0.24  -0.13  -0.15%
            28   35.63  43.08  44.72   3437.66       +0.15  -0.27  -0.15  +1.33%      +0.12  -0.27  -0.15  -0.11%      +0.12  -0.35  -0.18  -0.14%
            32   32.74  41.96  43.72   2236.41       +0.06  -0.31  -0.17  +1.99%      +0.00  -0.31  -0.16  -0.13%      +0.00  -0.44  -0.23  -0.17%
            36   30.24  40.82  42.72   1418.33       -0.04  -0.28  -0.11  +2.68%      -0.15  -0.27  -0.11  -0.17%      -0.14  -0.39  -0.14  -0.03%

[Plot: Stefan CIF; x-axis: Kbits/frame, y-axis: SNR Y (dB); curves: JM 8.6, Proposed]

Fig. 12 RD curves of [10] and proposed algorithm for sequence “Stefan”

[Plot: Mobile CIF; x-axis: Kbits/frame, y-axis: SNR Y (dB); curves: JM 8.6, Proposed]

[Plot: Paris CIF; x-axis: Kbits/frame, y-axis: SNR Y (dB); curves: JM 8.6, Proposed]

Fig. 14 RD curves of [10] and proposed algorithm for sequence “Paris”

[Plot: Akiyo CIF; x-axis: Kbits/frame, y-axis: SNR Y (dB); curves: JM 8.6, Proposed]

[Plot: Foreman CIF; x-axis: Kbits/frame, y-axis: SNR Y (dB); curves: JM 8.6, Proposed]

Fig. 16 RD curves of [10] and proposed algorithm for sequence “Foreman”

[Plot: Coastguard CIF; x-axis: Kbits/frame, y-axis: SNR Y (dB); curves: JM 8.6, Proposed]

The RD curves of [10] and our proposed combined algorithm for these six sequences are shown in Fig. 12 to Fig. 17. The QP range for these diagrams is 20 to 32, except for “Akiyo”, whose range is 16 to 28 to show the characteristic of its curve clearly. The curves of our algorithm are very close to the original ones, and in high-bitrate coding with lower QPs the performance is even better. This algorithm-level optimization makes the final hardware design simpler while maintaining good video quality.

3.2. System Level Scheme

3.2.1. Analysis of Hardware Complexity

To achieve the throughput required for the target video size, the hardware complexity must be analyzed before design. For an H.264/AVC codec, the computational complexity of the encoder is much higher than that of the decoder, since the encoder evaluates all prediction modes whereas the decoder processes exactly one. Table 6 shows the data throughput for different video sizes at 30 fps. For our target, HD 720p, a data throughput of at least 108,000 macroblocks per second is needed, which is equivalent to 27.65M pixels per second.

Table 6 Data throughput for different video sizes

                              Data Throughput
Video Size                    Mega pixels/sec   kilo MBs/sec
QCIF     176 x 144             0.76               2.97
CIF      352 x 288             3.04              11.88
ITU-R    720 x 576            12.44              48.60
SDTV     720 x 480            10.37              40.50
HDTV     1280 x 720           27.65             108.00
HDTV     1920 x 1080          62.21             243.00

intra coding. To simplify the estimation, we neglect the cycles of data transfer between on-chip and off-chip memory and consider only the prediction cycles. Under this assumption, with plane modes removed and other operations excluded, the total cycle count for predicting one macroblock in the encoder, including one luma and two chroma components, is 3456 (16x16x9+16x16x3+2x16x4x3). For this cycle count, the required encoder frequency is 373.25 MHz for HD 720p and 839.81 MHz for HD 1080p. This speed requirement is far beyond the generally acceptable range of common processors and is hard to implement.

Table 7 Frequency for N-pixel parallel encoder

                              Frequency at N-Parallel (MHz)
Video Size                    N=1       N=2       N=4       N=16
QCIF     176 x 144             10.26      5.13      2.57     0.64
CIF      352 x 288             41.06     20.53     10.26     2.57
ITU-R    720 x 576            167.96     83.98     41.99    10.50
SDTV     720 x 480            139.97     69.98     34.99     8.75
HDTV     1280 x 720           373.25    186.62     93.31    23.33
HDTV     1920 x 1080          839.81    419.90    209.95    52.49

Table 8 Frequency for N-pixel parallel decoder

                              Frequency at N-Parallel (MHz)
Video Size                    N=1       N=2       N=4       N=16
QCIF     176 x 144              1.14      0.57      0.29     0.07
CIF      352 x 288              4.56      2.28      1.14     0.29
ITU-R    720 x 576             18.66      9.33      4.67     1.17
SDTV     720 x 480             15.55      7.78      3.89     0.97
HDTV     1280 x 720            41.47     20.74     10.37     2.59
HDTV     1920 x 1080           93.31     46.66     23.33     5.83

Consequently, we apply pixel parallelism to reduce the required frequency. Table 7 and Table 8 show the estimates for the encoder and decoder respectively. For the encoder, a suitable choice for HD 720p is the four-pixel parallel architecture, which runs at 93.31MHz and needs 864 cycles per macroblock. Under the same conditions, the four-pixel parallel decoder needs only 96 cycles per macroblock and 10.37MHz for 720p. Thus, our decoder can support a larger size such as HD 1080p at 23.33MHz. This design target meets the real-time requirement and is also easy to implement.
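The estimates in Table 7 and Table 8 follow from simple arithmetic, sketched below in Python. The encoder cycle count is the 3456 cycles per macroblock derived above; the decoder figure of 384 cycles per macroblock (one cycle per luma and chroma pixel) is an assumption inferred from the 96 cycles quoted for the four-pixel parallel decoder. The function names are hypothetical.

```python
FPS = 30

# Cycles per macroblock at one pixel per cycle.
ENC_CYCLES_PER_MB = 16 * 16 * 9 + 16 * 16 * 3 + 2 * 16 * 4 * 3   # 3456, from the text
DEC_CYCLES_PER_MB = 16 * 16 + 2 * 8 * 8                          # 384, inferred assumption


def macroblocks_per_second(width, height, fps=FPS):
    """Macroblock rate, counting 256 luma pixels per macroblock."""
    return width * height / 256 * fps


def required_mhz(width, height, cycles_per_mb, n_parallel):
    """Clock frequency in MHz needed to process the video in real time."""
    cycles_per_second = macroblocks_per_second(width, height) * cycles_per_mb
    return cycles_per_second / n_parallel / 1e6


if __name__ == "__main__":
    for name, (w, h) in {"720p": (1280, 720), "1080p": (1920, 1080)}.items():
        for n in (1, 2, 4, 16):
            enc = required_mhz(w, h, ENC_CYCLES_PER_MB, n)
            dec = required_mhz(w, h, DEC_CYCLES_PER_MB, n)
            print(f"{name} N={n}: encoder {enc:.2f} MHz, decoder {dec:.2f} MHz")
```

With N=4 the encoder needs 3456/4 = 864 cycles per macroblock, matching the figure quoted for the four-pixel parallel architecture.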

3.2.2. Macroblock Level Pipelining

The estimate in Subsection 3.2.1 counts only the cycles for intra prediction. More cycles are required when other functions, such as data transfer between memories and entropy coding, are considered. These operations add to the cycle count per macroblock and result in a higher operating frequency. For example, the CAVLC circuit in [15] takes about 500 cycles to encode a macroblock of high-quality video. This would increase the encoder latency to around 1,400 cycles per macroblock and the frequency to about 150MHz. In addition, the zigzag scan in the CAVLC unit also increases the operation cycles, since its scan order is quite different from the raster scan used in the prediction engine.

Fig. 18 Ping-pong architecture with macroblock level pipelining for encoder

To solve the above problems, macroblock level pipelining is used, as shown in Fig. 18. This pipeline partition allows intra prediction and entropy coding to execute in overlap without a large increase in cycles. The cycle count per macroblock is determined by the longest latency among the processing units in the pipeline. Besides, the design adopts a ping-pong architecture, with a two-bank memory located between the prediction loop and entropy coding, to resolve the ordering problem. In this architecture, the quantized coefficients of the macroblock currently being predicted are written to one bank of the ping-pong buffer, while the coefficients of the previously predicted macroblock are held in the other bank, ready for CAVLC coding. The ping-pong buffers also benefit the decoding process, where data flows in the inverse direction, by reordering data and smoothing the processing rate.
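The effect of macroblock level pipelining and the role of the two-bank buffer can be modeled with a small Python sketch. It is a behavioral illustration only: the class `PingPongBuffer` and the two helper functions are hypothetical names, and the stage latencies of 864 and 500 cycles are the estimates quoted in this section.

```python
def cycles_per_mb_sequential(stage_cycles):
    """Without pipelining, a macroblock pays for every stage in turn."""
    return sum(stage_cycles)


def cycles_per_mb_pipelined(stage_cycles):
    """With macroblock level pipelining, the slowest stage sets the rate."""
    return max(stage_cycles)


class PingPongBuffer:
    """Two-bank buffer: one bank is written while the other is read."""

    def __init__(self):
        self.banks = [None, None]
        self.write_bank = 0

    def write(self, coefficients):
        # The prediction loop stores the current macroblock here.
        self.banks[self.write_bank] = coefficients

    def read(self):
        # The entropy coder reads the previously completed macroblock.
        return self.banks[self.write_bank ^ 1]

    def swap(self):
        # Called at each macroblock boundary to exchange the bank roles.
        self.write_bank ^= 1


if __name__ == "__main__":
    stages = [864, 500]   # prediction loop, CAVLC (estimates from the text)
    print(cycles_per_mb_sequential(stages), cycles_per_mb_pipelined(stages))

    buf = PingPongBuffer()
    buf.write("MB0 coefficients")
    buf.swap()
    buf.write("MB1 coefficients")   # prediction works on MB1 ...
    print(buf.read())               # ... while CAVLC reads MB0
```

Because the banks are never read and written at the same time, each bank can be a single-port memory, and the entropy coder is free to read its bank in a different order from the one in which it was written.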

3.3. Architecture Design of Intra Codec

3.3.1. Overall Architecture

[Figure: block diagram of the codec. Prediction phase and reconstruction phase (4 pixels/cycle): source buffer (96x32, single port), intra prediction generator with boundary registers for intra 4x4 and intra 16x16, DCT/DHT, cost generation and mode decision, Q, IQ, IDCT/IDHT, reconstruction FIFO registers, and external upper line buffer with its controller. Bitstream phase (1 coef./cycle): ping-pong coefficient buffer (104x64x2, single port) and CAVLC codec.]

Fig. 19 Proposed architecture of baseline intra frame codec

Fig. 19 shows the architecture of the proposed codec, derived from the previous work [11] and based on the algorithm-level optimization and system-level pipelining described in Sections 3.1 and 3.2. The design corresponds directly to both the encoding flow in Fig. 2 and the decoding flow in Fig. 3. It works as either an intra frame encoder or a decoder, selected by the three switch multiplexers shown in Fig. 19. The architecture consists of three operation phases: the prediction phase, the reconstruction phase, and the bitstream phase.

The prediction phase is the most important part of this design. It mainly contains the intra prediction generator, the forward transform, the cost generation and mode decision unit, quantization, and some buffers and registers. The reconstruction phase, which reconstructs the decoded data, is composed of the inverse transform, de-quantization, and the reconstruction FIFO registers. Four-pixel parallelism is used in these two phases to achieve the required throughput. The bitstream phase is separated from the previous two phases by the ping-pong buffer and uses the CAVLC codec to code or decode the bitstream with a throughput of at least one coefficient per cycle. Compared with the previous encoder-only design [13], the plane prediction buffer and the dual-port memory are eliminated thanks to the algorithm optimizations.

[Figure: encoder dataflow of the design, drawn on the blocks of Fig. 19 with the Diff and Add units]

Fig. 20 shows the encoder dataflow of the design, where the unused datapath is shown in a light tint. First, the intra prediction unit generates the prediction values of the various modes for the block being predicted, as directed by the schedule control unit. The residuals, obtained as the difference between the prediction samples and the original data, are then transformed by the 4x4 integer transform unit. The transformed coefficients are used to compute the cost for deciding the best mode, and the block with minimum cost is kept in the block buffer. Once the best mode is chosen, its coefficients are quantized and then, at the same time, stored in the ping-pong memory for entropy coding and sent to the reconstruction phase to be decoded into boundary samples for the next block prediction.

Fig. 21 Decoder dataflow of the design

The decoder flow is shown in Fig. 21. Unlike the encoding loop in Fig. 20, the decoder is a straight-through datapath without a loop. The coefficients decoded by the CAVLC decoder in the previous macroblock-level pipeline stage pass through de-quantization and the inverse 4x4 transform to be recovered as residuals. Prediction values, generated by the intra prediction generator according to the mode information decoded by the UVLC decoder, are added to the residuals for reconstruction. All decoded blocks are then sent to the source memory for output and to the boundary registers for the next prediction. Components unused in the decoder, such as mode decision, forward transform, quantization, and the predictor FIFO buffer, are shut down to save power. Each component is discussed in detail in the following subsections.

3.3.2. Schedule of Codec

Because of the variety of prediction modes, the encoder needs many more processing cycles than the decoder and thus limits the performance of the codec. Another performance bottleneck in the encoder is the reconstruction feedback loop, since the next 4x4 block cannot start its computation until its boundary samples have been reconstructed from previous blocks. This may result in low hardware utilization and longer latency in the prediction phase. In addition, when 16x16 intra predictions are performed, a macroblock-size buffer could be needed to store the processed 4x4 residual data for later mode decision, which raises the hardware cost.

To solve these data dependency problems and eliminate the need for a large buffer, we propose three scheduling techniques, adopted in the scheduling control unit, as follows:

1. Insertion of the 16x16 and 8x8 predictions

The 16x16 or 8x8 intra prediction process is inserted into the bubble cycles between two intra 4x4 blocks, which would otherwise be spent waiting for reconstructed samples, to pre-compute the costs of these modes. Unlike the technique used in [13], the prediction generator predicts four blocks of one 16x16 mode successively in each bubble, instead of one block for four modes. This reduces the number of registers needed to accumulate the costs of the 16x16 prediction. After processing four blocks, the generator continues with the next 4x4 prediction. Thus, the utilization of the components in the prediction phase is improved.

2. Early start of next block prediction

Since the 4x4 blocks are processed in Z-scan order, the upper and left boundary samples might not become available at the same time. To avoid this problem and start the next block earlier, we rearrange the processing order of the prediction modes so that each mode starts as soon as its required data is available. For example, the vertical mode is processed before the horizontal mode when the left boundary pixels are not yet available. This approach reduces the idle cycles and thus improves the throughput.

3. Recomputation of 16x16 and 8x8 best modes

For the 4x4 block prediction, we use a small buffer to save the residuals of the best mode. However, applying the same strategy to the 16x16 or 8x8 predictions would require a large macroblock-size buffer. To solve this problem, we discard the data generated during the prediction process and, if a 16x16 or 8x8 mode is selected as the best mode, recompute its data after the prediction process. This approach may increase the total encoding cycles, but the increase stays within an acceptable range and the buffer cost is reduced.

Fig. 22 shows the pipelined schedule of this codec for processing a macroblock. With the above techniques, encoding a macroblock takes at most 1,080 cycles, when the 16x16 prediction is selected as the best mode for the luma component. If the 4x4 prediction is chosen, the optional recomputation cycles between 956 and 1,024 in Fig. 22 are eliminated, and the total decreases to 1,012 cycles. Compared with the previous design [13], this work saves 16% of the cycle count. Decoding a macroblock takes 236 cycles in the average case.

Fig. 22 Pipelined schedule for codec design

3.3.3. Intra Prediction Generation Unit

Fig. 23 shows the proposed intra prediction generation unit, with plane prediction removed. All thirteen intra prediction modes are organized into four types as described in Fig. 10. Without the complex plane modes, the remaining modes can easily share and reuse the partial sums of adjacent pixels to compute different values. To support such computation sharing, the datapath of the generator can be reconfigured to handle the different modes of 4x4, 16x16, and 8x8 predictions.

Fig. 24 Examples of operations for four intra prediction modes

The operations of this unit for four example prediction modes are illustrated in Fig. 24. First, the input pixels are selected from the boundary register buffers, which store the neighboring pixels of previously reconstructed blocks. For the bypass type, such as the vertical and horizontal modes, the predictor simply outputs the input values directly. For the linear type, from mode three to mode eight, the desired values are obtained by reusing partial sums: the first-level adders generate values like (A+B+1), and the second-level adders then sum adjacent partial sums to compute (A+2B+C+2). The average type, so-called DC prediction, sums a total of eight pixels of the neighboring blocks, four from the upper block and four from the left one, to obtain the average value in each cycle. For luma 4x4 or chroma 8x8 DC modes, it takes one cycle to calculate the predicted DC value. However, four cycles are required for luma 16x16 DC prediction since up to 32 boundary pixels have to be accumulated. We use extra adders and a register for this accumulation instead of the existing adders in the reconfigurable datapath, which saves control and wiring circuits. After prediction, the values pass through the difference unit to produce the residuals, which are sent to the transform unit for further processing.
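The reuse of partial sums described above can be sketched in Python as follows. The function names are hypothetical, and the final right shifts (by 1 for the two-tap filter, by 2 for the three-tap filter, by 3 for the 4x4 DC value) follow the interpolation filters of the standard rather than anything stated explicitly in this section.

```python
def first_level(pixels):
    """First-level adders: partial sums (A + B + 1) of adjacent boundary pixels."""
    return [a + b + 1 for a, b in zip(pixels, pixels[1:])]


def two_tap(pixels):
    """Two-tap samples (A + B + 1) >> 1, taken directly from the partial sums."""
    return [s >> 1 for s in first_level(pixels)]


def three_tap(pixels):
    """Three-tap samples (A + 2B + C + 2) >> 2.

    Second-level adders add two adjacent partial sums:
    (A + B + 1) + (B + C + 1) = A + 2B + C + 2.
    """
    partial = first_level(pixels)
    return [(s0 + s1) >> 2 for s0, s1 in zip(partial, partial[1:])]


def dc_4x4(upper, left):
    """Average type for a 4x4 block: rounded mean of eight boundary pixels."""
    return (sum(upper) + sum(left) + 4) >> 3


if __name__ == "__main__":
    boundary = [10, 20, 30, 40, 50]
    print(two_tap(boundary))
    print(three_tap(boundary))
    print(dc_4x4([10, 10, 10, 10], [20, 20, 20, 20]))
```

Both filters share the same first-level sums, so the directional modes differ only in which sums are selected and how far they are shifted.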

3.3.4. Transform Unit

The coefficients in the transform matrices (5), (6), and (7) have even or odd symmetry in each row or column, so the transforms can be implemented easily with additions and shifts. The 2-D transform can also be separated into two 1-D transforms with a fast algorithm and butterfly architecture [16]. Since the forward DCT and DHT have the same butterfly structure and do not operate at the same time in the codec, they are merged to save area. A similar architecture is applied to the inverse transform. Though several transform designs have been proposed [17][18][19], this work adopts the same architecture as [20], shown in Fig. 25, to execute the integer DCT and DHT, since it is also four-pixel parallel and has lower hardware cost. In addition, two 4x4 block-size registers, located in the forward and inverse transform units, gather the DC coefficients after the integer DCT and IDCT for the subsequent DHT computation of the DC blocks.
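A minimal Python sketch of the separable butterfly computation for the forward 4x4 integer transform is given below. It illustrates the general fast algorithm with additions and shifts only, not the specific architecture of [20]; the function names are hypothetical.

```python
def forward_1d(x0, x1, x2, x3):
    """1-D forward integer transform of four samples using a butterfly."""
    # First butterfly stage: sums and differences of mirrored inputs.
    s03, d03 = x0 + x3, x0 - x3
    s12, d12 = x1 + x2, x1 - x2
    # Second stage: only additions, subtractions, and shifts by one.
    return (s03 + s12,            # x0 +  x1 +  x2 +  x3
            (d03 << 1) + d12,     # 2x0 + x1 -  x2 - 2x3
            s03 - s12,            # x0 -  x1 -  x2 +  x3
            d03 - (d12 << 1))     # x0 - 2x1 + 2x2 -  x3


def forward_4x4(block):
    """2-D transform as two passes of the 1-D transform (rows, then columns)."""
    rows = [forward_1d(*row) for row in block]
    cols = [forward_1d(*col) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]


if __name__ == "__main__":
    print(forward_1d(1, 2, 3, 4))
    print(forward_4x4([[1] * 4 for _ in range(4)]))
```

Each 1-D pass needs eight additions and two shifts, compared with sixteen multiplications for a direct matrix product.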

Fig. 25 Hardware architecture of transform unit [20]

3.3.5. Quantization and De-quantization

The quantization and de-quantization units are shown in Fig. 26, where only one of the four parallel datapaths is displayed. The constant quantization coefficients are all implemented as look-up tables indexed by QP, taken from Table 2 and Table 3 and denoted by “quant_coef,” “dequant_coef,” “qp_const,” “qp_shift,” and “qp_per” in Fig. 26. A quantized value for entropy coding is obtained through a multiplication, an addition, and a shift. To recover the quantized data, a multiplication followed by rounding and a shift is performed. These forward and inverse quantization processes correspond directly to (6) and (7). The design also uses data guarding to reduce power consumption by skipping zero inputs.
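The multiply, add, and shift sequence can be sketched in Python as below. The coefficient values are the standard H.264/AVC multiplication factors for QP mod 6; the function names, the rounding offset of one third of the step (a common choice for intra blocks in reference encoders), and the omission of the final rounding shift after the inverse transform are assumptions of this sketch rather than details taken from Fig. 26.

```python
# Multiplication factors per (QP mod 6), ordered by position class:
# class 0: positions (0,0), (0,2), (2,0), (2,2)
# class 1: all remaining positions
# class 2: positions (1,1), (1,3), (3,1), (3,3)
QUANT_COEF = [(13107, 8066, 5243), (11916, 7490, 4660), (10082, 6554, 4194),
              (9362, 5825, 3647), (8192, 5243, 3355), (7282, 4559, 2893)]
DEQUANT_COEF = [(10, 13, 16), (11, 14, 18), (13, 16, 20),
                (14, 18, 23), (16, 20, 25), (18, 23, 29)]


def position_class(i, j):
    if i % 2 == 0 and j % 2 == 0:
        return 0
    if i % 2 == 1 and j % 2 == 1:
        return 2
    return 1


def quantize(coef, i, j, qp):
    """Quantize one transform coefficient: multiply, add offset, shift."""
    qp_per, qp_rem = divmod(qp, 6)
    qp_shift = 15 + qp_per
    qp_const = (1 << qp_shift) // 3          # assumed intra rounding offset
    level = (abs(coef) * QUANT_COEF[qp_rem][position_class(i, j)] + qp_const) >> qp_shift
    return -level if coef < 0 else level


def dequantize(level, i, j, qp):
    """Rescale one quantized level before the inverse transform."""
    qp_per, qp_rem = divmod(qp, 6)
    return (level * DEQUANT_COEF[qp_rem][position_class(i, j)]) << qp_per


if __name__ == "__main__":
    level = quantize(160, 0, 0, 28)
    print(level, dequantize(level, 0, 0, 28))
```

Since the step size doubles every six QPs, only six rows of each table need to be stored; larger QPs reuse a row and change the shift amount.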

Fig. 26 Quantization and de-quantization unit

3.3.6. Cost Generation and Mode Decision Unit

After the transform, the coefficients are sent both to the cost generation unit for cost calculation and to the current block registers, where they are stored temporarily to avoid recomputation. Fig. 27 illustrates the cost generation and mode decision unit. The cost generation unit implements the enhanced SATD function in (11) in two stages: the integer transform, which replaces the DHT, and the extra scalar multiplication factor stage. This replacement eliminates the need to recompute the transform in hardware and also improves coding performance. The integer scaling factors are realized with a two-stage adder tree and simple shifters instead of multipliers to reduce the hardware cost and the critical path.

The current cost registers temporarily store the cost value of the current mode. If the current mode is not the most probable mode, a non-zero initial cost taken from the lambda value table is assigned instead of zero. The cost is then compared with the minimum cost value for 4x4 prediction, or accumulated for 16x16 prediction. If a smaller cost is detected, the minimum cost in 4x4 prediction is replaced by the new one, and the coefficients in the current block registers in Fig. 19 are moved to the best block registers. This compare-and-replace procedure is repeated until the best mode with minimum cost is obtained. Finally, the SATD costs of all 4x4 predictions and the best 16x16 prediction are compared to determine which prediction type is used for the macroblock. Similar operations are applied to the chroma components.

Fig. 27 Cost generation and mode decision unit

3.3.7. Reconstruction Path

The quantized residual values are stored in the coefficient buffer for entropy coding and must also be reconstructed immediately, since the intra prediction unit requires their boundary pixels to predict the successive blocks. This reconstruction process in the encoder is also used for decoding. The reconstruction phase as shown in Fig.


