National Chiao Tung University
Department of Electronics Engineering & Institute of Electronics
Master's Thesis
Hardware Efficient Motion Estimation Designs for High Definition Video Compression
Student: Chia‐Chun Lin
Advisor: Tian‐Sheuan Chang
July 2007
A Thesis Submitted to Department of Electronics Engineering & Institute of Electronics, College of Electrical Engineering and Computer Science, National Chiao Tung University, in Partial Fulfillment of the Requirements for the Degree of Master of Science in Electrical Engineering
July 2007
Hsinchu, Taiwan, Republic of China
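Chinese Abstract
Motion estimation has very high complexity within the video coding process and is therefore the bottleneck of real-time video encoding. To overcome the difficulties of motion estimation, which include a heavy computational load, high area cost, and large memory bandwidth, we propose several algorithms and their corresponding hardware architectures. First, we propose a high-performance, low-cost adaptive skip-block prediction algorithm and architecture that reduces computation by predicting the static picture regions that can be skipped. Second, we propose a fast mode decision algorithm that separates and balances the time spent on integer-pixel motion estimation and fractional-pixel motion estimation, thereby improving the parallelism and performance of the overall hardware architecture. In addition, we design different algorithms and corresponding architectures for different application scenarios. For portable devices with small pictures, we propose an efficient, low-cost, low-power adaptive quarter-pixel fast motion estimation design. For large-picture, high-resolution video applications, we provide a highly efficient parallel multi-resolution motion estimation design that supports a large search range and delivers high quality at a low bit rate. Finally, we integrate these methods into a high profile video chip that supports 1080p pictures at 30 frames per second, producing a complete video compression chip.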
Hardware Efficient Motion Estimation Designs for High Definition Video Compression
Student: Chia‐Chun Lin
Advisor: Tian‐Sheuan Chang
Institute of Electronics, National Chiao Tung University
Abstract
Motion estimation (ME) is the most complex part and the bottleneck of a real-time video encoder because of its heavy computation, high area cost, and large memory bandwidth. In this thesis, we propose fast algorithms and architectures to solve these issues. For the fast algorithms, we first introduce a low-cost adaptive skip mode detection algorithm and its architecture to encode the static portion of video efficiently. Second, we present a fast mode decision algorithm that saves hardware computing cycles by separating the integer‐pixel ME and fractional‐pixel ME phases. For the architecture designs, we propose two different ME designs for portable and high definition applications. For portable small-size video gadgets, we propose low-cost, low-power refined quarter motion estimation hardware to solve the cost problem. For large-frame high definition video, we use parallel multi‐resolution motion estimation to offer a large search range. Finally, we integrate these methods into a high profile encoder chip that supports 1080p video at 145 MHz.
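Acknowledgments
First, I would like to thank my advisor, Dr. Tian‐Sheuan Chang, for his support and encouragement over the past two years. He guided me step by step from the basics to deeper study and research, gave me the freedom to develop my own ideas and bring out my full potential and creativity, and offered advice and help whenever I ran into problems or questions. My gratitude to Professor Chang is beyond words. I also thank my oral defense committee members, Department Chair 李鎮宜 of the NCTU Department of Electronics Engineering and Professor 陳永昌 of National Tsing Hua University, for taking time out of their busy schedules to guide me; their valuable comments made this thesis more complete.
Thanks to my good partners in the VSP Lab, especially 林佑昆, the senior labmate who introduced me to the field, led me from zero through solid research and implementation bit by bit, and gave me much pertinent and useful advice. Thanks to senior labmates 張彥中 and 李國龍; the experience and knowledge you passed on have benefited me greatly. Thanks to senior labmates 古君偉 and 王裕仁 for teaching me many concepts and techniques of IC design, and to my classmate 廖英澤 for entering IC design contests with me time after time and sharpening our chip implementation skills together. Thanks to my classmates 李得瑋, 郭子筠, and 吳秈璟; integrating and studying a complete encoder chip with you was a rare experience, I learned a great deal while we solved problems together, and it is an unforgettable memory. Thanks to my junior labmates 蔡宗憲, 曾宇晟, 詹景竹, 張瑋城, and 戴瑋呈; with your company my master's years were full of laughter. Thanks to all the members of the lab; the times we struggled, sweated, and laughed together are precious memories of my time at NCTU.
Thanks to my girlfriend for her constant support and encouragement, which also gave my future path of learning a whole new direction. Thanks also to my friends from the club for their support; traveling and hiking with you was the best way to relieve stress and brought my spirit and creativity to new heights.
Finally, I thank my family, my parents and my younger brother, who have silently supported me; your warmth is the greatest pillar of my efforts.
I dedicate this thesis to all who love me and all whom I love.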
Contents
1. INTRODUCTION
1.1. MOTIVATION
1.2. CONTRIBUTION OF THE THESIS
1.3. ORGANIZATION OF THE THESIS
2. OVERVIEW OF H.264/AVC STANDARD
2.1. OVERVIEW
2.2. CODING STRUCTURE
2.3. INTRA PREDICTION
2.4. INTER PREDICTION
2.5. IN‐LOOP FILTER
2.6. CONTEXT‐BASED ADAPTIVE BINARY ARITHMETIC CODING (CABAC)
3. OVERVIEW OF BLOCK MATCHING MOTION ESTIMATION
3.1. BLOCK‐BASED MOTION ESTIMATION
3.2. THE MATCHING CRITERIA
3.3. QUALITY JUDGMENT
3.4. REVIEW OF MOTION ESTIMATION ALGORITHM
3.4.1. FULL SEARCH ALGORITHM (FSA)
3.4.2. THREE STEPS SEARCH ALGORITHM (3SS)
3.4.3. QUARTER PIXEL MOTION ESTIMATION (QME)
3.4.4. MULTI RESOLUTION MOTION ESTIMATION (MRME)
3.5. REVIEW OF SKIP MODE DETECTION
3.5.1. LAGRANGIAN COST MOTION ESTIMATION
3.5.2. ALL ZERO DCT BLOCKS DETECTION
4.1. INTRODUCTION
4.2. THE FAST SKIP ALGORITHM
4.2.1. 4X4‐BLOCK‐SAD‐THRESHOLD
4.2.2. MB‐ZERO‐BLOCK‐THRESHOLD
4.2.3. SPIKE‐THRESHOLD
4.3. THE FAST SKIP ARCHITECTURE
4.4. EXPERIMENTAL RESULT
4.5. SUMMARY
5. HARDWARE EFFICIENT FAST MODE DECISION ALGORITHM
5.1. INTRODUCTION
5.2. MODE FILTERING ALGORITHM
5.3. THE SIMULATION RESULT OF MODE FILTERING
5.3.1. PERFORMANCE OF QCIF/CIF SEQUENCES
5.3.2. PERFORMANCE OF 720P SEQUENCES
5.4. SUMMARY
6. EFFICIENT LOW COST MOTION ESTIMATION FOR PORTABLE DEVICES
6.1. INTRODUCTION
6.2. REFINED QUARTER MOTION ESTIMATION (RQME)
6.3. MODE FILTERING
6.4. PERFORMANCE ANALYSIS
6.4.1. PERFORMANCE OF MF+RQME
6.4.2. PERFORMANCE OF SKIP+MF+RQME
6.5. THE RQME ARCHITECTURE
6.6. HARDWARE IMPLEMENTATION RESULT
6.7. SUMMARY
7. EFFICIENT LARGE SEARCH RANGE MOTION ESTIMATION FOR HIGH DEFINITION VIDEO COMPRESSION
7.1. INTRODUCTION
7.2. PARALLEL MULTI‐RESOLUTION MOTION ESTIMATION (PMRME)
7.3. MODE FILTERING
7.4. BIT TRUNCATION
7.5. THE PMRME ARCHITECTURE
7.6. SEARCH SCHEDULING
7.7. MEMORY ALLOCATION
7.8. MEMORY SCHEDULE
7.9. EXPERIMENTAL AND IMPLEMENTAL RESULT
7.9.1. PERFORMANCE OF PMRME
7.9.2. PERFORMANCE OF SKIP + PMRME
7.10. IMPLEMENTATION RESULT
7.11. SUMMARY
8. INTEGRATION FOR 1080P H.264/AVC HIGH PROFILE ENCODER
8.1. INTRODUCTION
8.2. SYSTEM ARCHITECTURE FOR BI‐DIRECTION MOTION ESTIMATION
8.3. FRAME‐LEVEL MEMORY SCHEDULING
8.4. BANDWIDTH ANALYSIS
8.5. CONCLUSION
9. CONCLUSION
9.1. ADAPTIVE SKIP MODE DETECTION
9.2. MODE FILTERING
9.3. REFINED QUARTER PIXEL MOTION ESTIMATION
9.4. PARALLEL MULTI RESOLUTION MOTION ESTIMATION
9.5. 1080P HIGH PROFILE ENCODER CHIP
9.6. FUTURE WORK
List of Figures
FIGURE 1 THE CONTRIBUTION OF THE THESIS
FIGURE 2 THE BASIC STRUCTURE OF ENCODER
FIGURE 3 THE BASIC STRUCTURE OF DECODER
FIGURE 4 THE HIERARCHY OF A MACROBLOCK
FIGURE 5 DIFFERENT MODES AND ITS BLOCK SIZE
FIGURE 6 THE MOTION VECTOR AND THE SEARCH RANGE
FIGURE 7 THE SEARCH STEPS OF THREE STEPS SEARCH ALGORITHM
FIGURE 8 THE QME ALGORITHM
FIGURE 9 THE CONVENTIONAL MULTI RESOLUTION ALGORITHM
FIGURE 10 THE SKIP DETECTION FLOW
FIGURE 11 THE SKIPPED MB IN THE SEQUENCE “TABLE” IN FRAMES 171 AND 172
FIGURE 12 THE MB‐ZERO‐BLOCK‐THRESHOLD PREDICTION
FIGURE 13 THE SYSTEM ARCHITECTURE
FIGURE 14 THE SKIP DETECT ARCHITECTURE
FIGURE 15 THE RD CURVE OF LOW MOTION SEQUENCES
FIGURE 16 THE DISTRIBUTION OF MB FOR LOW MOTION SEQUENCES
FIGURE 17 THE RD CURVE OF MEDIUM MOTION SEQUENCES
FIGURE 18 THE DISTRIBUTION OF MB FOR MEDIUM MOTION SEQUENCES
FIGURE 19 THE RD CURVE OF HIGH MOTION SEQUENCES
FIGURE 20 THE DISTRIBUTION OF MB FOR HIGH MOTION SEQUENCES
FIGURE 21 THE RD CURVE OF 720P SEQUENCES
FIGURE 22 THE DISTRIBUTION OF MB FOR 720P SEQUENCES
FIGURE 23 CODING TIME (%) OF CIF SEQUENCES
FIGURE 24 CODING TIME (%) OF 720P SEQUENCES
FIGURE 25 ALGORITHM OF MODE FILTERING
FIGURE 26 THE MODE FILTERING SIMULATION RESULT FOR AKIYO SEQUENCE (QCIF)
FIGURE 27 THE MODE FILTERING SIMULATION RESULT FOR FOREMAN SEQUENCE (QCIF)
FIGURE 28 THE MODE FILTERING SIMULATION RESULT FOR MOBILE SEQUENCE (QCIF)
FIGURE 29 THE MODE FILTERING SIMULATION RESULT FOR AKIYO SEQUENCE (CIF)
FIGURE 30 THE MODE FILTERING SIMULATION RESULT FOR FOREMAN (CIF)
FIGURE 31 THE MODE FILTERING SIMULATION RESULT FOR MOBILE (CIF)
FIGURE 32 THE MODE FILTERING SIMULATION RESULT FOR STOCKHOLM (720P)
FIGURE 34 THE RQME ALGORITHM
FIGURE 35 THE PARTITION OF CURRENT BLOCK AND SEARCH RANGE
FIGURE 36 THE AVERAGED R‐D CURVE OF PROPOSED ALGORITHM
FIGURE 37 SKIP+MF+RQME PERFORMANCE OF QCIF SILENT (LOW MOTION)
FIGURE 38 SKIP+MF+RQME PERFORMANCE OF QCIF CARPHONE (MEDIUM MOTION)
FIGURE 39 SKIP+MF+RQME PERFORMANCE OF QCIF MOBILE (HIGH MOTION)
FIGURE 40 SKIP+MF+RQME PERFORMANCE OF CIF AKIYO (LOW MOTION)
FIGURE 41 SKIP+MF+RQME PERFORMANCE OF CIF CONTAINER (MEDIUM MOTION)
FIGURE 42 SKIP+MF+RQME PERFORMANCE OF CIF HALL (HIGH MOTION)
FIGURE 43 THE PROPOSED ARCHITECTURE
FIGURE 44 THE STRUCTURE OF A QSAD MODULE
FIGURE 45 THE STRUCTURE OF AN ROW QME MODULE
FIGURE 46 THE STRUCTURE OF A PE
FIGURE 47 THE DATA FLOW THE EVEN ROW
FIGURE 48 THE STRUCTURE OF THE SAD TREE
FIGURE 49 POWER ANALYSIS OF RQME DESIGN
FIGURE 50 THE THREE LEVEL PARALLEL MULTI RESOLUTION MOTION ESTIMATION
FIGURE 51 THE CONCEPT OF PMRME
FIGURE 52 THE PROPOSED PMRME ARCHITECTURE
FIGURE 53 THE PRIMITIVE MODULE
FIGURE 54 THE ARCHITECTURE OF SAD MODULE
FIGURE 55 LEVEL 0 ME MODULE
FIGURE 56 LEVEL 1 ME MODULE
FIGURE 57 LEVEL 2 ME MODULE
FIGURE 58 THE “4X4 SAD TREE”
FIGURE 59 THE “8X8 SAD TREE”
FIGURE 60 THE REFERENCE CONTROL OF LEVEL 0
FIGURE 61 DATA FLOW DIRECTION OF LEVEL 0 ME (TIME‐SPACE REPRESENTATION)
FIGURE 62 THE PIPELINED SEARCH SCHEDULE OF LEVEL 0
FIGURE 63 THE SEARCH FLOW OF LEVEL 1
FIGURE 64 PARALLEL DATA REUSE IN LEVEL 1
FIGURE 65 THE REFERENCE CONTROL OF LEVEL 1
FIGURE 66 THE SEARCH FLOW OF LEVEL 2
FIGURE 67 THE REFERENCE CONTROL OF LEVEL 2
FIGURE 68 MEMORY ALLOCATION OF LEVEL 0
FIGURE 69 MEMORY ALLOCATION OF LEVEL 1
FIGURE 70 THE MEMORY ALLOCATION OF LEVEL 2
FIGURE 71 THE DATA REUSABILITY DEGREE IN DIFFERENT LEVEL
FIGURE 72 THE BLOCK DIAGRAM OF IME AND FME
FIGURE 73 PMRME PERFORMANCE OF 720P SEQUENCES
FIGURE 74 PMRME PERFORMANCE OF 1080P SEQUENCES
FIGURE 75 SKIP+PMRME PERFORMANCE OF 720P STOCKHOLM
FIGURE 76 SKIP+PMRME PERFORMANCE OF 720P PARK_RUN
FIGURE 77 SKIP+PMRME PERFORMANCE OF 720P SHIELDS
FIGURE 78 SKIP+PMRME PERFORMANCE OF 1080P STATION2
FIGURE 79 SKIP+PMRME PERFORMANCE OF 1080P RUSH_HOUR
FIGURE 80 SKIP+PMRME PERFORMANCE OF 1080P SUNFLOWER
FIGURE 81 THE CONCEPT OF BI‐DIRECTIONAL MOTION ESTIMATION
FIGURE 82 SYSTEM ARCHITECTURE OF BI‐DIRECTIONAL MOTION ESTIMATION
FIGURE 83 MEMORY SCHEDULE OF BI‐DIRECTION ME
FIGURE 84 CODING ORDER OF 720P SEQUENCES
FIGURE 85 THE RELATIONSHIP OF STRIPE OF CURRENT MB AND SEARCH RANGE IN LEVEL 1
FIGURE 86 THE REFRESHED REFERENCE DATA OF EVERY MB IN LEVEL 1
FIGURE 87 THE REFRESHED REFERENCE DATA OF EVERY MB IN LEVEL 2
FIGURE 88 BANDWIDTH OF PMRME LEVEL 0 (720P)
FIGURE 89 BANDWIDTH OF PMRME LEVEL 1 (720P)
FIGURE 90 BANDWIDTH OF PMRME LEVEL 2 (720P)
List of Tables
TABLE I THE MODE TYPE AND ITS BLOCK SIZE FOR H.264
TABLE II BOUNDARY DETERMINATION OF QP28
TABLE III THE 4X4‐BLOCK‐SAD‐THRESHOLD AND SPIKE‐THRESHOLD UNDER DIFFERENT QP
TABLE IV PERFORMANCE OF PRE‐SKIP DETECTION FOR LOW MOTION CIF SEQUENCES
TABLE V PERFORMANCE OF PRE‐SKIP DETECTION FOR MEDIUM MOTION CIF SEQUENCES
TABLE VI PERFORMANCE OF PRE‐SKIP DETECTION FOR HIGH MOTION CIF SEQUENCES
TABLE VII PERFORMANCE OF PRE‐SKIP DETECTION FOR 720P SEQUENCES
TABLE VIII THE AVERAGE CODING TIME (%) FOR CATEGORIZED CIF AND 720P SEQUENCES
TABLE IX THE HARDWARE COST OF THE SKIP DESIGN
TABLE X THE RELATIONSHIP OF CANDIDATES AND MOTION VECTORS
TABLE XI THE MODE FILTERING PERFORMANCE FOR QCIF SEQUENCE
TABLE XII THE MODE FILTERING PERFORMANCE FOR CIF SEQUENCE
TABLE XIII THE MODE FILTERING PERFORMANCE FOR 720P SEQUENCE
TABLE XIV MF+RQME PERFORMANCE FOR CIF SEQUENCES
TABLE XV SKIP+MF+RQME PERFORMANCE FOR QCIF SEQUENCES
TABLE XVI SKIP+MF+RQME PERFORMANCE FOR CIF SEQUENCES
TABLE XVII RQME COMPARISONS WITH PREVIOUS WORKS
TABLE XVIII MEMORY AND BANDWIDTH FOR DIFFERENT FRAME SIZE
TABLE XIX PMRME PERFORMANCE FOR 720P SEQUENCES
TABLE XX PMRME PERFORMANCE FOR 1080P SEQUENCES
TABLE XXI SKIP+PMRME PERFORMANCE FOR 720P SEQUENCES
TABLE XXII SKIP+PMRME PERFORMANCE FOR 1080P SEQUENCES
TABLE XXIII THE PMRME HARDWARE COST COMPARISON
TABLE XXIV AVERAGE BANDWIDTH PER MB
1. Introduction
1.1. Motivation
Emerging multimedia technologies such as digital television, mobile phones, and DVD play indispensable roles in our daily lives. These products have become our main way to acquire information, to communicate with each other, and to entertain ourselves. However, raw multimedia data is too large to transmit or store without effective compression. Therefore, how to compress the data effectively has become an important topic in multimedia research. In short, video compression transforms video signals while trying to maintain the original quality under constraints such as storage, real-time operation, or computation power. It must effectively exploit the redundancy within and between frames to reduce the data rate with minimum loss of video quality. Thus, the design of a data compression system normally involves a tradeoff among quality, speed, resource utilization, and power consumption.
Compression of a video scene removes spatial, temporal, and statistical redundancy within and between frames. Discarding information is possible mainly because the human eye and brain (the Human Visual System) are more sensitive to lower frequencies, which allows us to reduce the higher-frequency information and so decrease the total bit rate. Thus, by removing different types of redundancy, it is possible to compress the data significantly at the expense of a certain amount of information loss (distortion), and further compression can be achieved by entropy coding the processed data.
Among these compression techniques, motion estimation and motion compensation are widely used to reduce the temporal redundancy in video content. They offer an efficient and practical way to predict the current frame from adjacent frames at the cost of only a few bits; however, motion estimation accounts for a very large share of the computational complexity of the whole encoding process. Moreover, in the recent H.264/AVC standard, variable block size motion estimation (VBSME), consisting of integer ME (IME) and fractional ME (FME), is adopted to fit the different levels of detail in video sequences. As a result, the long coding time and large power consumption of motion estimation have become the main problems to be solved in the encoder.
1.2. Contribution of the Thesis
Figure 1 shows the main contributions of this thesis within the basic flow of motion estimation. In the mode decision phase, we detect the skip mode and apply mode filtering to reduce the complexity of the mode combinations. Afterward, we estimate the motion vectors in the integer motion estimation phase; the further refinement step for fractional motion vectors is not covered in this thesis. This thesis presents a number of integer motion estimation algorithms and their architectures for variable block size motion estimation. The following novel contributions result from this work.
Figure 1 The contribution of the thesis
• Adaptive skip mode detection: we propose a low‐cost adaptive skip detection algorithm and its hardware architecture to save the computation power spent on static macroblocks. The hardware cost is 0.63K gates plus a small memory for a look‐up table. In some low‐motion, highly quantized sequences, our method can accurately skip 82.39% of the macroblocks.
• Mode Filtering (MF): a fast algorithm that speeds up the overall motion estimation process by reducing the number of fractional motion estimation modes. With different mechanisms, the algorithm can be applied to 2‐D array or 1‐D array designs. With this algorithm, the hardware implementation can be pipelined more efficiently with only a slight performance loss.
• Refined Quarter Motion Estimation (RQME): it uses MF to reduce the computational load of fractional motion estimation, and the quarter pixel method to enable four‐fold parallel processing with low computational complexity and low quality loss. In addition, the proposed hardware architecture needs only half the number of processing elements and has lower latency than general 2‐D architectures. This design is well suited to portable devices, which are characterized by low cost and small frame sizes.
• Parallel Multi Resolution Motion Estimation (PMRME): the method applies parallel multi‐resolution motion estimation, MF, and bit truncation to support a large search range of [‐128, 127] within 256 cycles for P‐frame (one‐direction) motion estimation. Because of the fast searching mechanism, the design can also support B‐frame (bi‐directional) motion estimation in only 512 cycles. In addition, this design saves at least 91.91% of the memory buffer and 55.1% of the bandwidth. The resulting hardware also saves up to 48.9% of the area cost and 62.1% of the memory cost compared with the previous approach for 1080p processing. With these features, the proposed design is suitable for applications with a larger search range, such as HDTV.
• A 1080p high profile encoder chip for H.264: we integrate our design into a high‐performance H.264 high profile encoder that supports 1080p resolution at 145 MHz with a smaller area. In this integration, we optimize the algorithm and architecture of the motion estimation component as well as the memory organization and pipeline schedule of the whole design to achieve high throughput and low hardware cost.
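As a rough plausibility check of the cycle counts quoted in the list above, the following sketch estimates how many clock cycles are available per macroblock. It assumes 30 frames per second (the rate stated in the Chinese abstract), a coded height padded from 1080 to 1088 lines, and that the 256-cycle and 512-cycle figures are per macroblock. These pipeline details are assumptions made for illustration, not statements taken from this chapter.

```python
# Rough per-macroblock cycle budget for 1080p encoding (illustrative assumptions only).
CLOCK_HZ = 145_000_000      # clock rate quoted for the encoder chip
FPS = 30                    # assumed frame rate
WIDTH, HEIGHT = 1920, 1088  # assumed coded size: 1080 lines padded to a multiple of 16
MB_SIZE = 16                # a macroblock is 16x16 pixels

macroblocks_per_frame = (WIDTH // MB_SIZE) * (HEIGHT // MB_SIZE)
macroblocks_per_second = macroblocks_per_frame * FPS
cycles_per_macroblock = CLOCK_HZ // macroblocks_per_second

print(macroblocks_per_frame, cycles_per_macroblock)
```

Under these assumptions the budget is 592 cycles per macroblock, so both a 256-cycle and a 512-cycle search would fit within it.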
1.3. Organization of the Thesis
The main theme of the thesis is the study of different implementation methods for motion estimation in the H.264/AVC standard [1]. In chapter 2, we briefly introduce the background and basic tools of the H.264/AVC standard. In chapter 3, we give an overview of the basic concepts and several algorithms of motion estimation. In chapter 4, we propose an adaptive skip mode detection method to lower the computation overhead for static macroblocks. In chapter 5, we present the mode filtering algorithm, which simplifies the mode combinations and enhances the throughput of motion estimation, together with its performance. In chapter 6, we present the refined quarter motion estimation design for low‐cost, small frame size applications of H.264. In chapter 7, we propose a large search range PMRME design for large frame size applications. Then, in chapter 8, we integrate the methods mentioned above to implement an encoder that supports the 1080p high profile of H.264/AVC. Finally, a conclusion is given in chapter 9.
2. Overview of H.264/AVC Standard
2.1. Overview
Image and video compression has been a very active field of research and development for over twenty years, and many different systems and algorithms for compression and decompression have been proposed and developed. To enable interoperability, industrial competition, and wide adoption, it is necessary to define standard decoding methods so that products from different manufacturers can communicate with each other effectively. The standardization process has therefore contributed to the prevalence of broadcast television and home entertainment today. The ISO (International Organization for Standardization) MPEG4 standard is enabling a new generation of internet‐based video applications, while the ITU‐T (International Telecommunication Union, Telecommunication Standardization Sector) H.263 standard for video compression is now widely used in videoconferencing systems.
MPEG4 and H.263 are standards based on video compression technology from around 1995. The two groups responsible for these standards, the Moving Picture Experts Group (MPEG) and the Video Coding Experts Group (VCEG), jointly developed a new standard that significantly outperforms MPEG4 and H.263. It provides better compression of video images by properly adopting a variety of tools to support high‐quality, low bit rate streaming video. On the VCEG side, after finalizing the original H.263 standard in 1995, VCEG started work on two further development areas: a short‐term effort to add extra features to H.263 and a long‐term effort to develop a new standard for low bit rate visual communications. The long‐term effort led to the draft “H.26L” standard, which offered significantly better video compression efficiency than previous ITU‐T standards. In 2001, MPEG recognized the potential benefits of H.26L, and the Joint Video Team (JVT) was formed, including experts from MPEG and VCEG. The JVT’s main task was to develop the draft H.26L model into a full international standard. The outcome is two identical standards: ISO MPEG4 Part 10 and ITU‐T H.264. The official title of the new standard is Advanced Video Coding (AVC); however, it is widely known by its old working title, H.26L, and by its ITU document number, H.264 [1].
H.264 consists of numerous tools. Compared with prior video coding standards, it employs many important new techniques that bring significant improvements in coding performance. Details of these techniques can be found in [2]. Here, we briefly introduce the basic concepts of these tools, many of which have existed for some time but are finely tuned and well integrated in H.264 to form a good compression scheme.
2.2. Coding Structure
In common with earlier standards, the H.264 standard does not explicitly define a CODEC (encoder/decoder pair). Instead, the standard defines the syntax of an encoded video bit stream together with the method of decoding it. In practice, a compliant encoder and decoder are likely to include the functional elements shown in Figure 2 and Figure 3. Although the functions shown in these figures are likely to be necessary for compliance, there is room for considerable variation in the structure of the CODEC. The figures also show that the decoder is effectively a part of the encoder.
In general, most of the video coding systems are based on the motion estimation and motion compensation mechanism along with some other tools to reduce the neighboring frame redundancy. The basic functional elements (prediction, transform, quantization, entropy encoding) are little different from previous standards (MPEG1, MPEG2, MPEG4, H.261, H.263, etc.).
Figure 2 The basic structure of the encoder
[Figure labels: Reference Frame, Reconstructed Frame, Motion Compensation, Intra Prediction, Inverse Discrete Cosine Transform, Inverse Quantization, Reorder, Entropy Coding, Filter, Inter, Intra]
Figure 3 The basic structure of the decoder
2.3. Intra Prediction
If a block or macroblock is encoded in intra mode, a prediction block is formed based on previously encoded and reconstructed blocks. This prediction block is subtracted from the current block prior to encoding. In H.264 [1], the prediction for the luminance (luma) component may be formed for each 4x4 subblock or for a 16x16 macroblock. There are a total of 9 optional prediction modes for each 4x4 luma block, 4 optional modes for a 16x16 luma block, and one mode that is always applied to each 4x4 chrominance (chroma) block.
2.4. Inter Prediction
Inter prediction creates a prediction model from one or more previously encoded video frames. The model is formed by shifting samples in the reference frame(s) (motion compensated prediction). The AVC CODEC uses block‐based motion compensation, the same principle adopted by every major coding standard since H.261. Important differences from earlier standards include support for a range of block sizes (down to 4x4) and fine sub‐pixel motion vectors (1/4 pixel in the luma component).
2.5. In‐loop Filter
In H.264, a filter is applied to every decoded macroblock in order to reduce blocking distortion caused by block‐based transformation. In the encoder, the deblocking filter is applied after the inverse transform and before reconstructing and storing the macroblock for future predictions. In the decoder, it is applied before reconstructing and displaying the macroblock. The filter has two benefits: in the first place, block edges are smoothed, improving the appearance of decoded images, especially at higher compression ratios. In the second place, the filtered macroblock is used for motion‐compensated prediction of further frames in the encoder, resulting in a smaller residual after prediction.
2.6. Context‐based Adaptive Binary Arithmetic Coding (CABAC)
An arithmetic coding system is used to encode and decode H.264 syntax elements. The arithmetic coding scheme selected for H.264, context‐based adaptive binary arithmetic coding (CABAC), achieves good compression performance for two reasons: first, it selects a probability model for each syntax element according to the element’s context; second, it adapts the probability estimates to local statistics before they are used by the arithmetic coder.
3. Overview of Block Matching Motion Estimation
For video compression, consecutive frames in a video sequence can be regarded as a set of objects displaced from frame to frame. If the motion trajectory of every object in the current frame could be predicted from the previous frame, we would only have to record and transmit the trajectory information to the decoder. In this way, we encode the trajectory of the object instead of its pixel information, which greatly reduces the number of bits required for the video sequence. The trajectory of the object, called the motion vector (MV), is needed by the decoder to perform motion compensation and reconstruct the frame. The process of determining the motion vector in the encoder is called motion estimation (ME), and the maximum magnitude of the motion vector is bounded by the search range.
3.1. Block‐based Motion Estimation
There are several ways to do motion estimation. One is object‐based motion estimation, which detects the outline of the object first and then estimates its motion vector [3]. The other is block‐based motion estimation, the most widely used method for video coding, since pictures are normally rectangular and can easily be divided into blocks. Moreover, the block‐based method significantly reduces complexity compared with the object‐based method.
H.264 [1] defines the standard block sizes for motion estimation. As illustrated in Figure 4, one frame consists of several macroblocks (MB), each a square of 16x16 pixels. A macroblock can be divided into four 8x8 blocks, and each 8x8 block can be further divided into four 4x4 blocks. The standard defines several block types (modes) and their corresponding block sizes for motion estimation, as listed in TABLE I and Figure 5.
[Figure labels: frame, MB (16x16), 8x8 blocks, 4x4 blocks]
Figure 4 The hierarchy of a macroblock

TABLE I The mode type and its block size for H.264
Mode      Block size
Mode 1    16x16
Mode 2    16x8
Mode 3    8x16
Mode 4    8x8
Mode 5    8x4
Mode 6    4x8
Mode 7    4x4

Figure 5 Different modes and their block sizes
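As a quick check on Table I, the short sketch below counts how many blocks of each mode fit into one 16x16 macroblock. If every block carries its own motion vector, the total over all seven modes is 41, which agrees with the number of motion vectors quoted for mode selection in Section 3.2. The names `MODES` and `blocks_per_macroblock` are illustrative and do not come from the thesis.

```python
# Block sizes (width, height) of the seven inter modes listed in Table I.
MODES = {
    1: (16, 16),
    2: (16, 8),
    3: (8, 16),
    4: (8, 8),
    5: (8, 4),
    6: (4, 8),
    7: (4, 4),
}
MB_SIZE = 16  # a macroblock is 16x16 pixels


def blocks_per_macroblock(width, height):
    """Number of width x height blocks that tile one macroblock."""
    return (MB_SIZE // width) * (MB_SIZE // height)


counts = {mode: blocks_per_macroblock(w, h) for mode, (w, h) in MODES.items()}
total_blocks = sum(counts.values())  # one motion vector per block
print(counts, total_blocks)
```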
The goal of motion estimation is to accurately find the motion vector within the search range of the previous frame. However, not only the MV but also the block size determines the quality and accuracy of the prediction. Figure 6 shows the relation among the motion vector, the search range, and the distribution of block sizes within a picture. Detailed regions are associated with small blocks, whereas large uniform regions are associated with large blocks. Hence, a macroblock can be composed of blocks of variable size, and this method is known as variable block size motion estimation (VBSME).
[Figure labels: Reference frame, Current frame, Search range, MB, (dx, dy), Rx, Ry, N, Ik(x,y), Ik-1(x,y)]
Figure 6 The motion vector and the search range
3.2. The Matching Criteria
Block‐based motion estimation obtains the best match by minimizing a cost function. Although there are several cost functions [4], the most commonly used criterion is the sum of absolute differences (SAD) because of its low complexity, good performance, and ease of hardware implementation. The cost function is defined as:

$$\mathrm{SAD}(dx, dy) = \sum_{m=1}^{N} \sum_{n=1}^{N} \left| I_k(m, n) - I_{k-1}(m + dx,\, n + dy) \right| + \lambda \cdot R\big((dx, dy) - \mathrm{MVP}\big)$$

$$\mathrm{MV} = (\mathrm{MV}_x, \mathrm{MV}_y) = \arg\min_{(dx, dy) \in A} \mathrm{SAD}(dx, dy)$$

where
$I_k(x, y)$: pixel intensity at location (x, y) in the k‐th frame (current frame)
$I_{k-1}(x, y)$: pixel intensity at location (x, y) in the (k‐1)‐th frame (reference frame)
$\lambda$: Lagrangian multiplier
MVP: the predicted motion vector
R: number of bits to code the motion vector difference (MVD) = (dx, dy) ‐ MVP
A: the region of the search range
MV: motion vector
The first term of the function is the residual cost of the search point, and the second term is the cost of the motion vector difference (MVD). The Lagrangian multiplier controls the weight of the motion vector cost. Therefore, when the Lagrangian multiplier is larger, motion estimation is prone to choose larger block types because fewer motion vectors are needed, and vice versa. Designers need to balance these two costs. With the introduction of variable block size motion estimation in H.264/AVC, one macroblock can produce more than one motion vector because of the different block types. In H.264, 41 motion vectors and their corresponding costs must be produced for each macroblock in order to choose the best combination; this is known as mode selection.
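The two terms of the cost function can be sketched in a few lines of Python. This is a minimal illustration, not the thesis implementation: frames are plain lists of rows, and the rate term `mvd_bits` is a hypothetical model that uses the length of a signed Exp-Golomb code for each MVD component, which is only an assumption about how R might be estimated.

```python
def sad(cur, ref, x0, y0, dx, dy, n):
    """Residual term: SAD between the n x n block of `cur` at (x0, y0)
    and the block of `ref` displaced by (dx, dy)."""
    total = 0
    for row in range(n):
        for col in range(n):
            total += abs(cur[y0 + row][x0 + col] - ref[y0 + dy + row][x0 + dx + col])
    return total


def mvd_bits(mv, mvp):
    """Hypothetical rate model R: signed Exp-Golomb code length per MVD component."""
    bits = 0
    for component, predicted in zip(mv, mvp):
        v = component - predicted
        code_num = 2 * abs(v) - (1 if v > 0 else 0)  # map signed value to unsigned index
        bits += 2 * (code_num + 1).bit_length() - 1  # Exp-Golomb code length
    return bits


def cost(cur, ref, x0, y0, mv, mvp, n, lam):
    """Lagrangian cost: residual SAD plus lambda times the MVD rate."""
    return sad(cur, ref, x0, y0, mv[0], mv[1], n) + lam * mvd_bits(mv, mvp)
```

With a larger `lam`, candidates far from the predictor `mvp` are penalized more heavily, which is the balance between residual cost and motion vector cost described above.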
3.3. Quality Judgment
The quality of a video sequence can be evaluated with both objective and subjective approaches. The most widely used objective measure is the peak signal‐to‐noise ratio (PSNR), which is defined as:

$$\mathrm{PSNR} = 10 \log_{10} \frac{255^2}{\mathrm{MSE}}$$

MSE: the mean square error between the decoded frame and the original frame
The peak value is 255 since each pixel is 8 bits deep (0 to 255). A higher PSNR means higher video quality. Although PSNR represents coding quality objectively, it is not the same as subjective quality, which is determined by a number of human testers whose opinions are used to draw a conclusion. In some cases a high PSNR corresponds to low subjective quality. In most cases, however, PSNR provides a good approximation of the subjective measure, and we use it in the rest of the thesis.
PSNR and bit rate are usually in conflict. According to rate‐distortion theory, a lower bit rate is generally accompanied by lower quality (larger distortion), and vice versa.
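The PSNR definition above translates directly into code. The sketch below assumes 8‐bit frames stored as lists of rows; the function names are illustrative.

```python
import math


def mse(original, decoded):
    """Mean square error between two frames of equal size."""
    error, count = 0, 0
    for row_o, row_d in zip(original, decoded):
        for o, d in zip(row_o, row_d):
            error += (o - d) ** 2
            count += 1
    return error / count


def psnr(original, decoded, peak=255):
    """PSNR in dB; returns infinity when the frames are identical."""
    e = mse(original, decoded)
    if e == 0:
        return math.inf
    return 10.0 * math.log10(peak * peak / e)
```

For example, two 2x2 frames that differ by one gray level in two of their four pixels have an MSE of 0.5, so `psnr` returns 10·log10(255² / 0.5), roughly 51.1 dB.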
3.4. Review of Motion Estimation Algorithm
3.4.1. Full Search Algorithm (FSA)
The most accurate strategy for finding the best motion vector is clearly the full search algorithm (FSA), which exhaustively evaluates all possible search points within a predetermined search range. Although this method carries a heavy computational burden, its regular search flow makes it very suitable for hardware implementation. The data of the search range can be fully reused, which greatly reduces the memory access required during motion estimation. In addition, the computation of FSA can be decreased by using some technique to predict skipped macroblocks.
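The regular search flow of FSA can be seen in the following sketch, which visits every displacement in a square window in raster order and keeps the one with the smallest SAD. It is a software illustration only, using plain SAD without the Lagrangian term; the function names and the frame representation (lists of rows) are assumptions, not part of the proposed hardware.

```python
def block_sad(cur, ref, x0, y0, dx, dy, n):
    """SAD between the n x n current block at (x0, y0) and the reference
    block at (x0 + dx, y0 + dy)."""
    return sum(
        abs(cur[y0 + r][x0 + c] - ref[y0 + dy + r][x0 + dx + c])
        for r in range(n)
        for c in range(n)
    )


def full_search(cur, ref, x0, y0, n, search):
    """Test every displacement in [-search, search] x [-search, search].

    Returns (best_mv, best_sad, points), where points is the number of
    search points that were actually evaluated.
    """
    height, width = len(ref), len(ref[0])
    best_mv, best_sad, points = None, None, 0
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            inside = (0 <= x0 + dx <= width - n) and (0 <= y0 + dy <= height - n)
            if not inside:
                continue  # skip candidates that fall outside the reference frame
            points += 1
            s = block_sad(cur, ref, x0, y0, dx, dy, n)
            if best_sad is None or s < best_sad:
                best_mv, best_sad = (dx, dy), s
    return best_mv, best_sad, points
```

Because neighboring candidates overlap in all but one row or column of reference pixels, the loop order above is what allows a hardware design to reuse most of the search-range data between consecutive search points.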
3.4.2. Three Steps Search Algorithm (3SS)
Figure 7 the search steps of three steps search algorithm
A representative fast search algorithm is the three‐step search (3SS) algorithm [5]. 3SS is widely used because of its good performance and simplicity. It assumes that the matching criterion increases monotonically around the location of the optimal motion vector, and it determines that location iteratively. Figure 7 shows an example to illustrate the 3SS algorithm. For a search range of 16x16, 3SS requires 25 (9+8+8) search points per macroblock, a speedup of 9 compared with the 225 search points per macroblock of FSA. However, the main disadvantage of 3SS is that it is inefficient at estimating small motions, since the points forming the search pattern in the first step are positioned uniformly at a relatively large distance around the center of the search window, whereas most real‐world motion has a center‐biased motion vector distribution [6]. In addition, from the hardware design point of view, the branch conditions of 3SS cause bubbles in a pipelined hardware design and lower the average throughput. Therefore, we recommend using FSA in hardware design to enhance performance. On the other hand, to reduce the computation power, which is also a main concern in hardware design, we use a hardware‐oriented skip algorithm to predict the skip mode, lowering the computation power while maintaining high quality.
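The 9+8+8 search point count can be checked with a small software model of 3SS. The sketch below assumes step sizes of 4, 2 and 1 and a list‐of‐rows frame layout; the names `sad` and `three_step_search` are hypothetical and the code is not taken from [5].

```python
def sad(cur, ref, bx, by, dx, dy, n):
    """Sum of absolute differences for the block at (bx, by) displaced by (dx, dy)."""
    return sum(abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
               for y in range(n) for x in range(n))


def three_step_search(cur, ref, bx, by, n, first_step=4):
    """Return (mv, cost, evaluated_points) using step sizes 4, 2, 1."""
    height, width = len(ref), len(ref[0])
    center = (0, 0)
    best_cost = sad(cur, ref, bx, by, 0, 0, n)
    evaluated = 1
    step = first_step
    while step >= 1:
        best_mv = center
        for oy in (-step, 0, step):
            for ox in (-step, 0, step):
                if ox == 0 and oy == 0:
                    continue  # the center cost is already known
                dx, dy = center[0] + ox, center[1] + oy
                if not (0 <= by + dy <= height - n and 0 <= bx + dx <= width - n):
                    continue  # assumption: skip candidates outside the frame
                cost = sad(cur, ref, bx, by, dx, dy, n)
                evaluated += 1
                if cost < best_cost:
                    best_mv, best_cost = (dx, dy), cost
        # The next step depends on this result: a data-dependent branch.
        center = best_mv
        step //= 2
    return center, best_cost, evaluated
```

Each step can only start once the previous step has selected its center, which is the dependency that stalls a pipelined implementation.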
3.4.3. Quarter Pixel Motion Estimation (QME)
Figure 8 the QME algorithm
The Quarter Motion Estimation (QME) [7] is an algorithm from our previous design. As in Figure 8, the SAD is the sum of absolute differences mentioned earlier, whereas the QSAD is the quarter pixel absolute difference computed over the black dots in the figure. In [7], the algorithm is divided into two stages. The first stage performs a full search with QSAD as the matching criterion and keeps the U candidates that have the smallest QSAD values. The second stage calculates the SAD values for these U candidates and selects the candidate with the smallest SAD as the final result.
This is a quite efficient algorithm; however, the design does not support VBSME. Therefore, based on this algorithm, we make several modifications and propose a refined quarter motion estimation that supports VBSME.
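The two‐stage structure can be sketched as follows. Note the assumptions: the exact QSAD sampling pattern of [7] is defined by Figure 8 and is not reproduced here, so the sketch stands in for it with a cost taken over every second pixel in each direction (one quarter of the pixels). The names `block_cost`, `two_stage_search` and `keep` (the U of the text) are hypothetical.

```python
import heapq


def block_cost(cur, ref, bx, by, dx, dy, n, stride):
    """Absolute-difference sum over every `stride`-th pixel in x and y.
    stride=1 is the full SAD; stride=2 uses a quarter of the pixels
    (an assumed stand-in for the QSAD pattern of Figure 8)."""
    return sum(abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
               for y in range(0, n, stride) for x in range(0, n, stride))


def two_stage_search(cur, ref, bx, by, n, search_range, keep):
    """Stage 1: full search with the cheap cost, keeping `keep` candidates.
    Stage 2: full SAD on the survivors only. Returns ((dx, dy), sad)."""
    height, width = len(ref), len(ref[0])
    stage1 = []
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            if 0 <= by + dy <= height - n and 0 <= bx + dx <= width - n:
                stage1.append((block_cost(cur, ref, bx, by, dx, dy, n, 2), dx, dy))
    # Assumes at least one candidate lies inside the frame.
    candidates = heapq.nsmallest(keep, stage1)
    cost, dx, dy = min((block_cost(cur, ref, bx, by, dx, dy, n, 1), dx, dy)
                       for _, dx, dy in candidates)
    return (dx, dy), cost
```

Only `keep` full SADs are computed per block, so most of the search runs on a quarter of the pixel data.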
3.4.4. Multi Resolution Motion Estimation (MRME)
A conventional MRME [8] is shown in Figure 9. It uses three hierarchical levels for the search and refines the motion vector from the coarsest level to the finest level. First, it searches for the two minimum‐cost motion vectors in level 2, which has the coarsest resolution. Second, it refines these two motion vectors, along with the predicted motion vector (MVP), in level 1 and selects the one with the minimum cost. Finally, it further refines this motion vector to get the final result.
Figure 9 the conventional multi resolution algorithm
This seems to be an effective way to reduce the search time; however, the approach has several disadvantages for hardware implementation. First, the motion vector found in the higher level needs to be further refined in the lower level. The search is therefore a sequential process, which increases the cycle count and decreases the hardware utilization and throughput. Second, a buffer the size of the full search range is still needed because of the dependency between the three hierarchical levels, which greatly increases the hardware cost if the algorithm is adopted directly for a large search range design. Last but not least, the required bandwidth is still quite large because of poor data reuse in the refinement process.
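The sequential dependency between the levels is easy to see in a software model of the conventional three‐level flow. The sketch below is a loose illustration under several assumptions that are not taken from [8]: each level halves the resolution by dropping every other pixel, the refinement radius is one pixel, and the block size and position are multiples of 4. The names `downsample`, `search` and `mrme` are hypothetical.

```python
def downsample(frame):
    """Halve the resolution by keeping every other pixel (assumed scheme)."""
    return [row[::2] for row in frame[::2]]


def sad(cur, ref, bx, by, dx, dy, n):
    """SAD of the displaced block, or None if it leaves the reference frame."""
    if not (0 <= by + dy <= len(ref) - n and 0 <= bx + dx <= len(ref[0]) - n):
        return None
    return sum(abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
               for y in range(n) for x in range(n))


def search(cur, ref, bx, by, n, centers, radius):
    """Sorted (cost, dx, dy) list for all displacements within `radius` of any center."""
    seen, results = set(), []
    for cx, cy in centers:
        for dy in range(cy - radius, cy + radius + 1):
            for dx in range(cx - radius, cx + radius + 1):
                if (dx, dy) in seen:
                    continue
                seen.add((dx, dy))
                cost = sad(cur, ref, bx, by, dx, dy, n)
                if cost is not None:
                    results.append((cost, dx, dy))
    return sorted(results)


def mrme(cur, ref, bx, by, n, search_range, mvp=(0, 0)):
    """Three-level search; each level waits for the result of the level before it."""
    cur1, ref1 = downsample(cur), downsample(ref)
    cur2, ref2 = downsample(cur1), downsample(ref1)
    # Level 2 (coarsest): keep the two lowest-cost motion vectors.
    top2 = search(cur2, ref2, bx // 4, by // 4, n // 4, [(0, 0)], search_range // 4)[:2]
    # Level 1: refine both candidates and the scaled MVP, keep the best.
    centers = [(2 * dx, 2 * dy) for _, dx, dy in top2] + [(mvp[0] // 2, mvp[1] // 2)]
    _, dx1, dy1 = search(cur1, ref1, bx // 2, by // 2, n // 2, centers, 1)[0]
    # Level 0 (finest): final refinement around the scaled level-1 result.
    cost, dx0, dy0 = search(cur, ref, bx, by, n, [(2 * dx1, 2 * dy1)], 1)[0]
    return (dx0, dy0), cost
```

Level 1 cannot start before level 2 has produced `top2`, and level 0 cannot start before level 1 has finished, which corresponds to the first disadvantage listed above.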
In this thesis, we propose a highly parallelized MRME that not only solves these problems but also supports large search range motion estimation to meet the demanding requirements of HDTV applications.
There are many other fast search algorithms, such as cross search (CS) [9], one‐dimensional gradient descent search (1DGDS) [10], the block‐based gradient descent search (BBGDS) [11], the four‐step search (4SS) [12], the diamond search (DS) [13], the cross‐diamond search (CDS) [14] and the hexagon‐based search (HEXBS) [15]; however, most of them share several drawbacks. On the one hand, they may be trapped at a local minimum search point and fail to find the best motion vector. On the other hand, from the hardware design point of view, their irregular search flows decrease the hardware utilization and throughput. Last but not least, these irregular search flows result in low data reusability and raise the required memory bandwidth.
3.5. Review of Skip Mode Detection
In MPEG‐4 AVC/H.264 video coding, IME and FME contribute greatly to coding efficiency through new techniques such as variable block size and the six‐tap interpolation filter. However, these complex techniques make ME dominate the computational load and power of the whole encoding process, up to 96% [16]. The most efficient way to lower the complexity and power of ME is to skip the MB encoding entirely and simply denote the MB with skip mode when the encoding situation allows it. Therefore, if we can predict the skip mode before ME, we can skip the whole coding stage and save the encoding power of these skipped MBs. In H.264/AVC, an MB is skipped without encoding the motion vector and residuals, and is simply denoted as skip mode, if all of the following conditions are met:
1. The chosen block type is 16x16.
2. The best motion vector equals the predicted motion vector (MVP).
3. The chosen reference frame is the previous frame.
4. All coefficients are zero after transform and quantization.
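The four conditions can be written as a single predicate. The sketch below is illustrative only; the `MbResult` record, its field names and the convention that reference index 0 means the previous frame are assumptions, not part of the reference software.

```python
from dataclasses import dataclass


@dataclass
class MbResult:
    """Hypothetical summary of one macroblock after mode decision."""
    block_type: str   # chosen partition, e.g. "16x16"
    mv: tuple         # best motion vector (dx, dy)
    mvp: tuple        # predicted motion vector (dx, dy)
    ref_idx: int      # assumed convention: 0 means the previous frame
    coeffs: list      # quantized transform coefficients of the MB


def is_skip_mb(mb):
    """True when all four skip conditions listed in the text hold."""
    return (mb.block_type == "16x16"
            and mb.mv == mb.mvp
            and mb.ref_idx == 0
            and not any(mb.coeffs))
```

Every field of `MbResult` is only known after ME, transform and quantization have run, which is why a conventional encoder cannot use this test to save any of that work.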
3.5.1. Lagrangian Cost Motion Estimation
In [17], skip prediction is performed through Lagrangian cost estimation. The paper uses a Lagrangian rate‐distortion cost function that incorporates an adaptive model for the Lagrangian multiplier parameter based on local sequence statistics. However, the model is non‐linear and is therefore not suitable for efficient hardware implementation.
3.5.2. All Zero DCT Blocks Detection
In [18] and [19], the authors perform a comprehensive analysis of the dynamic properties of the DCT and quantization in H.264. They use several partial SADs in a 4x4 block to predict zero blocks under a variety of conditions. Although the prediction is quite precise, it is not suitable for hardware implementation because these partial SADs cannot be acquired efficiently and the algorithm has too many conditional branches.
Based on these problems, in this thesis we introduce a low cost adaptive skip mode detection that accurately predicts skipped MBs to save the computational complexity and energy of motion estimation. In addition, we propose a fast mode decision algorithm to decrease the computational load of fractional motion estimation, which would otherwise lower the pipeline efficiency of the hardware design.
4. Low Cost Hardware Friendly Adaptive Skip Mode Detection
In this chapter we present a low cost skip mode algorithm and its architecture, which work by counting the number of zero 4x4‐blocks in a macroblock (MB). With this simple zero block detection and an adaptive threshold, the proposed algorithm can pre‐skip 82.39% of total MB encoding and save 77.13% of coding time for low motion CIF sized sequences with QP=36. Compared with the reference software [20], we achieve similar quality because of the high accuracy of the skip mode detection. Due to the simplicity of this hardware friendly algorithm, the hardware cost is just 0.63K gates at a 100MHz clock frequency. With this algorithm, we can efficiently skip the power hungry motion estimation and intra prediction, and thus it can be applied to power constrained mobile devices.
4.1. Introduction
In MPEG‐4 AVC/H.264 video coding, many new complex techniques make ME dominate the computational load and power of the whole encoding process, up to 96% [16]. Thus, speedup with VLSI circuits or fast algorithms is necessary. Among these, the most efficient way is to skip the MB encoding entirely and simply denote the MB with skip mode when the encoding situation allows it. Therefore, if we can predict the skip mode before ME, we can skip the whole coding stage and save the encoding power of these skipped MBs. This situation occurs often in low motion sequences, and thus we can save over 80% of ME for these low motion MBs.
However, the conventional flow, as in the reference software JM9.0 [20], still performs the MB encoding in order to detect whether the MB can be skipped. In the reference software, the skip mode is decided after the FME finds the best motion vector and the transform and quantization are finished. All these conditions require encoding before a decision can be made, and thus power is wasted once the MB is skipped. Thus, if the skipped MB can be predicted before ME, the MB can go directly to entropy coding.
Several approaches have been proposed. In [17], an adaptive threshold is used for different types of MB. In [18], the relationship between the 4x4 integer discrete cosine transform (DCT) and partial sums of the SAD (sum of absolute differences) is used to predict zero blocks. In [19], a more conservative and thus more accurate threshold is derived to predict the SAD of one 4x4 zero block. However, the condition of the threshold is too conservative and thus could miss many opportunities to skip. In [21], the total SAD of one MB is used to predict a skip MB, but in our analysis this is not accurate because it misses many possible zero MBs. In [22], an open‐loop adjustable threshold is used, which is not robust enough to give consistent performance. Besides, most of these approaches are not regular and thus are not easy or efficient to implement in hardware.
To overcome the disadvantages mentioned above, we propose a low cost, hardware friendly closed‐loop algorithm and its architecture. The concept of our approach is that the probability of a zero MB depends strongly on the number of zero 4x4‐blocks it contains. Thus, we use a SAD threshold to detect a 4x4 zero block, and an adaptive threshold on the number of 4x4 zero blocks in an MB to decide a zero MB. Furthermore, we remove exceptional cases with a spike threshold. With this, we can achieve a higher detection rate with accurate prediction.
4.2. The Fast Skip Algorithm
The whole algorithm is illustrated in Figure 10. We first detect whether a 4x4‐block is zero by a 4x4‐block‐SAD‐threshold, and we count the number of zero 4x4‐blocks in an MB. If the number is larger than an adaptive MB‐zero‐block‐threshold, we denote this MB as a zero MB and skip its encoding. To prevent the SAD threshold from being affected by large local variations, we adopt a Spike‐threshold to remove such cases for more accurate detection.
Figure 10 the skip detection flow
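The detection flow can be modelled in software as below. This is a minimal sketch rather than the hardware described later; the function name `pre_skip` and its argument names are hypothetical, and the strict comparisons ("less than", "larger than") follow the wording of the text.

```python
def pre_skip(block_sads, sad_threshold, mb_zero_block_threshold, spike_threshold):
    """Decide whether an MB can be pre-skipped from its sixteen 4x4-block SADs.

    block_sads              -- the sixteen 4x4-block SADs measured at the MVP
    sad_threshold           -- 4x4-block-SAD-threshold for the current QP
    mb_zero_block_threshold -- adaptive threshold predicted from neighboring MBs
    spike_threshold         -- Spike-threshold for the current QP
    """
    # A 4x4 block counts as a zero block when its SAD is below the threshold.
    zero_blocks = sum(1 for s in block_sads if s < sad_threshold)
    # A single block with a very large SAD (a spike) vetoes the skip.
    has_spike = any(s > spike_threshold for s in block_sads)
    return zero_blocks > mb_zero_block_threshold and not has_spike
```

For example, with the QP28 values of 53 and 97 and a neighborhood threshold of 13, an MB with fourteen small SADs and two SADs of 60 is pre‐skipped, whereas the same MB with two SADs of 200 is not.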
4.2.1. 4x4‐block‐SAD‐threshold
This threshold is used to decide whether a 4x4‐block is zero. We determine it by analyzing the distribution of the 4x4‐block SADs that are higher than the must‐be‐zero‐block‐threshold [19] (we call it T0) but are still quantized to zero blocks in skipped MBs. We use five 100‐frame CIF sized test sequences to determine this threshold, as shown in TABLE II, in which “mean”, “variance” and “maxima” stand for the average, standard deviation and maximum values of these 4x4‐blocks whose SADs are higher than T0. The boundary of the 4x4‐block‐SAD‐threshold is the sum of the mean and the variance.
From this table, we find that on average about 85.9% of the 4x4‐block SADs in one skip MB are lower than the boundary. When the SAD of a 4x4‐block is less than the boundary, we consider the 4x4‐block a zero block. We choose the minimum boundary among the five sequences as the 4x4‐block‐SAD‐threshold to prevent large prediction errors. The 4x4‐block‐SAD‐threshold under different QPs is shown in TABLE III.
TABLE II boundary determination of QP28 (mean, variance, boundary and maxima of the 4x4‐block SAD distribution higher than T0 when QP is 28)
            akiyo  mother  foreman  football  silence
Mean           43      45       45        48       55
Variance       10       9       10        10       12
Boundary       53      54       55        58       67
Maxima        100      97      117        97      111
TABLE III the 4x4‐block‐SAD‐threshold and Spike‐threshold under different QP
                          QP20  QP24  QP28  QP32  QP36
4x4‐block‐SAD‐threshold     21    34    53    82   125
Spike‐threshold             36    63    97   160   231
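As a worked check, the QP28 column of TABLE III can be reproduced from the TABLE II statistics: the boundary of each sequence is its mean plus its variance entry, the 4x4‐block‐SAD‐threshold is the smallest boundary, and the Spike‐threshold (explained in Section 4.2.3) is the smallest maxima entry. The dictionary layout and the function name `derive_thresholds` are assumptions made for illustration.

```python
# TABLE II, QP28: sequence -> (mean, variance entry, maxima)
TABLE_II_QP28 = {
    "akiyo": (43, 10, 100),
    "mother": (45, 9, 97),
    "foreman": (45, 10, 117),
    "football": (48, 10, 97),
    "silence": (55, 12, 111),
}


def derive_thresholds(stats):
    """Return (boundaries, 4x4-block-SAD-threshold, Spike-threshold)."""
    boundaries = {name: mean + spread for name, (mean, spread, _) in stats.items()}
    sad_threshold = min(boundaries.values())
    spike_threshold = min(maximum for _, _, maximum in stats.values())
    return boundaries, sad_threshold, spike_threshold


boundaries, sad_threshold, spike_threshold = derive_thresholds(TABLE_II_QP28)
print(boundaries)       # per-sequence boundary row of TABLE II
print(sad_threshold)    # 53
print(spike_threshold)  # 97
```

Taking the minimum over the sequences is the conservative choice: a block is only called zero if it would be below the boundary for every test sequence.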
4.2.2. MB‐zero‐block‐threshold
In the reference software, an MB can be decided as a skip MB only when it has sixteen 4x4 zero blocks. However, we should consider the characteristics of skip MBs further. Skipped MBs have spatial and temporal correlations. For example, as in Figure 11, when the MB belongs to the background, such as the wall, it is likely that the neighboring MBs are also background due to spatial correlation. Therefore, we record the number of zero blocks as the MB‐zero‐block‐threshold if the MB is skipped; otherwise, we record the MB‐zero‐block‐threshold as 16.
Figure 11 the skipped MB in the sequence “table” in frames 171 and 172 (light‐colored blocks are skipped MBs)
For hardware pipeline considerations, the MB‐zero‐block‐threshold of the current MB (C) is the minimum of the thresholds of the upper (U), upper‐left (L) and upper‐right (R) MBs, as depicted in Figure 12. The use of (L) avoids the “read‐after‐write” data hazard in the pipelined hardware design.
Figure 12 the MB‐zero‐block‐threshold prediction C=min {U, L, R}.
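The threshold table and its update rule can be modelled as follows. This is a sketch under assumptions that the text does not specify: the table is a 2‐D list indexed by MB row and column, neighbors outside the frame are ignored, and an MB in the first row falls back to 16. The names `predict_threshold` and `write_back` are hypothetical.

```python
DEFAULT_THRESHOLD = 16  # value recorded for an MB that was not skipped


def predict_threshold(table, mb_x, mb_y):
    """C = min{U, L, R}: minimum threshold of the upper, upper-left and
    upper-right MBs. Missing neighbors are ignored (assumed border rule)."""
    if mb_y == 0:
        return DEFAULT_THRESHOLD
    row = table[mb_y - 1]
    neighbours = [row[x] for x in (mb_x - 1, mb_x, mb_x + 1) if 0 <= x < len(row)]
    return min(neighbours, default=DEFAULT_THRESHOLD)


def write_back(table, mb_x, mb_y, skipped, zero_blocks):
    """Record the zero-block count of a skipped MB, otherwise reset to 16."""
    table[mb_y][mb_x] = zero_blocks if skipped else DEFAULT_THRESHOLD
```

Because only the row above is read, the prediction for the current MB never depends on a table entry that the pipeline may still be writing for the MB processed just before it.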
4.2.3. Spike‐threshold
Due to its adaptive adjustment, the MB‐zero‐block‐threshold can be decreased by the neighboring MBs, for example to 13. However, this can lead to a case in which the number of zero blocks exceeds the threshold but the MB also contains blocks with large SAD values. In such a case, the MB should not be skipped but would be detected as skipped. For example, in Figure 11, there is a ping‐pong ball flying in from the bottom left corner in frame 172. For this case, the MB‐zero‐block‐threshold could be decreased to 13 while the MB has 14 zero blocks, which exceeds the threshold. However, it also has two 4x4‐blocks with large SADs. In this situation, the MB would be skipped if only the threshold were examined, which causes a prediction error. Thus, we must set up a threshold to detect whether there is any 4x4‐block with a large SAD; we call such a 4x4‐block a ‘spike’. The spike‐threshold is the minimum of the five ‘maxima’ values in TABLE II. When QP is equal to 28, the spike‐threshold is 97. The spike‐threshold under different QPs is listed in TABLE III.
4.3. The Fast Skip Architecture
Figure 13 the system architecture
Figure 13 shows the system architecture of our design. The gray block is the hardware added by our pre‐skip design. The skip detection module is at the beginning stage and is combined with the IME module. For each MB, the IME starts searching at the MVP and provides the sixteen 4x4‐block SADs to the skip detection module. The detection module then compares these SADs and the 4x4 zero block count with the 4x4‐block‐SAD‐threshold, MB‐zero‐block‐threshold and Spike‐threshold to decide whether the MB at the MVP is a skip MB. If it is, the skip detection module raises the “skip_flag” signal to the control logic, which aborts the remaining encoding steps. Finally, the entropy coding module encodes the MB as a skipped MB and writes back the number of zero blocks of the skipped MB. On the other hand, if the MB at the MVP is not detected as a skipped MB in the first stage, the IME continues searching for the best MV in the search range and goes through the normal procedure. In this case, the entropy coding module still has to write back to the MB‐zero‐block‐threshold table: if the MB is decided as a skipped MB in the entropy coding stage, it writes back the number of zero blocks; otherwise, it writes back the threshold as 16.
Figure 14 shows the skip detection module architecture. It consists of sixteen comparators that decide whether the SADs are bigger than the 4x4‐block‐SAD‐threshold and the spike‐threshold. It also includes one adder to sum up the number of zero blocks and one OR gate to decide whether the MB has a spike. In addition, another comparator is used to select the minimum MB‐zero‐block‐threshold among the neighboring MBs. Finally, an AND gate checks whether the MB is to be skipped.
Figure 14 the skip detect architecture (blocks: Comp 0 to Comp 15, ADDER, OR, comp, AND; inputs: SAD 0 to SAD 15, 4x4‐block‐SAD‐threshold, Spike‐threshold, MB‐zero‐block‐threshold; output: Skip_flag)
4.4. Experimental Result
The simulation environment is described below. First, we show the performance on various small frame size test sequences. Since the number of skipped MBs depends heavily on the sequence content, we roughly partition eight 300‐frame CIF sized test sequences into three categories (low motion, medium motion and high motion) for more accurate evaluation. The low motion sequences are ‘akiyo’, ‘mother_daughter’ and ‘news’; the medium motion sequences are ‘container’, ‘table’ and ‘foreman’; and the high motion sequences are ‘stefan’ and ‘mobile’. The algorithm is integrated into the reference software. Second, we show the performance on large frame size sequences, namely the 300‐frame 720p sized ‘Stockholm’ and ‘mobcal’.
The test conditions are: baseline profile, no rate distortion optimization, one reference frame, and a search range of 16 and 32 for CIF and 720p sized sequences, respectively. The average performance of pre‐skip detection is listed in TABLE IV to TABLE VII. With a high QP, our design saves more bit rate with some PSNR drop. The “skip hit rate” is the number of pre‐skip MBs divided by the number of skip MBs. All the following simulations are compared with the reference software.
TABLE IV performance of pre‐skip detection for low motion CIF sequences
        PSNR (dB)  Bit‐rate (%)  Skip hit rate (%)
QP20      -0.02        -1.79          92.10
QP24      -0.10        -1.56          93.44
QP28      -0.13        -2.86          94.82
QP32      -0.21        -5.70          94.53
QP36      -0.28        -8.65          96.56
TABLE V performance of pre‐skip detection for medium motion CIF sequences
        PSNR (dB)  Bit‐rate (%)  Skip hit rate (%)
QP20       0.00        -0.02          85.71
QP24      -0.09        -0.66          85.86
QP28      -0.13        -1.95          88.96
QP32      -0.21        -4.68          91.13
QP36      -0.36        -7.12          94.77
TABLE VI performance of pre‐skip detection for high motion CIF sequences
        PSNR (dB)  Bit‐rate (%)  Skip hit rate (%)
QP20       0.01         0.17          86.15
QP24       0.01         0.37          92.82
QP28       0.01         1.08          91.70
QP32       0.03         1.27          86.09
QP36      -0.02         0.01          86.93
TABLE VII performance of pre‐skip detection for 720p sequences
        PSNR (dB)  Bit‐rate (%)  Skip hit rate (%)
QP24       0.03         2.20          86.85
QP28       0.05         4.47          89.75
QP32      -0.06        -0.32          89.31
QP36      -0.20        -3.38          90.72
Figure 15 shows the RD curves of the low motion sequences. The curve of our design is almost the same as the original curve, and is sometimes slightly better under high QP because we skip more MBs than JM9.0. These additionally skipped MBs are close to meeting the skip conditions, so skipping them does not degrade performance significantly. Figure 17 shows the results for the medium motion sequences; our design has almost the same curve as the original. Figure 19 shows the results for the high motion sequences; because few MBs are skipped, the performance is almost the same as the original curve. Figure 21 shows the RD curves of the 720p sequences. There is a slight quality drop because the skip MBs of large, high definition sequences are much more difficult to predict correctly, and a wrongly predicted skip MB also causes a more serious quality drop. Figure 16, Figure 18 and Figure 20 show the average distribution of non‐skipped and skipped MBs for the different categories of CIF sized sequences, and the distributions for the 720p sequences are depicted in Figure 22. The skipped MBs fall into three types: bad skip (error prediction), miss skip (under skip) and correct skip. “Bad skip” means the MB is pre‐skipped but should not be skipped. “Miss skip” means the MB should be pre‐skipped but is not detected in the pre‐skip stage by our design. “Correct skip” means we accurately pre‐skip the MB. We can see that a higher QP skips more MBs, up to 82.39% for low motion and 65.88% for medium motion sequences on average. On the other hand, we find that bad skips do not degrade the performance much because these wrongly predicted MBs are close to meeting the skip conditions.
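The three skip categories and the skip hit rate can be computed from two per‐MB decision lists, one from the pre‐skip stage and one from a reference encoder run. The sketch below is illustrative: the function name `skip_statistics` is hypothetical, and the hit rate is computed here as correct pre‐skips divided by all reference skip MBs, which is one possible reading of the definition given in the text.

```python
def skip_statistics(pre_skipped, reference_skipped):
    """Count correct, bad and miss skips over parallel lists of booleans.

    pre_skipped[i]       -- True if MB i was pre-skipped by the detector
    reference_skipped[i] -- True if the reference encoder codes MB i as skip
    """
    correct = bad = miss = 0
    for pre, ref in zip(pre_skipped, reference_skipped):
        if pre and ref:
            correct += 1   # accurately pre-skipped
        elif pre:
            bad += 1       # pre-skipped but should not have been skipped
        elif ref:
            miss += 1      # should have been pre-skipped but was not detected
    reference_total = correct + miss
    # Assumed reading of "skip hit rate": share of reference skips that were caught.
    hit_rate = 100.0 * correct / reference_total if reference_total else 0.0
    return {"correct": correct, "bad": bad, "miss": miss, "hit_rate": hit_rate}
```

A miss skip only costs computation, while a bad skip can cost quality, so the two error types are worth tracking separately.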
Figure 15 the RD curve of low motion sequences (PSNR versus bit rate; akiyo, mother and news, original and pre‐skip)
Figure 16 the distribution of MB for low motion sequences
Figure 17 the RD curve of medium motion sequences (PSNR versus bit rate; container, table and foreman, original and pre‐skip)
Figure 18 the distribution of MB for medium motion sequences
Figure 19 the RD curve of high motion sequences (PSNR versus bit rate; stefan and mobile, original and pre‐skip)
Figure 20 the distribution of MB for high motion sequences
Figure 21 the RD curve of 720p sequences (PSNR versus bit rate)
Figure 22 the distribution of MB for 720p sequences