
中 華 大 學


中 華 大 學 碩 士 論 文

題目:以 MB 在時間上的關係為基礎用於 H.264 的快速 Inter 模式決策演算法

Fast Inter Mode Decision Algorithm Based on MB Temporal Correlation in H.264

系所別:資訊工程學系碩士班
學號姓名:M09302041 楊士賢
指導教授:鄭芳炫 教授

中華民國 九十六 年 七 月


摘要

在 H.264/AVC 的畫面間編碼(inter-coding)預測之中,多重區塊大小的選擇是一項重要的技術。標準中所制定的七種區塊大小將一個巨區塊(macroblock, MB)切割成幾個部份。當不同的物件包含在同一個巨區塊的時候,較小的區塊比較能夠達到更好的動態預測結果。然而,當所有的區塊大小都被選取為決定最佳區塊大小的判定時,這項技術將會造成巨大的運算量。本篇論文提出一個快速的 Inter 模式決策演算法來減少所需考慮的模式數量,使得運算時間也能夠同時下降。我們使用在前一張畫面同位置的巨區塊以及它的鄰居作為判斷的候選者。接著指定每種模式一個各別的分數來檢查是否有一條移動物件的邊緣經過這些候選者。實驗結果顯示我們的快速模式決策演算法可以在平均降低 0.05dB 的 PSNR 以及 2%的位元率增加之下,達到降低 31%-41%的整體編碼時間以及 41%-54%的動態預估時間。


致謝

首先要感謝指導老師鄭芳炫教授在這兩年以來的諄諄教誨、悉心指導,以及在研究的過程中給予寶貴的意見,使我獲得許多助益與成長,並且讓我能夠完成研究所的學業。謹在此致上最誠摯的感謝。

同時要特別感謝父母的栽培與關心,有了他們的鼓勵與支持,以及日常生活的資助,才能讓我無後顧之憂的專心於研究的工作。

最後要感謝的是在作研究的時光中,給予我不少幫助的學長鄭致傑、程泓翔、郭育銘;和我一起努力的同學張政暉、楊志強、陳弘裕、楊長慎,學弟呂長鴻、莊凱煒、王文長、莊明傑、徐文良,以及學妹梁韻卉;還有新進的學弟和隔壁實驗室的同學們。有了他們,才使我在這段人生中過得更加充實。

謹以此論文獻給我由衷感謝的親人、朋友及同學們!


目錄

摘要 ... I
致謝 ... II
目錄 ... III
附圖目錄 ... V

第 1 章 序論... 1

1.1 視訊編碼層的回顧... 1

1.2 在H.264 中主要的新技術... 2

1.2.1 多重區塊大小的動態預估技術... 2

1.2.2 多畫面參考... 2

1.2.3 畫面內預測... 3

1.3 動機與目的... 3

第 2 章 在H.264 中的模式決策流程與相關研究... 4

2.1 多重區塊大小的動態預估... 4

2.1.1 SAD與PDE... 4

2.1.2 螺旋搜尋順序與運動向量預測... 4

2.1.3 在H.264 中的成本計算方式... 5

2.2 位元率與失真度的最佳化... 5

2.3 相關研究... 6

第 3 章 提出的演算法... 7

第 4 章 實驗結果... 11

第 5 章 結論與未來展望... 12


5.1 結論 ... 12
5.2 未來展望 ... 12


附圖目錄

圖表 1-1 多種區塊大小與其對應的模式編號... 3

圖表 3-1 前一張畫面中同位置的巨區塊以及其鄰居... 8

圖表 3-2 每個候選者的編號... 8

圖表 3-3 四個方向性的巨區塊集合... 9

圖表 3-4 演算法的限制... 10


第 1 章 序論

視訊壓縮在影像的交流、傳輸以及儲存上扮演一個重要的角色。為了這個目的,更有效率的視訊壓縮標準不斷地被提出,以符合現今對視訊持續增加的需求。H.264 是由 ITU-T 的 Video Coding Expert Group (VCEG) 和 ISO/IEC 的 Moving Picture Expert Group (MPEG) 這兩個組織成立的 Joint Video Team (JVT) 所制定的最新視訊壓縮標準。它包含兩個部份,分別是視訊編碼層(Video Coding Layer)以及網路抽象層(Network Abstraction Layer)。前者用來提升視訊編碼效率,後者則提供一個合適的網路服務介面。

1.1 視訊編碼層的回顧

一個視訊影片包含一連串已壓縮的畫面。影片的第一張或是隨機存取點的畫面通常以畫面內預測(intra-prediction)的方式來做壓縮,而其他的畫面則使用畫面間預測(inter-prediction)的方式來做壓縮。Inter 編碼技術利用畫面間的關係來消除冗餘的資訊以達到減少資料量的目的。不論是用於 Intra 或 Inter,剩餘資訊(residual information)都被用來當作編碼和傳輸的對象。而剩餘資訊即為原始資料以及預測資料之間的差異。

為了影像壓縮的方便,視訊影片中的每張畫面都被以固定大小切割為一個一個的巨區塊(macroblock, MB)。在普遍使用的 4:2:0 的取樣方式下,每個巨區塊包含一組 16x16 個亮度樣本所成的區塊以及兩組 8x8 個彩度樣本所成的區塊。不論是亮度或彩度的樣本都會被用來實行空間上(spatial)或時間上(temporal)的預測,並將剩餘資訊編碼。

以預測方式來對每張畫面做區分,可分為 I 畫面、P 畫面以及 B 畫面。在 I 畫面中的每個巨區塊都必須以畫面內預測的方式來壓縮。在 P 畫面中,每個巨區塊可以對前一張畫面做畫面間的預測,也可以選擇做畫面內的預測。而在 B 畫面中,除了 P 畫面的功能外,還可對時間軸上位於後面的畫面做預測來達到更高的壓縮率。在 P 與 B 畫面中主要是以消除畫面間冗餘的資訊為主,但是當畫面間的變化超過一定程度時,選用畫面內預測才能夠保持連續影像的品質。

1.2 在 H.264 中主要的新技術

和以前的標準相比,H.264 能達到更高的壓縮效能是由於納入了幾個新的技術,例如:多重區塊大小的動態預估、多張畫面參考、四分之一像素的動態預估、方向性的畫面內預測、去區塊化的濾波器、整數離散餘弦轉換、與內容相關的算數編碼法等。

1.2.1 多重區塊大小的動態預估技術

H.264 將一個巨區塊用多種不同的大小來切割。圖表 1-1 顯示各種切割方式與其對應的模式編號。從圖中可以看出,模式 1 使用完整的 16x16 的區塊大小,意味著只有一個區塊被用於這個模式。模式 2 和 3 將巨區塊切割為等分的兩個部份,大小分別為 16x8 和 8x16。換句話說,模式 2 將巨區塊切為上下兩部分,模式 3 則切為左右兩部分。在模式 P8x8 之中,巨區塊被分為四個 8x8 的區域。每個區域可再以更小的區塊大小做切割,例如 8x8、8x4、4x8、4x4,因此每個區域將被各別考慮,彼此間為獨立的關係。由此我們可以將一個巨區塊的切割模式分為兩個層級,一個是所謂巨區塊層級(macroblock level),另一個則是次區塊層級(sub-macroblock level)。在巨區塊層級中包含四個模式以及一個特殊的 skip 模式,由圖可知 skip 模式使用與模式 1 相同的大小。在次區塊層級中則包含模式 4 到模式 7 四種,而四個次區塊將會進行相同的模式選擇,以決定出各別最佳的模式。

1.2.2 多畫面參考

有些時候最佳的預測結果並非落在前一張畫面。這是因為物體有可能進行反覆的運動,或是拍攝畫面的晃動所造成。此時,透過多張畫面參考讓更前面的畫面也能夠進行動態預估的判斷,將會得到一個更好的預測結果。

1.2.3 畫面內預測

畫面內預測意即利用同一張畫面上的資料對目前的巨區塊作預測。同一張畫面上的 資料通常是在該巨區塊附近且已經被編碼過和重建之後的資料。H.264 將一個巨區塊以 16x16 和 4x4 兩種大小來做畫面內預測,其中 4x4 即為一個巨區塊包含 16 個較小的區 塊。16x16 擁有四種方向性的預測方式,分別為垂直、水平、平均以及對角線的方向。

每個預測方式利用附近的資料填成一個 16x16 的區塊後,根據與原始資料的差異決定出 最好的結果。4x4 則擁有九種方向的預測方式,除了上述的四種之外,另有斜對角線的 預測方向可供選擇。

圖表 1-1 多種區塊大小與其對應的模式編號

1.3 動機與目的

根據上面所述,H.264 雖然能夠達到更高的壓縮效率,但是計算量卻有非常可觀的 增加。這對一個低計算能力的系統將是一個嚴重的負擔,尤其在一個需要即時運算的環 境下更是難以實現。而畫面間的預測仍然佔據了大部分的運算量,以此為由,本篇論文 提出一個快速演算法,用以減少需要被檢測的模式,在些微的效能降低之下,達到大量 時間上的縮短。


第 2 章 在 H.264 中的模式決策流程與相關研究

對於一個巨區塊的編碼流程可以大略分為兩個部份,一個是多重區塊大小的動態預 估,另一個是位元率與失真度的最佳化。其中前者的目的在於取得所有模式中每個區塊 的運動向量,後者的目的則是選取一個最佳的模式作為編碼的結果。

2.1 多重區塊大小的動態預估

全域搜尋(Full Search)是一個被大多數視訊壓縮標準所採用作為基礎的動態搜尋演算法。它走訪搜尋區域內的所有位置以得到最佳的預測結果。在 H.264 之中,除了基本的絕對差值總和(Sum of Absolute Difference, SAD)之外,全域搜尋法有幾項主要的改進方法,包含部份差異排除(Partial Distortion Elimination, PDE)、螺旋搜尋順序(Spiral Search Order)以及運動向量預測(Motion Vector Predictor)。

2.1.1 SAD 與 PDE

SAD 的目的是計算某一位置其所成的區塊與目前欲求動態向量的區塊之間的像素 值差異的總和。計算公式如下所示:

SAD(s, r) = Σ_{x=1, y=1}^{X, Y} | s(x, y) − r(x − mv_x, y − mv_y) |

其中 s 為原始訊號,r 為參考訊號,(mv_x, mv_y) 為搜尋範圍內的候選運動向量,X 與 Y 分別為區塊分割的寬與高。

在傳統的做法中,某一位置的 SAD 必須全部計算完畢再做比對。在 H.264 之中的改 進為每計算到一定程度便做一次比對。一旦累積的 SAD 值已經超過一個最小值時,剩餘 的部份就可以省略而達到時間的節省。換句話說,僅部分 SAD 值的計算便可決定是否該 排除多餘的計算量。這項技術即為 PDE。

2.1.2 螺旋搜尋順序與運動向量預測

由於畫面間的差異不會很大,而且彼此間的關係密切,造成此運動向量多半靠近搜尋區域的中心。根據前一節所述,要讓 PDE 能夠發揮最大的效果,最主要的方式便是盡快找到整個搜尋區域的最小值。若能越早找到這最小值,其後的位置所排除的計算量就會愈大。因此,由中心點開始向外擴展的螺旋狀搜尋方式搭配 PDE 技術使得計算量得到更進一步的節省。

運動向量預測是一個決定搜尋中心點的技術。由於相鄰的巨區塊間彼此的運動方式 有很高程度的相似,因此將搜尋中心點往可能的方向移動,同樣能夠獲得越快到達最佳 位置的效果。

2.1.3 在 H.264 中的成本計算方式

在 H.264 中提供三種成本(cost)的計算方式,分別為計算運動向量的成本、多畫面 參考的成本以及位元率與失真度的成本計算。運動向量的計算目的在於將向量的長度加 入考量之中。較長的向量意即離中心較遠的向量由於必須花費比較多的位元數用於編碼 上,意味著該向量有較高的成本。同樣的意義在多畫面參考上,大於一張的參考畫面必 須透過額外的資訊來記錄,更甚者參考較遠的畫面其所使用的資訊量必定大於參考較近 的畫面,其成本也就越大。以下列出運動向量的計算公式:

MV_COST = WeightedCost(f, mvbits[(cx << s) − px] + mvbits[(cy << s) − py])

其中 f 為 lambda factor,cx 與 cy 為動態預估的候選位置,px 與 py 為預測的搜尋中心位置,s 為運動向量的位移量(shift)。

而判斷最佳向量的計算則為 SAD 加上該候選向量的成本:

J_motion = SAD + MV_COST

2.2 位元率與失真度的最佳化

在獲得所有模式中每個區塊的向量之後,編碼器就可以針對每個模式重建出預測的結果。重建區塊的流程即為根據向量所指的位置取得剩餘資訊,再經過離散餘弦轉換以及量化的運算,之後將結果依反向流程轉換回來並加上預測位置的資料即所得。計算其與原始資料間的差異即為失真度。編碼器計算每個模式的位元率與失真度並選擇擁有最低成本的模式為最後編碼的結果。其中位元率代表在某個模式下最後編碼所需花費的位元長度。成本計算的公式如下:

J_mode = SSD + λ · Rate

其中 SSD 為失真度,以平方差的方式對亮度以及兩個彩度的資訊做運算,公式如下:

SSD(s, c, MODE|QP) = Σ_{x=1, y=1}^{16, 16} ( s_Y[x, y] − c_Y[x, y, MODE|QP] )²
                   + Σ_{x=1, y=1}^{8, 8} ( s_U[x, y] − c_U[x, y, MODE|QP] )²
                   + Σ_{x=1, y=1}^{8, 8} ( s_V[x, y] − c_V[x, y, MODE|QP] )²

其中 s_Y 與 c_Y 分別為原始與重建後的亮度資料,s_U、c_U 與 s_V、c_V 則為對應的兩個彩度資料。

2.3 相關研究

在[9]中,他們將巨區塊的內容區分成兩種,一種是同質性的區塊,一種是靜止的區塊。靜止的區塊意即在前後張畫面中幾乎沒有移動;而同質性的區塊意即區塊內的材質較為平滑,即使是運動物體也多半不會被切割成比較小的區塊。靜止性區塊的判斷方式是直接將巨區塊的像素與前一張畫面同位置的資料相減,若小於一個臨界值則為靜止性的區塊,反之則否。被判定為靜止性的區塊僅使用最大的區塊而將較小的區塊排除。同質性的判斷則透過 Sobel 的邊緣偵測技術來實行:若區塊內邊緣的強度較大則不是一個同質性區塊,使用全部的模式;反之,被判定為同質性的區塊則僅使用較大區塊的模式,例如 16x16、16x8 以及 8x16,而將擁有較小區塊的模式關閉。


第 3 章 提出的演算法

針對七種區塊大小來考慮,我們可以觀察到較小的區塊通常出現在一個運動物件的邊緣。這是因為巨區塊是固定 16x16 的大小,且通常不會只包含一個物件,例如一個物件和背景同時包含在一個巨區塊中。若物體的運動更加複雜,將巨區塊對半切割的模式(模式 2 或 3)不敷使用時,小的區塊將會得到較佳的預測結果。

類似於[9]中的想法,當一個區域並非靜止且內容的材質複雜時,該區域應該被切割為較小的區塊。換句話說,一個巨區塊若屬於一個移動物件的邊緣,那麼它應該被切割為較小的區塊,即使物件邊緣有可能剛好落在巨區塊的其中一邊而使得模式 2 或模式 3 成為較佳的切割方式。不過由於自然界中的形體通常是不規則的,這種情況並不常發生。總之,若我們能辨別出一個巨區塊擁有移動物件的邊緣,則可以決定該巨區塊應該使用較小的區塊模式,例如 P8x8。

根據一般視訊影片的特性,畫面間的差異是很小的。這意味著在前一張畫面中同位 置的巨區塊有非常高的相關性,而我們可以利用它來預測目前的巨區塊。為了加強模式 預測的準確度,我們不僅使用前一張畫面同位置的巨區塊,而且包含其周圍的鄰居都納 入考量的範圍。圖表 3-1 中的'C'即為在前畫面中同位置的巨區塊,其相鄰的八個鄰 居同樣成為候選。

為了敘述上的方便,我們對這九個候選者指定各別的編號。圖表 3-2 顯示'5'即為前畫面同位置的巨區塊。為了找出物件邊緣,我們利用四種 pattern 來檢查'C'是否在移動物件的邊緣上。圖表 3-3 顯示這樣的結果。每三個巨區塊收集成為一個巨區塊的集合,最後可得到四組巨區塊的集合,每一組即代表一種邊緣的方向,而且'C'必定是它們的中心。

考慮圖表 3-3 中的(A),候選巨區塊'4'、'5'和'6'被收集成為一個代表著水平方向的集合。我們可以利用這個集合來判斷'C'是否經過一個水平方向的物件邊緣。若是,則目前的巨區塊也非常有可能落在邊緣上。此外,(B)、(C)和(D)同樣代表了經過垂直、左上到右下、右上到左下的方向。

圖表 3-1 前一張畫面中同位置的巨區塊以及其鄰居

圖表 3-2 每個候選者的編號

1 2 3
4 5 6
7 8 9

為了找出哪一組巨區塊集合是真正的物件邊緣,我們對每個模式指定一個分數。接著,每個集合中的三個巨區塊其所屬的模式根據這個分數做一個加總,並放置在一個變數 "center_value" 中。每一組集合會得到自己的 center_value。最後,選出最大的 center_value 並檢查其是否超過一個臨界值。若是,則我們可以認為該集合是物件的邊緣並使用較小的區塊。同時,較大的區塊例如模式 1、2 和 3 則被關閉,僅使用模式 P8x8。

表格 3-1 顯示所有模式所給定的分數。

圖表 3-3 四個方向性的巨區塊集合

另一方面,若最大的 center_value 非常小,也就是說這些鄰居都是較大的區塊,甚至是 skip 模式,則我們可以認為他們應該都在靜止的區域或是背景上。同時,更多的模式就可以被省略。我們設計另一個臨界值來做這項檢查。

這兩個臨界值稱為 upper_thd 和 lower_thd。前者用來區分大的區塊與小的區塊,其中大的區塊代表 16x16、16x8 和 8x16,而小的區塊則是 P8x8。後者用來區分較大的區塊中僅使用 16x16 的模式。根據實驗的結果,upper_thd 設為 8 而 lower_thd 設為 2 可得到最佳的結果。這表示若目前的巨區塊被分為 P8x8,則 center_value 最大值所屬的集合其三個候選巨區塊必定有兩個或兩個以上的 P8x8 模式。
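上述的判斷流程可以用下面的 Python 片段簡單示意(僅為概念性的草稿:函式與變數名稱為本文假設,垂直與對角線集合的成員編號係由圖表 3-2 的九宮格推得,並非論文的實際程式):

```python
# 模式對應的分數(見表格 3-1);9、10 為 intra 模式
SCORE = {0: 0, 1: 1, 2: 2, 3: 2, "P8x8": 4, 9: 4, 10: 4}

# 四個方向性集合(見圖表 3-3;編號依圖表 3-2,'5' 為中心)
DIRECTIONAL_SETS = {
    "horizontal": (4, 5, 6),
    "vertical":   (2, 5, 8),
    "diag_tl_br": (1, 5, 9),
    "diag_tr_bl": (3, 5, 7),
}

UPPER_THD, LOWER_THD = 8, 2   # 由實驗求得的兩個臨界值

def select_modes(candidate_modes):
    """candidate_modes:編號 1..9 的候選巨區塊在前一張畫面所用的模式。"""
    center_value = max(
        sum(SCORE[candidate_modes[i]] for i in mb_set)
        for mb_set in DIRECTIONAL_SETS.values()
    )
    if center_value >= UPPER_THD:   # 可能為移動物件邊緣:僅檢查小區塊
        return ["P8x8"]
    if center_value <= LOWER_THD:   # 靜止區域或背景:僅檢查 16x16(模式 1)
        return [1]
    return [1, 2, 3]                # 其餘情況:檢查較大的區塊模式
```

例如,當水平集合 '4'、'5' 皆為 P8x8 時,center_value 即達到 upper_thd,目前的巨區塊便只需檢查 P8x8。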



mode        score
0 (skip)    0
1           1
2           2
3           2
P8x8        4
9 (intra)   4
10 (intra)  4

表格 3-1 模式與對應的分數

最後,我們的方法有幾個限制。在 I 畫面之後的第一個 P 畫面中,所有的巨區塊都 必須使用所有的模式。因為所有的候選巨區塊都在前一張的 I 畫面中。另外,在畫面外 圈的巨區塊也必須使用所有的模式。因為其候選巨區塊不足以用來決定模式的預測。圖 表 3-4 中淺灰色的部份表示必須使用所有模式的巨區塊。

圖表 3-4 演算法的限制


第 4 章 實驗結果

我們將提出的演算法實作在 H.264 的參考軟體上[5]。七個視訊短片被選為測試的 對象。這七個影片都是 QCIF 格式,並對前一百張畫面進行壓縮結果的測量。檢測的項 目有三種,分別為訊噪比(PSNR)、位元率(Bitrate)和時間的差異,列於英文論文的 Table 4.2。其中時間的差異位於上方的是整體的編碼時間,下方的是動態預估的時間。作為 比較用的方法是由 Wu 等人[9]以及 Ahmad 等人[10]所提出的演算法。

ΔPSNR (dB) = PSNR_proposed − PSNR_reference

透過觀察可以發現,我們所提出的演算法可以達到超過三成的整體時間的下降。而 Wu 等人的方法雖然在品質及位元率上有較為不錯的結果,但是在時間的縮短上與我們的 方法有不小的差距。原因在於 Wu 等人的演算法所檢查的模式較多,且計算邊緣強度花 費不少時間。

最後我們計算每個巨區塊平均使用的模式數量並列於英文論文的 Table 4.3。

ΔBitrate (%) = (Bitrate_proposed − Bitrate_reference) / Bitrate_reference × 100%

ΔTime (%) = (Time_proposed − Time_reference) / Time_reference × 100%
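上述三個比較指標都是簡單的差值或相對變化量,可示意如下(數值僅為說明用的假設,非論文的實驗結果):

```python
def delta_psnr(psnr_proposed, psnr_reference):
    """PSNR 變化量(dB);負值代表品質下降。"""
    return psnr_proposed - psnr_reference

def delta_percent(proposed, reference):
    """相對變化量(%);位元率與時間皆用此公式。"""
    return (proposed - reference) / reference * 100.0
```

例如整體編碼時間由 100 秒降為 69 秒,即為 -31% 的時間變化(降低 31%)。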


第 5 章 結論與未來展望

5.1 結論

本篇論文提出一個利用巨區塊在畫面間的關係的快速 Inter 模式決策演算法。由於畫面間的微小差異,前一張畫面同位置的巨區塊以及其相鄰的鄰居都被選作預測目前巨區塊模式的候選者。這個快速演算法最多檢查三個模式,並達到 31%-41% 的整體時間下降。

5.2 未來展望

未來的工作可以朝幾個目標邁進。第一、演算法中使用的兩個臨界值是固定的,這 可能無法完全適用在每個不同的影片中,因為每個影片的動態是不同的。第二、方法中 僅使用候選巨區塊的模式作為判斷目前區塊的依據,若能增加更多的特徵進來考慮,或 許能夠提升模式判斷的準確率。


英文論文本


Fast Inter Mode Decision Algorithm Based on MB Temporal Correlation in H.264

Advisor: Fang-Hsuan Cheng    Student: Shih-Shien Yang

August, 2007

A Thesis Submitted to the Chung-Hua University Hsin-Chu, Taiwan 300

Republic of China


Abstract

Variable block size inter coding is one of the key technologies in H.264/AVC. Seven block sizes partition a macroblock into several parts. When different objects contained in the same macroblock have different motions, smaller block sizes probably achieve better predictions. However, this feature results in extremely high computational complexity when all the block sizes are considered to decide the best one. This paper proposes a fast inter mode decision algorithm to reduce the number of inter modes that have to be checked, so that the encoding time is also reduced. We use the co-located macroblock in the previous frame and its neighbors as candidates, and check whether an edge of a moving object crosses the middle of these candidates by using scores given to the modes. The experimental results show that the fast inter mode decision algorithm is able to reduce total encoding time by 31%-41% and motion estimation time by about 41%-54%, with a negligible PSNR loss of 0.05 dB and a bit-rate increment of 2% on average.


Outline

Abstract ... I
Outline ... II
List of Figures ... IV
List of Tables ... VI
Chapter 1 INTRODUCTION ... 1
1.1 Overview of The Video Coding Layer ... 2
1.2 Main Innovative Features in H.264/AVC ... 4
1.2.1 Variable Blocks Size Motion Estimation (VBSME) ... 4
1.2.2 Multiple Reference Frame ... 6
1.2.3 Intra Prediction ... 7
1.3 Motivation ... 8
Chapter 2 Mode Decision in H.264 and Related Works ... 10
2.1 VBSME ... 10
2.1.1 SAD and PDE ... 11
2.1.2 Spiral Search Order and MVP ... 12
2.1.3 Cost Calculation in H.264 ... 15
2.2 RDO ... 17
2.3 Related Works ... 18


Chapter 3 Proposed Method ... 21
Chapter 4 Experimental Results ... 29
Chapter 5 Future Work and Conclusion ... 38
5.1 Conclusion ... 38
5.2 Future Work ... 38
REFERENCES ... 39


List of Figures

Figure 1.1 Variable block sizes and corresponding mode number ... 5
Figure 1.2 The advantage of multiple frame reference ... 6
Figure 1.3 Motion estimation with multiple reference frames ... 7
Figure 1.4 Intra_16x16 prediction modes ... 8
Figure 1.5 Intra_4x4 prediction modes ... 8
Figure 2.1 Partial Distortion Elimination ... 11
Figure 2.2 Spiral Search Order [7] ... 13
Figure 2.3 The distributions of the MV displacement for the video sequences [7], "Container" (A), "Mobile" (B) ... 13
Figure 2.4 Motion Vector Predictor ... 14
Figure 2.5 The five candidates in [10] ... 19
Figure 2.6 Homogeneous and stationary blocks ... 20
Figure 3.1 The co-located macroblock and its neighbors are candidates ... 22
Figure 3.2 The number of each candidate ... 23
Figure 3.3 The four directional MB sets ... 24
Figure 3.4 The limitation of algorithm ... 28
Figure 4.1 (a) RD curve of Akiyo ... 34
Figure 4.1 (b) RD curve of Coastguard ... 34


Figure 4.1 (c) RD curve of Container ... 35
Figure 4.1 (d) RD curve of Foreman ... 35
Figure 4.1 (e) RD curve of Mother & Daughter ... 36
Figure 4.1 (f) RD curve of Silent ... 36
Figure 4.1 (g) RD curve of Suzie ... 37


List of Tables

Table 4.1 The total encoding time and motion estimation time ... 30
Table 4.2 Changes of PSNR, Bit rate and time ... 32
Table 4.3 Average processed mode per macroblock ... 33


Chapter 1 INTRODUCTION

Video compression plays an important role in digital video communication, transmission and storage. A sequence of video coding standards has been proposed to serve this purpose, such as H.261, MPEG-1 Video, MPEG-2 Video, H.263 and MPEG-4 part 2: Visual.

These previous standards reflect the technological progress in video communication and the adaptation of video coding to different applications and networks. Application areas today range from videoconferencing over mobile TV and broadcasting of high-definition TV content up to very-high-quality applications such as professional digital video recording or digital cinema/large-screen imagery. Prior video coding standards are already established in parts of those application domains. But with the proliferation of digital video into new application spaces, the requirements for efficient representation of video have increased greatly, and previously standardized video coding technology can hardly keep pace. From this point of view, a new video coding standard focused on high coding efficiency was established.

H.264/AVC [1]-[4] is the latest video coding standard developed by the JVT (Joint Video Team) of the ISO/IEC Moving Picture Experts Group (MPEG) and the ITU-T Video Coding Expert Group (VCEG). The name H.264 comes from the H.26L family of VCEG, while AVC (Advanced Video Coding) refers to MPEG-4 Part 10. This standard has been designed to provide higher coding efficiency and network adaptation, and includes a Video Coding Layer (VCL) and a Network Abstraction Layer (NAL). The VCL represents the video content, while the NAL provides a network-friendly interface.

1.1 Overview of The Video Coding Layer

The video coding layer in H.264/AVC is similar in spirit to that of other video coding standards [3]. A coded video sequence in H.264/AVC consists of a sequence of coded pictures. A coded picture can represent either an entire frame or a single field, as was also the case for MPEG-2 video. The typical encoding operation for a picture begins with splitting the picture into blocks of samples. The first picture of a sequence or a random access point is typically coded in Intra (intra-picture) mode without using any other pictures as prediction references. For all remaining pictures of a sequence or between random access points, Inter (inter-picture) coding is typically utilized. Inter coding employs inter-picture temporal prediction using other previously decoded pictures. The encoding process for temporal prediction includes choosing motion data that identifies the reference pictures and spatial displacement vectors that are applied to predict the samples of each block. The residual of the prediction (either Intra or Inter), which is the difference between the original input samples and the predicted samples for the block, is transformed. The transform coefficients are then scaled and approximated using scalar quantization. The quantized transform coefficients are entropy coded and transmitted together with the entropy-coded prediction information for either Intra- or Inter-frame prediction.

Each picture of a video sequence is partitioned into fixed size macroblocks that each covers a rectangular picture area of 16 × 16 samples of the luminance component and, in the case of video in 4:2:0 chrominance sampling format, 8 × 8 samples of each of the two chrominance components. All luminance and chrominance samples of a macroblock are either spatially or temporally predicted, and the resulting prediction residual is represented using transform coding.

The macroblocks are organized in slices, which represent regions of a given picture that can be decoded independently of each other. There are several slice-coding types in H.264/AVC. In I slices all macroblocks are coded without referring to any other pictures of the video sequence. Prior coded images can be used to form a prediction signal for macroblocks of the predictive-coded P and B slice types, where P stands for predictive and B stands for bi-predictive. Another two slice types are SP and SI slices, which are specified for efficient switching between bitstreams coded at various rates.


1.2 Main Innovative Features in H.264/AVC

Compared to previous video coding standards, H.264/AVC achieves significant improvement in coding efficiency. This is due to the fact that a number of new techniques are adopted in this standard, such as:

• Variable Block Size (VBS) Motion Estimation,

• Multiple Reference Frames,

• Quarter-pixel Motion Estimation,

• Directional Prediction of Intra Coded Blocks,

• In-loop Deblocking Filter,

• Integer DCT Transform,

• Context-based Adaptive Binary Arithmetic Coding (CABAC).

As a result, H.264 can save about 64% of the bit-rate of MPEG-2 and about 38% of that of MPEG-4. Some of these techniques are introduced below.

1.2.1 Variable Blocks Size Motion Estimation (VBSME)

Motion estimation (ME) is the main method for removing redundant information between frames in many video standards. H.264, like other video encoders, adopts block-based motion estimation to find the best block matching within a pre-defined search area, and performs variable block size motion estimation to indicate individual moving objects in a macroblock. These seven block sizes, from 16x16 down to 4x4, partition a macroblock into several parts. They describe that different objects may move in different ways. Figure 1.1 shows the seven block sizes and the corresponding mode numbers. As shown in this figure, there is only one block in mode 1, and two blocks in mode 2 or mode 3. In mode P8x8, a macroblock is partitioned into four 8x8 blocks, and each of them is considered individually. Furthermore, we can divide these seven block sizes into two levels: macroblock level and sub-macroblock level. In macroblock level, there are four inter modes and an additional skip mode (mode 0) which uses the same size as mode 1. If the macroblock is processed in sub-macroblock level, it can be further partitioned into 8x8, 8x4, 4x8 and 4x4 block sizes. The same work is done in the four sub-macroblocks.
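The two partition levels described above can be summarized in a small table (a descriptive sketch following Figure 1.1; the tuple layout is an assumption of this note, not a JM data structure):

```python
# mode number -> (name, (width, height) of each partition, partitions per area)
MACROBLOCK_LEVEL = {
    0: ("skip",  (16, 16), 1),
    1: ("16x16", (16, 16), 1),
    2: ("16x8",  (16, 8),  2),
    3: ("8x16",  (8, 16),  2),
}
SUB_MACROBLOCK_LEVEL = {   # applied inside each 8x8 region of mode P8x8
    4: ("8x8", (8, 8), 1),
    5: ("8x4", (8, 4), 2),
    6: ("4x8", (4, 8), 2),
    7: ("4x4", (4, 4), 4),
}

# sanity check: every partitioning covers its whole 16x16 (or 8x8) area
for _name, (w, h), n in MACROBLOCK_LEVEL.values():
    assert w * h * n == 16 * 16
for _name, (w, h), n in SUB_MACROBLOCK_LEVEL.values():
    assert w * h * n == 8 * 8
```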

Figure 1.1 Variable block sizes and corresponding mode number


Figure 1.2 The advantage of multiple frame reference

1.2.2 Multiple Reference Frame

Sometimes the best prediction may not happen in the previous frame, because of the repetitive motion of a block or camera shaking (Figure 1.2). In this case, better predictions may be achieved by performing multiple reference frame motion estimation (MRF-ME) on more than one previously coded picture [7] (Figure 1.3). For multi-frame motion estimation, the encoder must store decoded reference pictures in a multi-picture buffer. Combined with variable block sizes, the reference index parameter, which indicates where the reference picture is located inside the multi-picture buffer, has to be transmitted for each motion-compensated 16 x 16, 16 x 8, or 8 x 16 macroblock partition or 8 x 8 sub-macroblock.


Figure 1.3 Motion estimation with multiple reference frames

1.2.3 Intra Prediction

Intra prediction [3][4] means that the samples of a macroblock are predicted using only information from already transmitted macroblocks of the same picture. In H.264/AVC, two different types of intra prediction are possible for the prediction of the luminance component. The first type is called INTRA_4x4 (Figure 1.5) and the second one is INTRA_16x16 (Figure 1.4). With the INTRA_4x4 type, the macroblock is divided into sixteen 4x4 sub-blocks and a prediction is applied to each 4x4 sub-block of the luminance signal individually. Nine different prediction modes are supported for INTRA_4x4, and four different prediction modes are supported for INTRA_16x16. Figure 1.4 shows the four directional predictions of INTRA_16x16, and Figure 1.5 shows the nine directions corresponding to the nine prediction modes. In addition, the intra prediction for the two chrominance components of a macroblock is similar to the INTRA_16x16 type for the luminance signal, because the chrominance signals are very smooth in most cases.

Figure 1.4 Intra_16x16 prediction modes

Figure 1.5 Intra_4x4 prediction modes

1.3 Motivation

According to the new techniques described above, H.264/AVC achieves higher coding efficiency than prior video coding standards. However, the large amount of computation greatly increases the encoding time; thus, it is difficult to use in practical applications, especially in real-time environments. It can be seen that inter modes still take the biggest part of the computation. For this reason, we propose a fast inter mode decision algorithm to reduce the encoding time with negligible loss of coding efficiency.

The rest of this paper is organized as follows. Section 2 introduces the implementation of inter mode decision in H.264 reference software and two related works. The proposed fast inter mode decision algorithm is described in Section 3. The experimental results are shown in Section 4. And a conclusion will be given in Section 5.


Chapter 2 Mode Decision in H.264 and Related Works

This chapter introduces the whole procedure of encoding a macroblock in inter mode. It can be broadly divided into two important parts: variable block size motion estimation (VBSME) and rate-distortion optimization (RDO). VBSME obtains the motion vectors of each block of every mode, while RDO is the main method for deciding the best mode for compression. Some of the differences in motion estimation between H.264 and prior standards are described first.

2.1 VBSME

Motion estimation is an important technology for removing the redundancy between frames, and it is performed with variable block sizes in H.264. Full Search (FS) [6] is a common motion estimation method adopted in many video encoders. It checks all the candidate positions in the search window and finds the absolutely best position as the motion vector. FS is one of the motion estimation methods adopted by H.264, and we select it for our experiment. Several techniques related to FS are used in H.264, such as Sum of Absolute Difference (SAD), Partial Distortion Elimination (PDE), Spiral Search Order and Motion Vector Predictor (MVP).


2.1.1 SAD and PDE

The SAD is computed by

SAD(s, r) = Σ_{x=1, y=1}^{X, Y} | s(x, y) − r(x − mv_x, y − mv_y) |,

where s is the original video signal and r is the reference signal. Each candidate motion vector considered in the search window is represented by (mv_x, mv_y), and X and Y indicate the width and the height of the block partition, respectively. Originally, the cumulative SAD for the entire block is computed by adding up the SADs of the individual partitions indicated by the chosen mode for the macroblock. The improvement is based on rejecting impossible candidates through inspection of the accumulated partial distortion: if the accumulated SAD for a position under inspection is greater than the minimum stored SAD found so far, that candidate position is rejected and the rest of the computation is saved. This technique is called Partial Distortion Elimination (PDE).

Figure 2.1 Partial Distortion Elimination


For example, as shown in Figure 2.1, we select one of the two blocks in mode 3, whose size is 8 pels in width and 16 pels in height; that is, there are eight pixels in each line and sixteen lines in this block. At the start of the search strategy, a minimum-SAD variable used for comparison is declared and initialized to a maximum value. Then at each position the accumulated SAD is computed line by line and compared to the minimum SAD. If the accumulated SAD is greater than the minimum SAD, the rest of this block can be skipped, because the distortion is already exceeded and this candidate position cannot be the best. Otherwise, if the computation of the accumulated SAD continues to the last line and the distortion is still smaller than the minimum SAD, this position becomes the best candidate so far and the minimum SAD is updated to this accumulated SAD. After that, the search strategy moves to the next candidate position and computes its SAD. Finally, the best candidate position in the search window, the one with minimum distortion, is found, and its vector indicates the best matching of this block.
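The line-by-line early termination above can be sketched in a few lines of Python (a minimal illustration of PDE, not the JM implementation; `block` and `ref` are lists of pixel rows, and the candidate list is assumed to be supplied in search order):

```python
def sad_with_pde(block, ref, min_sad):
    """Accumulate SAD line by line; abort once it reaches the best SAD so far."""
    acc = 0
    for cur_line, ref_line in zip(block, ref):
        acc += sum(abs(a - b) for a, b in zip(cur_line, ref_line))
        if acc >= min_sad:      # partial distortion already too large
            return None         # reject candidate, skip remaining lines
    return acc                  # full SAD: a new best candidate

def full_search(block, candidates):
    """candidates: iterable of (motion_vector, reference_block) pairs."""
    best_mv, min_sad = None, float("inf")
    for mv, ref in candidates:
        sad = sad_with_pde(block, ref, min_sad)
        if sad is not None:     # survived PDE, so it beats the current best
            best_mv, min_sad = mv, sad
    return best_mv, min_sad
```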

2.1.2 Spiral Search Order and MVP

Generally, the search range is ±W pels (typically W = 16 or 32) in the x and y directions extended from the center, so the number of candidate positions is (2W + 1)². In order to find motion vectors more efficiently, a search strategy with spiral order is adopted and combined with PDE, rather than a raster order.

In short, the search starts from the center and proceeds to the outer part of the window in spiral order. This is because the motion between the current and previous frames is usually small, so the motion vector should be close to the center. Following the example described above, if the search strategy finds a small minimum SAD quickly after starting from the center, a position whose accumulated SAD exceeds the minimum can be rejected earlier, with fewer lines inspected. That is, the earlier the minimum SAD is found, the more SAD computation is saved.
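One simple way to enumerate candidate positions from the center outward is to walk the square rings of the search window in order of increasing radius (a sketch of the idea; the exact traversal order inside each ring differs from the JM software):

```python
def spiral_positions(w):
    """Yield every offset in [-w, w] x [-w, w], ring by ring from the center."""
    yield (0, 0)
    for r in range(1, w + 1):
        for x in range(-r, r + 1):
            for y in range(-r, r + 1):
                if max(abs(x), abs(y)) == r:   # border of the ring only
                    yield (x, y)
```

Because positions near the center are visited first, a small minimum SAD tends to be found early, and PDE can then reject the outer positions cheaply.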

Figure 2.2 Spiral Search Order [7]

(A) (B)

Figure 2.3 The distributions of the MV displacement for the video sequences [7],

“Container” (A), “Mobile” (B).


The search center is determined by the Motion Vector Predictor (MVP), as in MPEG-4 and other prior standards. The MVP is a vector obtained from three neighboring macroblocks, calculated as their median value. The three macroblocks are located and denoted as shown in Figure 2.4, where A is on the left, B is on the top and C is on the top-right of the current macroblock. The reason for moving the center to a new position instead of using the location of the current macroblock is that the motion of the current macroblock is likely similar to that of the neighboring blocks. So if we move the center based on the neighboring motion vectors, the best position may be found quickly through the spiral search order.

p: predicted motion vector m: candidate motion vector

Figure 2.4 Motion Vector Predictor

In prior standards with a fixed block size, it is quite simple to obtain the MVP from the three candidates, because they are fixed neighboring macroblocks. But in H.264, it becomes complicated because of the variable block sizes.
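For the fixed-block-size case, the median rule can be written directly (a hypothetical helper for illustration; the variable-block-size rules of H.264 add special cases that are not shown):

```python
def median_mvp(mv_a, mv_b, mv_c):
    """Component-wise median of the neighboring motion vectors A, B and C."""
    med = lambda a, b, c: sorted((a, b, c))[1]
    return (med(mv_a[0], mv_b[0], mv_c[0]),
            med(mv_a[1], mv_b[1], mv_c[1]))
```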

2.1.3 Cost Calculation in H.264

H.264 supports three cost calculation criteria in the reference software: the motion vector (MV) cost, the reference frame (REF) cost and the rate-distortion (RD) cost.

The MV cost is calculated using a lambda factor and is defined as:

MV_COST = WeightedCost(f, mvbits[(cx << s) − px] + mvbits[(cy << s) − py]),   ...(A)

where f is the lambda factor, cx and cy are the candidate x and y positions for ME, px and py are the predicted x and y positions for ME, and s is the motion vector shift.

and the REF cost is likewise calculated using a lambda factor:

REF_COST = WeightedCost(f, refbits(ref)),   ...(B)

where f is the lambda factor.

In (A), px and py indicate the search center, which is determined by the MVP. Note that the motion vectors processed in the reference software are always in quarter-pel units, so each candidate position in integer pel must first be shifted by s (two bits) for the calculation. In both (A) and (B), WeightedCost() returns the cost for the bits of the motion vector and the reference frame, respectively. Finally, the RD cost is used to decide the best coding mode at the last stage of encoding a macroblock, as described in Section 2.2.

In H.264, motion estimation uses a cost function to determine the best candidate motion vector by adding up the SAD value and MV_COST. This cost function J for motion estimation is defined as:

J_motion = SAD + MV_COST,

and we use J to represent the cost value calculated by this formula below. The search strategy calculates MV_COST first at a candidate position and checks whether it is greater than the minimum stored J value found so far. If so, this candidate position is directly skipped; otherwise the SAD and PDE computation continues for the rest of the block. As in the example in Section 2.1.1, a minimum-J variable is declared and initialized to a maximum value. If the full block is completely computed and the J cost value is still smaller than the minimum J, the minimum J is updated rather than only the accumulated SAD.

Finally, the best motion vectors for every reference frame are stored. The motion costs belonging to each best motion vector of every reference frame are also stored, because JM [5] decides the best reference frame by adding up the motion cost and the REF cost. For example, for one block, if the option of multiple reference frames is enabled and the number of frames is set to five, there are five motion vectors and five motion costs obtained by motion estimation. Each cost is then added to the corresponding REF cost, and the minimum one is selected as the best reference frame.

Note that in our experiment the number of reference frames is set to 1, and the REF cost is 0. This is because the reference frame is exactly the previous one, so there is no need to record the best-reference-frame parameter for every block.

2.2 RDO

In H.264, the mode decision process is performed by minimizing the RD cost function J:

J_mode = SSD + λ · Rate,

where SSD is the sum of squared differences between the source video signal and the reconstructed video signal, given as

SSD(s, c, MODE|QP) = Σ_{x=1, y=1}^{16, 16} ( s_Y[x, y] − c_Y[x, y, MODE|QP] )²
                   + Σ_{x=1, y=1}^{8, 8} ( s_U[x, y] − c_U[x, y, MODE|QP] )²
                   + Σ_{x=1, y=1}^{8, 8} ( s_V[x, y] − c_V[x, y, MODE|QP] )²,

where s_Y[x, y] and c_Y[x, y, MODE|QP] represent the original and reconstructed luminance components; s_U, c_U and s_V, c_V are the corresponding chrominance components. Rate denotes the bit cost for encoding the motion vectors, the macroblock header, and all of the residual information. That is, the rate is calculated by taking into consideration the length of the stream after the last stage of encoding. Finally, λ is a Lagrangian multiplier, and it can be considered as a trade-off parameter between rate and distortion. The optimal coding efficiency can be achieved by checking all available modes and selecting the one which has minimum cost.
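The exhaustive RDO loop amounts to minimizing J over the candidate modes (a sketch; `rdcost` is a hypothetical callback that encodes the macroblock in the given mode and returns its distortion and bit cost):

```python
def best_mode(modes, rdcost, lam):
    """Pick the mode minimizing J = SSD + lambda * Rate."""
    best, best_j = None, float("inf")
    for mode in modes:
        ssd, rate = rdcost(mode)    # assumed: (SSD, bits) for this mode
        j = ssd + lam * rate
        if j < best_j:
            best, best_j = mode, j
    return best, best_j
```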

2.3 Related Works

A number of fast mode decision algorithms have been proposed for H.264/AVC video coding recently [8]-[18]. In [10], five macroblocks are selected as candidates using the concept of the 3DRS motion estimation algorithm proposed for prior standards. As shown in Figure 2.5, macroblocks A and B are located in the same frame as the current macroblock. Macroblocks C, D and E are located in the previous frame, and C is co-located with the current macroblock. The modes and motion vectors belonging to these candidates are used for checking the best prediction for the current macroblock. By calculating the rdcost described in Section 2.2, the candidate with minimum cost is chosen as the prediction result. This algorithm greatly reduces the computation of motion estimation by checking only a few vectors, but the accuracy of the modes and motion vectors is probably not very high.


Figure 2.5 The five candidates in [10]

In [9], macroblocks are classified into two types: stationary blocks and homogeneous blocks. A stationary block refers to "stillness" between consecutive frames in the temporal dimension, while a region is homogeneous if the textures in the region have similar spatial properties. Such blocks have similar motion and are very seldom split into smaller blocks (shown in Figure 2.6). Homogeneous blocks are determined using edge information.

An edge map is created for each frame using the Sobel operator. The amplitude of each edge is then calculated by summing the edge responses in the x and y directions. Finally, a block is judged homogeneous according to the total amplitude. Nevertheless, this additional computation reduces the overall time saving.
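The edge-map test can be sketched as follows (a minimal illustration of the idea in [9]; the amplitude is approximated as |Gx| + |Gy|, the threshold value is hypothetical, and [9] computes the map once per frame rather than per block):

```python
# 3x3 Sobel kernels for horizontal and vertical gradients.
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]

def edge_amplitude(block):
    """Sum of |Gx| + |Gy| over the interior pixels of a 2-D list."""
    h, w = len(block), len(block[0])
    total = 0
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = sum(SOBEL_X[j][i] * block[y + j - 1][x + i - 1]
                     for j in range(3) for i in range(3))
            gy = sum(SOBEL_Y[j][i] * block[y + j - 1][x + i - 1]
                     for j in range(3) for i in range(3))
            total += abs(gx) + abs(gy)
    return total

def is_homogeneous(block, threshold=100):
    """A block with little total edge amplitude is treated as homogeneous."""
    return edge_amplitude(block) < threshold
```

A flat block yields zero amplitude and is classified homogeneous, whereas a block containing a vertical intensity step produces a large amplitude and is not.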


Figure 2.6 Homogeneous and stationary blocks


Chapter 3 Proposed Method

Taking the properties of the seven block sizes into consideration, we can observe that small block sizes usually appear at the edges of moving objects. This is because the macroblock size is fixed at 16x16 and its content may not belong to just one object; for instance, an object and the background may be included in one macroblock.

Furthermore, if the motion of objects is more complex and halving a macroblock (mode 2 or mode 3) is not enough, small block sizes achieve better prediction by partitioning the macroblock into four 8x8 sub-macroblocks.

Similar to the idea of method [9] described in Section 2, a region with complex textures whose motion is not still should be partitioned into small block sizes. That is, a macroblock that lies on the edge of a moving object is more likely to be partitioned using small block sizes. Although mode 2 or mode 3 may be chosen when the edge of an object falls entirely into one side of the macroblock (i.e. one of the two blocks of mode 2 or mode 3), this seldom occurs because the shapes of natural objects are usually irregular. In other words, if a macroblock can be identified as containing the edge of a moving object, we can determine that this macroblock should be partitioned into small block sizes (i.e. the P8x8 mode).


According to the properties of common video sequences, there is little change between two consecutive frames. The co-located macroblock in the previous frame has the highest correlation, so it can be used to predict the mode of the current macroblock. To enhance the mode prediction, not only the co-located macroblock but also its neighbors are taken as candidates. Referring to Figure 3.1, the macroblock named 'C' is the co-located one in the previous frame, and the eight neighbors surrounding C are also chosen as candidates.

Figure 3.1 The co-located macroblock and its neighbors are candidates.


Figure 3.2 The number of each candidate.

For convenience, we assign a number to each of the nine candidates, as shown in Figure 3.2. The '5' is the co-located macroblock in the previous frame. The '2', '4', '6' and '8' are the top, left, right and bottom candidates relative to the '5' respectively, and the '1', '3', '7' and '9' are the top-left, top-right, bottom-left and bottom-right candidates relative to the '5' respectively. In order to find the edge, we use four patterns to check whether 'C' is at the edge of a moving object. The four patterns are shown in Figure 3.3. For each pattern we collect its three macroblocks as an MB set, and the middle one of each set is always 'C'. That is, we obtain four MB sets, each indicating its own direction.



Figure 3.3 The four directional MB sets

Considering (A) in Figure 3.3, macroblocks '4', '5' and '6' are collected into an MB set in the horizontal direction. With this MB set we can check whether 'C' crosses a horizontal edge of a moving object; if so, the current macroblock is probably at the edge. Likewise, (B), (C) and (D) have the same meaning in the vertical direction, the diagonal from top left to bottom right, and the diagonal from top right to bottom left, respectively.

To find out which MB set may contain the edge, we first assign a score to each mode. Then, we sum up the scores of the modes corresponding to the

three neighbors in each MB set and put the total into a variable called "center_value". Each MB set has its own center_value. Finally, we choose the maximum of the four center_values and check whether it is greater than a threshold. If so, the macroblock can be regarded as lying on an object edge, and smaller block sizes such as the P8x8 mode are used for the current macroblock, while large block sizes such as modes 1, 2 and 3 are disabled. Table 3.1 shows the score corresponding to each mode. We can observe that the more P8x8 macroblocks appear among the neighbors, the higher the probability that the current macroblock is coded as P8x8. Furthermore, if the P8x8 neighbors concentrate in one of the four MB sets, a moving edge most likely crosses in that direction.

mode        score
0           0
1           1
2           2
3           2
P8x8        4
9 (intra)   4
10 (intra)  4

Table 3.1 Modes and corresponding scores


On the other hand, if the maximum center_value is small enough, that is, all the neighbors use large block sizes, especially the skip mode, the macroblock can be regarded as background and more modes can be disabled. Another threshold is used for this check.

The two thresholds are called upper_thd and lower_thd. The former divides the inter modes into small and large block sizes, where the small one is the P8x8 mode and the large ones are 16x16, 16x8 and 8x16. The latter separates the 16x16 mode from these three. The algorithm is shown below:

if center_value > upper_thd:
    disable modes 1, 2 and 3
else if center_value < lower_thd:
    disable modes 2, 3 and P8x8
else:
    disable mode P8x8
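Putting the scoring and thresholding together, the mode pre-selection can be sketched as follows (a minimal illustration; the data structures and function names are ours, not the JM implementation, and the skip mode 0 is assumed to remain enabled in every case):

```python
# Mode scores from Table 3.1: 0=skip, 1=16x16, 2=16x8, 3=8x16, plus intra.
MODE_SCORE = {0: 0, 1: 1, 2: 2, 3: 2, 'P8x8': 4, 9: 4, 10: 4}

# The four directional MB sets of Figure 3.3, using the 1-9 candidate
# numbering of Figure 3.2 ('5' is the co-located macroblock).
MB_SETS = {
    'horizontal': (4, 5, 6),
    'vertical':   (2, 5, 8),
    'diag_tl_br': (1, 5, 9),
    'diag_tr_bl': (3, 5, 7),
}

def select_candidate_modes(neighbor_modes, upper_thd=8, lower_thd=2):
    """neighbor_modes: dict mapping candidate number (1-9) to the mode
    chosen in the previous frame. Returns the set of inter modes left
    enabled for the current macroblock."""
    center_value = max(
        sum(MODE_SCORE[neighbor_modes[n]] for n in mb_set)
        for mb_set in MB_SETS.values()
    )
    if center_value > upper_thd:        # a moving edge crosses here
        return {0, 'P8x8'}              # disable modes 1, 2 and 3
    elif center_value < lower_thd:      # background-like region
        return {0, 1}                   # disable modes 2, 3 and P8x8
    else:
        return {0, 1, 2, 3}             # disable P8x8 only
```

For example, three P8x8 neighbors in the horizontal set give a center_value of 12, so only the skip and P8x8 modes survive; an all-skip neighborhood gives 0 and leaves only the skip and 16x16 modes.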

According to the experimental results, setting the upper threshold to 8 and the lower threshold to 2 obtains the best performance. This means that, in the first case, the MB set with the maximum center_value must contain at least two P8x8 (or intra) macroblocks; in the second case, it may contain at most one 16x16 macroblock, with the remaining neighbors in skip mode.

Here we discuss all the possible combinations of the three neighbors belonging to the same MB set. If the center_value is greater than upper_thd, there are three possible values: 9, 10 and 12. Note that a center_value of 11 is impossible under this mode scoring. For 9, there must be two P8x8 (or intra) macroblocks and one 16x16 macroblock. For 10, there must be two P8x8 (or intra) and one 16x8 or 8x16, because mode 2 and mode 3 have the same score. For 12, there must be three P8x8 (or intra). On the other hand, if the center_value is smaller than lower_thd, there are two possible values: 0 and 1. For 0, all three neighbors must be in skip mode. For 1, only one 16x16 macroblock in the MB set is allowed.

Finally, there are some limitations in our algorithm. All macroblocks in the first P-frame following an I-frame must evaluate all the modes to decide the best one, because the candidates all lie in the previous frame. Likewise, the outer macroblocks located at the left, right, top and bottom of the frame are not handled by our algorithm either, because there are not enough neighbors to determine their modes. As shown in Figure 3.4, the macroblocks in light grey must evaluate all the modes.


Figure 3.4 The limitations of the algorithm
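The applicability test implied by these limitations can be sketched as follows (names are illustrative; a macroblock failing the test simply falls back to the full mode search):

```python
def fast_decision_applies(mb_x, mb_y, mb_w, mb_h, prev_frame_is_intra):
    """Return True if the fast mode decision can be used for the
    macroblock at (mb_x, mb_y) in a frame of mb_w x mb_h macroblocks.
    It is skipped for the first P-frame after an I-frame and for
    border macroblocks, which lack a full 3x3 candidate neighborhood
    in the previous frame."""
    if prev_frame_is_intra:
        return False
    on_border = (mb_x == 0 or mb_y == 0 or
                 mb_x == mb_w - 1 or mb_y == mb_h - 1)
    return not on_border
```

For a QCIF frame of 11x9 macroblocks, the interior 9x7 region leaves 63 of the 99 macroblocks eligible for the fast decision.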


Chapter 4 Experimental Results

The proposed mode decision method is implemented in the H.264 reference software JM10.1 [5]. Our test environment is a personal computer with a 2.4GHz Intel Pentium-IV CPU and 1GB of memory running the Windows XP Professional operating system. We select seven QCIF sequences commonly used for video compression testing in our experiments. A QCIF frame is 176 pixels wide and 144 pixels high, that is, eleven macroblocks wide and nine macroblocks high, for a total of ninety-nine macroblocks per frame.

The first 100 frames of every sequence are tested by our method. The GOP (Group of Pictures) structure is IPPP, i.e. the first picture is coded as an I-picture and the remaining pictures are coded as P-pictures. The Full Search algorithm is used in all experiments with a ±16-pel search window.
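The Full Search used here can be sketched as follows (a minimal illustration with SAD as the matching criterion and a reduced block size and search range; JM's implementation additionally includes a motion-vector rate term and sub-pel refinement):

```python
def sad(cur, ref, cx, cy, rx, ry, n):
    """Sum of absolute differences between the n x n block of the current
    frame at (cx, cy) and the reference frame block at (rx, ry)."""
    return sum(abs(cur[cy + j][cx + i] - ref[ry + j][rx + i])
               for j in range(n) for i in range(n))

def full_search(cur, ref, cx, cy, n, search_range):
    """Exhaustively test every candidate within +/-search_range pels and
    return (best_mv, best_sad). Frames are 2-D lists of pixel values."""
    h, w = len(ref), len(ref[0])
    best = (None, float('inf'))
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            rx, ry = cx + dx, cy + dy
            if 0 <= rx and 0 <= ry and rx + n <= w and ry + n <= h:
                cost = sad(cur, ref, cx, cy, rx, ry, n)
                if cost < best[1]:
                    best = ((dx, dy), cost)
    return best
```

Moving a bright patch by (+2, +2) between the reference and current frames makes the search report the inverse displacement (-2, -2) with zero SAD.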

Firstly, we list two totals derived from the JM reference software: the total encoding time and the total motion estimation time of each sequence. The total encoding time counts all the time needed to encode from the first frame to the last, while the total motion estimation time counts only the motion search in P-frames. The main difference between them is the time spent reconstructing macroblocks for mode decision, which includes both inter and intra modes. From the table, we can observe that motion estimation takes about 50%-60% of the total encoding time, not the 80% commonly assumed for prior standards. This may be caused by the number of intra modes in H.264 and the improvement of Full Search.

sequence          Total_Encoding_Time (sec)   Total_ME_Time (sec)
akiyo             127.686                      52.030
coastguard        207.802                     119.152
container         153.245                      76.112
foreman           169.844                      89.693
mother&daughter   138.190                      64.234
silent            153.073                      73.527
suzie             156.942                      86.407

Table 4.1 The total encoding time and motion estimation time

In the second experiment, we inspect the PSNR change, the bit rate change in percent, and the time change in percent. Positive values mean increments whereas negative values mean decrements. For the time change, we list both the total encoding time and the motion estimation time. The changes are calculated by:

ΔPSNR (dB) = PSNR_proposed − PSNR_reference

ΔBitrate (%) = (Bitrate_proposed − Bitrate_reference) / Bitrate_reference × 100%

ΔTime (%) = (Time_proposed − Time_reference) / Time_reference × 100%

As shown in Table 4.2, for each sequence the two time changes are the total encoding time and the motion estimation time. For comparison, we also list the results of two other methods, [9] and [10].
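The three change metrics can be computed directly (a minimal sketch; the function names are ours, not from the JM software):

```python
def delta_psnr(psnr_proposed, psnr_reference):
    """PSNR change in dB; a negative value means quality loss."""
    return psnr_proposed - psnr_reference

def delta_percent(proposed, reference):
    """Relative change in percent, used for both bit rate and time."""
    return (proposed - reference) / reference * 100.0
```

Halving an encoding time, for instance, corresponds to a ΔTime of -50%, and a drop of 0.05 dB in PSNR appears as ΔPSNR = -0.05.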


Ahmad’s [10] Wu’s [9] Proposed

sequence ΔPSNR

(db)

ΔBitrate

(%)

ΔTime

(%)

ΔPSNR

(db)

ΔBitrate

(%)

ΔTime

(%)

ΔPSNR

(db)

ΔBitrate

(%)

ΔTime

(%)

akiyo -0.19 11.12 -48.89

-72.42

-0.02 0.71 -24.73

-32.80

-0.07 2.17 -36.22

-49.81

coastguard -0.11 4.94 -57.09

-73.80

-0.02 1.28 -26.82

-36.01

-0.03 1.13 -33.59

-44.57

container -0.11 5.46 -52.84

-71.40

-0.01 1.44 -20.69

-21.09

-0.05 1.46 -41.36

-54.14

foreman -0.20 10.04 -55.40

-73.33

-0.09 2.14 -27.13

-33.31

-0.08 2.84 -31.39

-41.32

M&D -0.16 8.25 -57.00

-77.27

-0.08 0.75 -37.90

-48.35

-0.08 0.46 -36.41

-47.30

silent -0.16 14.03 -54.38

-75.41

-0.06 2.46 -26.36

-37.23

-0.04 4.14 -36.53

-50.82

suzie -0.15 7.78 -55.05

-74.85

-0.04 2.18 -35.89

-47.37

-0.01 2.75 -33.71

-43.87

Table 4.2 Changes of PSNR, Bit rate and time


We can observe that the total encoding time saving of our proposed method exceeds 31% in all sequences, with negligible quality loss and bit rate increase.

The time saving of Wu's method is smaller because it checks more modes during the mode decision process. Moreover, the computation of the edge map cannot be ignored.

The bit rate increase for akiyo is a little high because the motion in this sequence is very small. In fact, at the beginning of the sequence, akiyo does not move for a while; as a result, the motion is difficult to detect.

In the third experiment, we calculate the average number of modes processed per macroblock, shown in Table 4.3. Finally, the RD curves of all sequences are shown in Figure 4.1.

sequence Wu’s Proposed

akiyo 2.97 2.01

coastguard 3.17 3.07

container 3.09 1.89

foreman 3.02 3.13

mother&daughter 2.88 2.38

silent 3.08 2.54

suzie 2.98 3.03

Table 4.3 Average processed mode per macroblock


Figure 4.1 ( a ) RD curve of Akiyo

Figure 4.1 ( b ) RD curve of Coastguard


Figure 4.1 ( c ) RD curve of Container

Figure 4.1 ( d ) RD curve of Foreman


Figure 4.1 ( e ) RD curve of Mother & Daughter

Figure 4.1 ( f ) RD curve of Silent


Figure 4.1 ( g ) RD curve of Suzie


Chapter 5 Conclusion and Future Work

5.1 Conclusion

This paper presented a fast inter mode decision algorithm that uses the temporal correlation of macroblocks in a video sequence. Because there is little change between frames, the co-located macroblock in the previous frame and its neighbors are used to predict the mode of the current macroblock. The proposed fast inter mode decision algorithm uses at most 3 modes as candidates when checking for the best one, and reduces the total encoding time by 31%-41%, with a negligible quality loss of about 0.05 dB and a bit-rate increase of about 2% on average.

5.2 Future Work

The two thresholds used in our algorithm are fixed. This may not always be suitable for every sequence, because the motion of each one is quite different. In addition, we use only the modes of the neighbors as candidates in our algorithm; if more features could be used for mode decision, the accuracy of predicting the mode of the current macroblock might be improved further.


REFERENCES

[1] J. Ostermann, J. Bormans, P. List, D. Marpe, M. Narroschke, F. Pereira, T. Stockhammer, and T. Wedi, "Video Coding with H.264/AVC: Tools, Performance, and Complexity," IEEE Circuits and Systems Magazine, Vol. 4, No. 1, pp. 7-28, First Quarter 2004.

[2] Gary J. Sullivan and Thomas Wiegand, "Video Compression – From Concepts to the H.264/AVC Standard," Proceedings of the IEEE, Vol. 93, No. 1, pp. 18-31, Jan. 2005.

[3] D. Marpe, T. Wiegand, and Gary J. Sullivan, "The H.264/MPEG-4 Advanced Video Coding Standard and its Applications," IEEE Communications Magazine, Vol. 44, No. 8, pp. 134-143, Aug. 2006.

[4] Iain Richardson, "H.264/MPEG-4 Part 10," in H.264 and MPEG-4 Video Compression, John Wiley & Sons, 2003, ch. 6, pp. 159-223.

[5] H.264/AVC Reference Software JM version 10.1, http://iphome.hhi.de/suehring/tml/download/old_jm/

[6] Chen-Fu Lin and Jin-Jang Leou, "An Adaptive Fast Full Search Motion Estimation Algorithm for H.264," IEEE International Symposium on Circuits and Systems 2005, Vol. 2, pp. 1493-1496, May 2005.

[7] Yeping Su and Ming-Ting Sun, "Fast Multiple Reference Frame Motion Estimation for H.264," 2004 IEEE International Conference on Multimedia and Expo (ICME), Vol. 1, pp. 695-698, June 2004.

[8] Libo Yang, Keman Yu, Jiang Li, and Shipeng Li, "An Effective Variable Block-Size Early Termination Algorithm for H.264 Video Coding," IEEE Transactions on Circuits and Systems for Video Technology, Vol. 15, No. 6, pp. 784-788, June 2005.

[9] D. Wu, F. Pan, K. P. Lim, S. Wu, Z. G. Li, X. Lin, S. Rahardja, and C. C. Ko, "Fast Intermode Decision in H.264/AVC Video Coding," IEEE Transactions on Circuits and Systems for Video Technology, Vol. 15, No. 6, pp. 953-958, July 2005.

[10] N. A. Khan, S. Masud, and A. Ahmad, "A Variable Block Size Motion Estimation Algorithm for Real-time H.264 Video Encoding," Signal Processing: Image Communication, Vol. 21, No. 4, pp. 306-315, April 2006.

[11] Christos Grecos and Ming Yuan Yang, "Fast Inter Mode Prediction for P Slices in the H.264 Video Coding Standard," IEEE Transactions on Broadcasting, Vol. 51, No. 2, pp. 256-263, June 2005.

[12] Jongmin You, Wonkyun Kim, and Jechang Jeong, "16x16 Macroblock Partition Size Prediction for H.264 P Slices," IEEE Transactions on Consumer Electronics, Vol. 52, No. 4, pp. 1377-1383, November 2006.
