Bi-prediction Combining Template and Block Motion Compensations

National Chiao Tung University
Institute of Computer Science and Engineering
Master's Thesis

Student: Chung-Lin Lee
Advisors: Prof. Wen-Hsiao Peng, Prof. Jimmy J.M. Tan


Bi-prediction Combining Template and Block Motion Compensations

Student: Chung-Lin Lee
Advisor: Wen-Hsiao Peng
Advisor: Jimmy J.M. Tan

National Chiao Tung University
Institute of Computer Science and Engineering
Master's Thesis

A Thesis

Submitted to Institute of Computer Science and Engineering College of Computer Science

National Chiao Tung University in partial Fulfillment of the Requirements

for the Degree of Master

in

Computer Science

October 2011

Hsinchu, Taiwan, Republic of China

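Bi-prediction Combining Template and Block Motion Compensations

Student: Chung-Lin Lee    Advisor: Wen-Hsiao Peng

Institute of Computer Science and Engineering, National Chiao Tung University (Master's Program)

Abstract (Chinese)

This thesis presents a bi-prediction method that combines template matching prediction (TMP) and block motion compensation (BMC). The method uses a table of pixel-adaptive weighting coefficients, generated by overlapped block motion compensation (OBMC), to combine every pixel of the two prediction blocks obtained from TMP and BMC. How to design these weights for the best prediction performance is therefore the key part of the technique. To determine the weights, this thesis adopts a parametric theoretical model from which the optimal weighting coefficients are derived. The conventional motion vector estimation is modified accordingly, so that the weights are already taken into account during motion estimation to obtain a more accurate prediction. To verify the reliability and compression performance of the proposed method, a series of experiments is designed that seeks a balance between compression performance and computational complexity. Experimental results show that, under the best trade-off, the method achieves an average BD-rate saving of 2.2%, ranging from 1.1% to 4.1%, while the encoding time increases by 46% and the decoding time by 33%.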


Bi-prediction Combining Template and Block Motion

Compensations

Student : Chung-Lin Lee Advisor : Wen-Hsiao Peng

Institute of Computer Science and Engineering

National Chiao Tung University

ABSTRACT

This thesis introduces a bi-prediction scheme based on a joint application of template and block motion compensations. Since the template motion can be inferred at the decoder side, this scheme needs only the same motion overhead as uni-directional prediction. Two predictors derived from the template and block matchings are weighted in a pixel-adaptive manner using overlapped block motion compensation (OBMC). From an analytical aspect, we provide an optimal design of the window function in a parametric OBMC framework to further improve the efficiency of inter-frame prediction. In view of the tradeoff between performance and complexity, we discuss the impact on the prediction efficiency of the proposed scheme when the number of template shapes and the number of motion hypotheses are changed. A fast template search is also provided, which greatly reduces the complexity at the decoder side. Compared with HM-3.0, the proposed scheme achieves an average BD-rate saving of 2.2%, with a minimum of 1.1% and a maximum of 4.1%. The encoding time increases by 46%, while the decoding time increases by 33%.

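Acknowledgments

Looking back on my two years of graduate study, I would first like to thank my advisor, Dr. Wen-Hsiao Peng (彭文孝), for his guidance in my research. His commitment to seeking truth from facts, his thorough analysis of problems, and his persistent way of getting to the bottom of every question have become my model in learning and research. Next, I thank my seniors Dr. 陳渏紋 and Dr. 陳俊吉, as well as my close friend Dr. 吳崇豪, who tirelessly discussed my work with me, offered many valuable opinions, and gave timely advice that corrected my research direction whenever it drifted, so that these two years of master's study were no longer an uphill struggle. I extend my heartfelt gratitude to all four of them. I was honored to join the Multimedia Architecture and Processing Laboratory, where I could keep learning in an excellent environment alongside enthusiastic and friendly lab members; it has been the most fulfilling time since my undergraduate years. I thank my seniors Dr. 陳渏紋, Dr. 陳俊吉, Dr. 詹家欣, 王澤瑋, 吳思賢, 蔡閏旭, and 楊復堯 for leading me into graduate research; my classmates Dr. 吳崇豪, 曾于真, 黃嘉彥, and 陳孟傑, who could always pinpoint the heart of a problem in coursework or research and offer the most direct help; and my juniors 陳彥宇, 吳牧軒, 吳昱興, 朱弘正, and 王信硯 for their selfless assistance during this final year. Finally, I thank my parents, Mr. 李廷欽 and Ms. 蔡素珠, for raising me and for supporting me wholeheartedly in the pursuit of my master's degree, which spared me many worries. I thank my younger brother, 李文郁, for his brotherly care. To my teachers, family, and friends: your support gave me the confidence to earn this degree. Thank you.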

List of Tables

2.1 Comparisons of tool features between H.264/AVC and HEVC
4.1 Common test conditions
4.2 Experimental settings of TB-mode
4.3 The start position a and b for various 2Nx2N PU sizes
4.4 BD-rate savings and processing time ratios of TB-mode with 3- and 5-shape-adaptive configurations
4.5 BD-rate savings and processing time ratios of TB-mode with theoretical and heuristic window functions
4.6 BD-rate savings and processing time ratios of enabling multiple-hypotheses
4.7 BD-rate savings and processing time ratios of TB-mode after applying fast algorithms

List of Figures

1.1 Joint application of TMP and BMC
2.1 Basic concept of MRG
2.2 Basic concept of TMP
2.3 Motion sampling and prediction error of BMC
2.4 Motion sampling and prediction error of TMP
3.1 Joint application of TMP and BMC
3.2 (a) Geometric relationship between the TMP centroid and s_b. (b) SMSE surface as a function of the location of pixel s_b
3.3 Window functions for typical template designs

CHAPTER 1 Introduction

1.1 Research Overview

With the growth of high definition (HD) and even Ultra HD acquisition and display technologies, the anticipated need for higher coding efficiency has led to the development of a new video coding standard, named High Efficiency Video Coding (HEVC). This standard, which is currently under joint development by the Moving Picture Experts Group of the International Organization for Standardization and International Electrotechnical Commission (ISO/IEC MPEG) and the Video Coding Experts Group of the International Telecommunication Union Telecommunication Standardization Sector (ITU-T VCEG), aims to provide higher compression efficiency (about 50% bit-rate reduction) than the state-of-the-art H.264/AVC standard while being capable of operating in low-complexity or high-efficiency modes. Its applications range from mobile HD video to Ultra HD video, covering frame resolutions from WVGA (800 × 480) to 4K × 2K and beyond, such as 8K × 4K.

Using the data accessible to the decoder for motion inference has recently emerged as a promising technique for the next-generation video coding standard. Template matching prediction (TMP) [1][2][3] is a typical example: it estimates the motion vector (MV) for a target block on the decoder side by minimizing the matching error over the reconstructed pixels in its immediate inverse-L-shaped neighborhood (usually termed the template). Overlapped block motion compensation (OBMC) [4][5], first proposed more than one and a half decades ago, follows the same rationale. It utilizes the received motion data as a source of information about the motion field and forms a better prediction of a pixel's intensity based on its own and nearby block MVs. Motivated by these investigations, we developed a bi-prediction scheme that requires the same motion cost as uni-directional prediction. The idea is to combine the predictors resulting from the template and block matchings using OBMC. Of particular interest in this combination is that the MV derived from TMP is inferred at the decoder side. Conceptually, the bi-prediction scheme signals only one block motion while retaining almost the same performance as the conventional bi-prediction. Moreover, a modified block matching criterion is proposed to optimize the motion parameters to be signaled, taking into account the contribution of the MVs inferred by TMP.

Figure 1.1: Joint application of TMP and BMC.


1.2 Problem Statement

Fig. 1.1 depicts the basic concept of our bi-prediction scheme. Like the conventional bi-prediction, our scheme predicts a target prediction unit (PU)¹ using two MVs, $\mathbf{v}_b$ and $\mathbf{v}_t$, where $\mathbf{v}_b$ is explicitly signaled and $\mathbf{v}_t$ is derived by TMP. Since $\mathbf{v}_t$ cannot be chosen freely, it is important to find the $\mathbf{v}_b$ that, when applied jointly with $\mathbf{v}_t$, minimizes the mean squared prediction error:

\[
\mathbf{v}_b^* = \arg\min_{\{\mathbf{v}_b\},\{w_b(\mathbf{s})\}} \sum_{i=1}^{N} \sum_{\mathbf{s}\in B_i} \big( I_k(\mathbf{s}) - w_t(\mathbf{s})\, I_{k-1}(\mathbf{s}+\mathbf{v}_{t,i}) - w_b(\mathbf{s})\, I_{k-1}(\mathbf{s}+\mathbf{v}_{b,i}) \big)^2, \tag{1.1}
\]

where $\mathbf{v}_{t,i}$ and $\mathbf{v}_{b,i}$ are the template and block motions of a specific target block $B_i$, and $I_k$ and $I_{k-1}$ are the current frame and a previously coded reference frame, respectively. The symbol $\mathbf{s}$ denotes a pixel position in a target PU. $w_b(\mathbf{s})$ and $w_t(\mathbf{s})$ denote the window functions associated with every $\mathbf{v}_{b,i}$ and $\mathbf{v}_{t,i}$, where $w_b(\mathbf{s}) = 1 - w_t(\mathbf{s})$. Since $w_b(\mathbf{s})$ has a decisive effect on the prediction performance, the problem becomes one of jointly optimizing $w_b(\mathbf{s})$ and $\mathbf{v}_b$.
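As a rough illustration of the cost inside (1.1) for a single block, the following Python sketch evaluates the weighted squared prediction error for given candidate motions. The function name, the (row, column) array layout, the restriction to integer-pel motion, and the toy frames in the usage note are assumptions made for illustration; they are not taken from the HM software.

```python
import numpy as np

def bipred_sse(cur, ref, top, left, size, v_t, v_b, w_b):
    """Squared error of one size x size block predicted as
    w_t * ref(s + v_t) + w_b * ref(s + v_b), with w_t = 1 - w_b.

    v_t and v_b are integer (dy, dx) motion vectors; w_b is a
    size x size array of per-pixel weights (hypothetical layout).
    The caller must keep all displaced blocks inside the frame.
    """
    w_b = np.asarray(w_b, dtype=np.float64)
    w_t = 1.0 - w_b
    target = cur[top:top + size, left:left + size].astype(np.float64)
    pred_t = ref[top + v_t[0]:top + v_t[0] + size,
                 left + v_t[1]:left + v_t[1] + size]
    pred_b = ref[top + v_b[0]:top + v_b[0] + size,
                 left + v_b[1]:left + v_b[1] + size]
    residual = target - w_t * pred_t - w_b * pred_b
    return float(np.sum(residual ** 2))
```

For instance, if the current frame is a pure translation of the reference, calling `bipred_sse` with both motions equal to that translation gives zero error for any window, whereas a wrong `v_b` gives a positive error that depends on `w_b`. An encoder-side search in the spirit of (1.1) would evaluate this cost over candidate `v_b` values with `v_t` fixed by TMP.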

1.3 Contribution

Specifically, our main contributions are as follows:

• A bi-prediction scheme that requires the same motion cost as uni-directional prediction.

• An analytical interpretation of block-based motion-compensated prediction.

• A model-based approach that determines an optimal OBMC window function.

• A modified motion search criterion that makes the best of both MVs' abilities in reducing prediction errors.

Experimental results indicate that our best scheme achieves an average BD-rate saving [6] of 2.2%, with a minimum of 1.1% and a maximum of 4.1%. Its encoding time increases by 46%, while the decoding time increases by 33%. Regardless of complexity, the highest coding gain is an average of 2.9%, with a minimum of 1.3% and a maximum of 5.2%.

¹ The coding unit (CU), the basic compression unit analogous to the macroblock (MB) in AVC, has various sizes but is restricted to a square shape. The PUs are the partitions of a CU, which have a square or rectangular shape in several sizes.

1.4 Organization

The rest of this thesis is organized as follows: Chapter 2 briefly introduces the current status of HEVC, then reviews the motion sampling and reconstruction issues, and parametric OBMC framework. Chapter 3 details the proposed technique. Chapter 4 provides experimental results. Finally, this thesis is concluded with a summary of our work.


CHAPTER 2 Background

2.1 Overview of High Efficiency Video Coding

High Efficiency Video Coding (HEVC) is a draft video compression standard, a successor to H.264/MPEG-4 AVC (Advanced Video Coding), currently under joint development by ISO/IEC MPEG and ITU-T VCEG. MPEG and VCEG have established a Joint Collaborative Team on Video Coding (JCT-VC) to develop the HEVC standard. In February 2010, the JCT-VC issued a call for proposals (CfP), and 27 proposals were submitted to the ITU/ISO joint committee to compete for the next-generation video standard. The proposal evaluation results presented at the April 2010 JCT-VC meeting indicated that a better coding scheme is possible, and thus the HEVC work item was launched. The development of HEVC is still in progress, and the new-generation video standard is expected to be defined in 2012.

After the CfP competition in the 1st JCT-VC meeting, the Test Model under Consideration (TMuC) was constructed mainly from the best performer's code base and the other top-performing HEVC proposals. It serves as a good starting point at the very beginning of the collaborative phase and aims at creating a minimum set of well-tested tools to establish the HEVC Test Model (HM). In this thesis, our proposed scheme is implemented on the HM version 3.0 (HM-3.0) software.

Table 2.1: Comparisons of tool features between H.264/AVC and HEVC.

| Feature | H.264/AVC | HEVC |
|---|---|---|
| Coding, Prediction and Transform Unit | | |
| Coding Unit | 16 × 16 Macroblock | Variable; Large size (8 × 8 to 64 × 64) |
| Prediction Unit | Quadtree-based structure | Irregular partitioning; Large size |
| Transform Unit | 4 × 4 and 8 × 8 | Rectangular; Large size |
| Inter Prediction | | |
| MVp Derivation | Median | Advanced motion vector prediction (AMVP) |
| Motion Inference | DIRECT, SKIP | Motion merging mode (MRG) |
| Interpolation Filter | 6-tap FIR; Bilinear filter | 8-tap DCT-based interpolation filter (DCT-IF) |
| Intra Prediction | | |
| Directional Prediction | At most 8 directions | Angular intra prediction (34 directions) |
| Chroma Prediction | Independent prediction | Refer to reconstructed luma samples |
| Transform, Quantization, In-loop Filter, Entropy Coding | | |
| Transform | Integer DCT | Residual quad-tree transform (RQT) |
| Quantization Matrix | Fixed | Context-adaptive selection |
| Adaptive Loop Filter | No | Yes |
| Entropy Coding | VLC; CAVLC; CABAC | Modified CAVLC; CABAC |
| Internal Bit Depth Increase | | |
| Bit Depth | 8 bits | 8/10 bits |

To compare the current HEVC base structure and coding tools with those of H.264/AVC, we summarize the tool features in Tab. 2.1. The features that are relevant to our research are further described in the following subsections.

2.1.1 Coding, Prediction and Transform Units

The basic unit for HEVC compression, referred to as the coding unit (CU), is usually an $N \times N$ square region of a frame, ranging from 8 × 8 to 64 × 64 luma samples and from 4 × 4 to 32 × 32 chroma samples. It may contain several prediction units (PUs) and transform units (TUs) for inter/intra prediction and transform/residual coding, respectively. By comparison, the counterpart of the CU in H.264/AVC is the macroblock (MB), which covers 16 × 16 luma and 8 × 8 chroma samples; its PUs are the variable MB partitions, which have a square or rectangular shape in several sizes. In HEVC, the sizes and shapes of CUs, PUs, and TUs become more flexible.

Figure 2.1: Basic concept of MRG.

2.1.2 Inter Prediction

Inter prediction is crucial to the performance of video compression. Variants of the inter prediction concept associated with MVs could potentially contribute to many new designs in HEVC. In H.264/AVC, an MV is predictively coded using a motion vector predictor (MVp), which is formed as the median of the available nearby MVs. In HEVC, advanced motion vector prediction (AMVP) [7] introduces an adaptive motion vector prediction technique that exploits the spatial and temporal correlation of motion vectors with neighboring PUs. It constructs an MVp candidate set by inferring the available MVs from the left, top, and co-located PUs with the same reference list and reference frame. The encoder selects the best MVp from the candidate set and explicitly transmits the corresponding index.

The SKIP and Direct modes in H.264/AVC are extended to the motion merging mode (MRG) [8] in HEVC. MRG derives its motion information from neighboring PUs. In the current design, the inferred motion parameters come from the left, above, co-located, above-right, and bottom-left PUs, as depicted in Fig. 2.1. Those neighboring PUs that have available motion parameters, termed merge candidates, are grouped into a merge candidate set. The index of the chosen candidate is transmitted to the decoder to indicate which candidate to use for motion inference. Furthermore, MRG can be coded either with or without a residual.

The existing sub-pel interpolation method in H.264/AVC has been improved by redesigning the filter coefficients. In HEVC, an 8-tap DCT-based interpolation filter (DCT-IF) provides fractional-pel interpolation by replacing the combination of Wiener and bilinear filters with a set of interpolation filters at the desired fractional accuracy. More specifically, instead of the combination of 6-tap and bilinear filtering procedures in H.264/AVC, only one filtering procedure is needed to interpolate a pixel at any fractional accuracy. Thus, the motion compensation process is simpler from an implementation point of view, and the complexity is also reduced for quarter-pel accuracy.

Figure 2.2: Basic concept of TMP.

2.2 Template Matching Prediction

Template Matching Prediction (TMP) [1][2][3] is one realization of decoder-side motion vector derivation. As can be seen in Fig. 2.2, TMP exploits the correlation between the reconstructed pixels in the causal neighborhood of the target PU (usually an inverse-L-shaped region termed the template) and those in the reconstructed reference frames. The motion search of TMP is quite similar to that of inter prediction except for its inputs. Since the original pixels of the target PU are not available at the decoder side, the template is used as if it were the original pixels for motion search. The MV derived from TMP is thus determined by minimizing the prediction error of pixel intensity between the current template and each template predictor lying in the search range. Then, the MV is directly used for predicting the intensities of the target PU.
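The template search described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: integer-pel motion, a full search, and the sum of absolute differences (SAD) as the matching error; the function names and the default template width and search range are illustrative and are not taken from any reference software.

```python
import numpy as np

def template_mask(size, width):
    """Boolean mask of an inverse-L template around a size x size block.

    The mask spans a (size + width) x (size + width) window whose
    bottom-right size x size corner is the (unavailable) target block.
    """
    mask = np.ones((size + width, size + width), dtype=bool)
    mask[width:, width:] = False
    return mask

def tmp_search(recon, ref, top, left, size, width=4, search=4):
    """Return ((dy, dx), cost) minimizing the template SAD.

    Only reconstructed pixels of the current frame are read, so a
    decoder can repeat the same search without any signaled motion.
    The caller must keep the displaced windows inside the frame.
    """
    mask = template_mask(size, width)
    y0, x0 = top - width, left - width
    n = size + width
    cur = recon[y0:y0 + n, x0:x0 + n].astype(np.int64)
    best_mv, best_cost = (0, 0), None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            cand = ref[y0 + dy:y0 + dy + n,
                       x0 + dx:x0 + dx + n].astype(np.int64)
            cost = int(np.abs(cur - cand)[mask].sum())
            if best_cost is None or cost < best_cost:
                best_mv, best_cost = (dy, dx), cost
    return best_mv, best_cost
```

The returned MV would then be applied to the target block itself, which is the step that makes the template act as a stand-in for the unavailable original pixels.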


Figure 2.3: Motion sampling and prediction error of BMC.

2.3 Motion Sampling and Reconstruction

Motion-compensated prediction (MCP) is a crucial technique for reducing temporal redundancy in state-of-the-art video compression designs such as HEVC. An insightful perspective on MCP is to formalize it as a two-stage process comprising sparse motion sampling followed by the reconstruction of temporal predictors. In the motion sampling phase, motion estimation acts as a motion sampler that takes samples of the true motion field, whereas in the reconstruction phase, the prediction pixel values are obtained from the corresponding areas in previously coded frames based on the sampled motions.

Block motion compensation (BMC) is an essential component of MCP. With respect to motion sampling, BMC uses one single motion vector as a linear estimator of the true motion field for a block of pixels. As depicted in Fig. 2.3 (a), the models of the motion field and of block motion estimation proposed in [9] indicate that the best MV estimate of BMC can be taken to be the true motion at the block center $\mathbf{s}_c$.

In Fig. 2.3 (b), empirical results also indicate that the motion sampling position with the minimum mean square error (MSE) is located at the block center. Furthermore, it has been observed that the MSE at block boundaries tends to be larger than that at block centers. This phenomenon, called motion uncertainty, produces blocking artifacts.

Likewise, the motion sampling of TMP follows the same rationale. As depicted in Fig. 2.4 (a) and (b), statistically, the best MV estimate of TMP is obtained by approximating the MV as the true motion at the centroid $\mathbf{s}_t$ of the top-left template. The behavior of the MV derived from TMP can therefore also be interpreted by the concept of motion sampling.

Figure 2.4: Motion sampling and prediction error of TMP.

On the other hand, overlapped block motion compensation (OBMC) uses more sophisticated algorithms to reconstruct the motion field without additional samples. It directly gives an LMMSE estimate of every pixel's intensity based on motion-compensated signals derived from the motion vectors sampled at nearby block centers. The OBMC window functions are designed to recognize and exploit the non-stationary structure of motion uncertainty. It is noted that OBMC produces smoother motion fields while giving a similar or lower mean squared prediction error.

2.4 Overlapped Block Motion Compensation

This section briefly reviews the basics of OBMC to aid the understanding of POBMC. In words, OBMC finds an LMMSE estimate of a pixel's intensity value $I_k(\mathbf{s})$ based on the motion-compensated signals $\{I_{k-1}(\mathbf{s}+\mathbf{v}(\mathbf{s}_i))\}_{i=1}^{L}$ derived from its nearby block MVs $\{\mathbf{v}(\mathbf{s}_i)\}_{i=1}^{L}$. From an estimation-theoretic perspective, these MVs are plausible hypotheses for its true motion, and to maximize coding efficiency, their weights $\mathbf{w} = [w_1, w_2, \ldots, w_L]^T$ are chosen to minimize the mean squared prediction error subject to the unit-gain constraint [5]:

\[
\mathbf{w}^* = \arg\min_{\mathbf{w}} \xi(\mathbf{w}) \quad \text{s.t.} \quad \sum_{i=1}^{L} w_i = 1, \tag{2.1}
\]

where

\[
\xi(\mathbf{w}) = E\Big\{ \Big( I_k(\mathbf{s}) - \sum_{i=1}^{L} w_i\, I_{k-1}(\mathbf{s}+\mathbf{v}(\mathbf{s}_i)) \Big)^2 \Big\}. \tag{2.2}
\]

Applying the Lagrangian method to (2.1) then gives

\[
\mathbf{w}^* = \mathbf{R}^{-1}\Big( \mathbf{P} - \frac{\mathbf{U}^T\mathbf{R}^{-1}\mathbf{P} - 1}{\mathbf{U}^T\mathbf{R}^{-1}\mathbf{U}}\, \mathbf{U} \Big), \tag{2.3}
\]

where $[\mathbf{R}]_{ij} = E[I_{k-1}(\mathbf{s}+\mathbf{v}(\mathbf{s}_i))\, I_{k-1}(\mathbf{s}+\mathbf{v}(\mathbf{s}_j))]$ and $[\mathbf{P}]_j = E[I_k(\mathbf{s})\, I_{k-1}(\mathbf{s}+\mathbf{v}(\mathbf{s}_j))]$ are the auto-correlation matrix and the cross-correlation vector, respectively, while $\mathbf{U}$ is a column vector with all elements equal to one [5]. Given that the underlying intensity and motion fields are stationary and that motion samples are taken on a square lattice (as is the case when an image is divided into square blocks for motion search), the optimal weights $\mathbf{w}^*$ for pixel $\mathbf{s}$ depend solely on its relative position within a block. They are often obtained using the least-squares approach, due to a lack of probabilistic models of real data.
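To make (2.3) concrete, the sketch below computes the unit-gain LMMSE weights from a given correlation matrix and vector, and shows one way sample estimates of them could be formed from training data. The function names and the data layout are assumptions for illustration only.

```python
import numpy as np

def obmc_weights(R, P):
    """Unit-gain LMMSE weights of (2.3): w = R^-1 (P - lam * U),
    with lam = (U^T R^-1 P - 1) / (U^T R^-1 U)."""
    R = np.asarray(R, dtype=np.float64)
    P = np.asarray(P, dtype=np.float64)
    U = np.ones_like(P)
    Rinv_P = np.linalg.solve(R, P)   # R^-1 P without forming the inverse
    Rinv_U = np.linalg.solve(R, U)   # R^-1 U
    lam = (U @ Rinv_P - 1.0) / (U @ Rinv_U)
    return Rinv_P - lam * Rinv_U

def estimate_correlations(hyps, target):
    """Sample estimates of R and P (hypothetical helper).

    hyps:   (num_samples, L) motion-compensated values, one column
            per hypothesis, collected at one relative pixel position.
    target: (num_samples,) true intensities at that position.
    """
    hyps = np.asarray(hyps, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    n = hyps.shape[0]
    return hyps.T @ hyps / n, hyps.T @ target / n
```

By construction the returned weights sum to one for any invertible `R`, which is the unit-gain constraint of (2.1). In a least-squares training setup, `estimate_correlations` would be run once per relative pixel position, which is why the storage cost grows quickly when many contexts must be handled.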

The concept of OBMC can be generalized to irregular motion sampling structures. Since both the auto- and cross-correlation functions are then spatially varying, the challenge lies in optimizing the weighting coefficients of each pixel associated with its nearby MVs. The least-squares solution, although feasible in theory, is impractical because storing the weighting coefficients optimized for different contexts requires too much memory. The parametric solution proposed in [10] is an alternative that tackles this problem.

2.5 Parametric OBMC

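POBMC, proposed in [10], gives a closed-form formula for the optimal weights. To do so, the authors assume signal models for the intensity and motion fields, which give a direct estimate of the optimal weights $\mathbf{w}^*$. This is accomplished by using the motion model proposed in [11], which assumes that the difference between the true motions of any two pixels, e.g., $\mathbf{s}_1$ and $\mathbf{s}_2$, has a normal distribution of the form

\[
v_x(\mathbf{s}_1) - v_x(\mathbf{s}_2) \ \text{or} \ v_y(\mathbf{s}_1) - v_y(\mathbf{s}_2) \sim \mathcal{N}\big(0, \alpha\, r^2(\mathbf{s}_1, \mathbf{s}_2)\big), \tag{2.4}
\]

where $\alpha$ is a positive number indicating the degree of motion randomness in the horizontal or vertical direction, and $r(\mathbf{s}_1, \mathbf{s}_2)$ is the $\ell_2$ distance (measured in pixels) between $\mathbf{s}_1$ and $\mathbf{s}_2$.

With the signal model in (2.4), they next determine the optimal weights $\mathbf{w}^*$ using calculus. To begin with, noting that $\sum_{i=1}^{L} w_i = 1$, they rewrite the mean squared prediction error $\xi(\mathbf{w})$ in (2.2) as

\[
\xi(\mathbf{w}) = E\Big\{ \Big( \sum_{i=1}^{L} w_i\, d(\mathbf{s}; \mathbf{v}(\mathbf{s}_i)) \Big)^2 \Big\}, \tag{2.5}
\]

where $d(\mathbf{s}; \mathbf{v}(\mathbf{s}_i)) = I_k(\mathbf{s}) - I_{k-1}(\mathbf{s}+\mathbf{v}(\mathbf{s}_i))$ denotes the residual signal when $I_k(\mathbf{s})$ is predicted from the motion-compensated signal $I_{k-1}(\mathbf{s}+\mathbf{v}(\mathbf{s}_i))$ using the MV for block $i$, $\mathbf{v}(\mathbf{s}_i)$.

To continue, they borrow a result from [11], which shows that if (2.4) is valid, then $E\{d^2(\mathbf{s}; \mathbf{v}(\mathbf{s}_i))\}$ has a closed-form formula given by

\[
E\{d^2(\mathbf{s}; \mathbf{v}(\mathbf{s}_i))\} = E\{(I_{k-1}(\mathbf{s}+\mathbf{v}(\mathbf{s})) - I_{k-1}(\mathbf{s}+\mathbf{v}(\mathbf{s}_i)))^2\} = \epsilon\, r^2(\mathbf{s}, \mathbf{s}_i), \tag{2.6}
\]

where $\epsilon$ is a constant indicating the joint randomness of the motion and intensity fields; $I_k(\mathbf{s}) = I_{k-1}(\mathbf{s}+\mathbf{v}(\mathbf{s}))$ with $\mathbf{v}(\mathbf{s})$ denoting the true motion of pixel $\mathbf{s}$; and the block MV $\mathbf{v}(\mathbf{s}_i)$ is approximated as the motion associated with the block center $\mathbf{s}_i$. What remains to be determined is the non-diagonal terms, i.e., $E\{d(\mathbf{s}; \mathbf{v}(\mathbf{s}_i))\, d(\mathbf{s}; \mathbf{v}(\mathbf{s}_j))\}$, $i \neq j$. Under some mild conditions, they assume that the prediction errors $\{d(\mathbf{s}; \mathbf{v}(\mathbf{s}_i))\}_{i=1}^{L}$ are mutually uncorrelated, so that (2.5) reduces to

\[
\xi(\mathbf{w}) = \epsilon \sum_{i=1}^{L} w_i^2\, r^2(\mathbf{s}, \mathbf{s}_i). \tag{2.7}
\]

Upon setting the gradient of $\xi(\mathbf{w})$ with respect to $\mathbf{w}$ to 0 under the unit-gain constraint, the optimal weights $\mathbf{w}^*$ become

\[
\mathbf{w}^* = \Big( \sum_{i=1}^{L} \frac{1}{r^2(\mathbf{s}, \mathbf{s}_i)} \Big)^{-1} \Big[ \frac{1}{r^2(\mathbf{s}, \mathbf{s}_1)}, \frac{1}{r^2(\mathbf{s}, \mathbf{s}_2)}, \ldots, \frac{1}{r^2(\mathbf{s}, \mathbf{s}_L)} \Big]^T. \tag{2.8}
\]

The significance of this result is that it requires only the geometric relations between pixel $\mathbf{s}$ and its nearby block centers $\{\mathbf{s}_i\}_{i=1}^{L}$ to obtain $\{w_i^*\}_{i=1}^{L}$.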
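Because (2.8) depends only on distances, it can be evaluated directly. The following Python sketch does so for one pixel; the function name and the handling of a pixel that coincides with a sampling position (all weight goes to that position, which is the limit of the formula) are assumptions added for illustration.

```python
import numpy as np

def pobmc_weights(pixel, centers):
    """Weights of (2.8): proportional to 1 / r^2(s, s_i) and
    normalized so that they sum to one.

    pixel:   (row, col) position s.
    centers: sequence of (row, col) motion sampling positions s_i.
    """
    pixel = np.asarray(pixel, dtype=np.float64)
    centers = np.asarray(centers, dtype=np.float64)
    r2 = np.sum((centers - pixel) ** 2, axis=1)
    if np.any(r2 == 0.0):
        # Assumption: a pixel lying exactly on a sampling position
        # takes its prediction entirely from that position.
        w = (r2 == 0.0).astype(np.float64)
        return w / w.sum()
    inv = 1.0 / r2
    return inv / inv.sum()
```

As an example, a pixel at distance 1 from one sampling position and distance 2 from another receives the weights 0.8 and 0.2: the nearer motion sample is trusted four times as much because the expected squared error in (2.6) grows with the squared distance.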


CHAPTER 3 Combining Template and Block Motion

Compensations

3.1 Concept of Operation

Figure 3.1 depicts the basic concept of the proposed scheme. Like the conventional bi-prediction, it predicts a target PU based on two predictors. These predictors, however, are weighted in a pixel-adaptive manner using POBMC [10], with one of them derived from an MV $\mathbf{v}_t$ found by TMP [1][2][3] and the other from the usual motion compensation. Since $\mathbf{v}_t$ can be inferred on the decoder side, this scheme has to signal motion parameters for only one block MV (denoted as $\mathbf{v}_b$). Additionally, we restrict $\mathbf{v}_b$ to uni-directional prediction in order to reduce the motion cost needed for bi-prediction.

Figure 3.1: Joint application of TMP and BMC.

It should be noted that we restrict different PUs of the same size to share identical window functions. As a result, $\mathbf{v}_b$ is defined as in (1.1).

One way to minimize (1.1) is the least-squares method, which leads to an under-determined problem because a distinct solution has to be sought for each possible context. Although the least-squares method is feasible and optimal in theory, it is impractical because the training process is too time-consuming.

Instead of the least-squares method, we resort to the parametric framework in Section 2.5. To proceed, we start with an exploration of its average behavior. According to the motion sampling positions in a target PU $B$, $\mathbf{v}_t$ approximates the true motion $\mathbf{v}(\mathbf{s}_t)$ of the pixel $\mathbf{s}_t$ at the template centroid [12]. However, we avoid making the same approximation for $\mathbf{v}_b$ because the search criterion is no longer to minimize the sum of squared prediction errors¹ (cf. (1.1)). Instead, $\mathbf{v}_b$ is approximated as the true motion of some unknown pixel $\mathbf{s}_b$ in $B$. The problem of determining $w_b(\mathbf{s})$ then becomes the search for an optimal sampling position $\mathbf{s}_b \in B$ that minimizes the sum of mean squared prediction errors (SMSE) over $B$:

\[
E\Big\{ \sum_{\mathbf{s}\in B} \big( I_k(\mathbf{s}) - w_t(\mathbf{s})\, I_{k-1}(\mathbf{s}+\mathbf{v}_t) - w_b(\mathbf{s})\, I_{k-1}(\mathbf{s}+\mathbf{v}(\mathbf{s}_b)) \big)^2 \Big\}. \tag{3.1}
\]

¹ A block MV approximates the true pixel motion at the block center only if its search criterion is to minimize the sum of squared prediction errors.

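To compute the expectation in (3.1), we replace $I_k(\mathbf{s}) - I_{k-1}(\mathbf{s}+\mathbf{v}(\mathbf{s}_i))$ with $d(\mathbf{s}; \mathbf{v}(\mathbf{s}_i))$, which denotes the residual signal when $I_k(\mathbf{s})$ is predicted from the motion-compensated signal $I_{k-1}(\mathbf{s}+\mathbf{v}(\mathbf{s}_i))$. Using $w_t(\mathbf{s}) + w_b(\mathbf{s}) = 1$, we then rewrite (3.1) as

\[
\sum_{\mathbf{s}\in B} E\big[ \big( w_t(\mathbf{s})\, d(\mathbf{s}; \mathbf{v}(\mathbf{s}_t)) + w_b(\mathbf{s})\, d(\mathbf{s}; \mathbf{v}(\mathbf{s}_b)) \big)^2 \big]. \tag{3.2}
\]

We assume that the two prediction errors in the target PU $B$ are uncorrelated with each other, i.e., $E[d(\mathbf{s}; \mathbf{v}(\mathbf{s}_b))\, d(\mathbf{s}; \mathbf{v}(\mathbf{s}_t))] = 0$, so that (3.2) is approximated by

\[
\sum_{\mathbf{s}\in B} w_t(\mathbf{s})^2\, E[d^2(\mathbf{s}; \mathbf{v}(\mathbf{s}_t))] + w_b(\mathbf{s})^2\, E[d^2(\mathbf{s}; \mathbf{v}(\mathbf{s}_b))]. \tag{3.3}
\]

Here we borrow the result in (2.6), that is, $E[d^2(\mathbf{s}; \mathbf{v}(\mathbf{s}_b))]$ and $E[d^2(\mathbf{s}; \mathbf{v}(\mathbf{s}_t))]$ have closed-form formulas given by $\epsilon\, r^2(\mathbf{s}, \mathbf{s}_b)$ and $\epsilon\, r^2(\mathbf{s}, \mathbf{s}_t)$, where $\epsilon$ is the constant in (2.6). Moreover, according to (2.8), we have $w_b(\mathbf{s}) = w_b^*(\mathbf{s}) = r^2(\mathbf{s}, \mathbf{s}_t) / (r^2(\mathbf{s}, \mathbf{s}_t) + r^2(\mathbf{s}, \mathbf{s}_b))$. Hence, (3.3) becomes

\[
\sum_{\mathbf{s}\in B} (w_t^*(\mathbf{s}))^2\, \epsilon\, r^2(\mathbf{s}, \mathbf{s}_t) + (w_b^*(\mathbf{s}))^2\, \epsilon\, r^2(\mathbf{s}, \mathbf{s}_b). \tag{3.4}
\]

Due to the non-linear nature of (3.4), $\mathbf{s}_b$ must be found numerically, that is, by computing the SMSE for every admissible location of $\mathbf{s}_b$. Once it is found, $w_b^*(\mathbf{s})$ and $w_t^*(\mathbf{s})$ are obtained immediately from (2.8). Then, (1.1) is reformulated as

\[
\mathbf{v}_b^* = \arg\min_{\mathbf{v}_b} \sum_{\mathbf{s}\in B} \big( I_k(\mathbf{s}) - w_t^*(\mathbf{s})\, I_{k-1}(\mathbf{s}+\mathbf{v}_t) - w_b^*(\mathbf{s})\, I_{k-1}(\mathbf{s}+\mathbf{v}_b) \big)^2. \tag{3.5}
\]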

To verify where $\mathbf{s}_b$ should be located, we take the example illustrated in Fig. 3.2 (a). For this case, Fig. 3.2 (b) plots the SMSE surface as a function of $\mathbf{s}_b$ according to (3.4). As can be seen, the SMSE value decreases as $\mathbf{s}_b$ approaches the bottom-right quarter. A more precise calculation shows that the optimal location of $\mathbf{s}_b$ lies in the bottom-right quarter, so as to minimize the prediction errors in the remaining part of $B$.

Figure 3.2: (a) Geometric relationship between the TMP centroid and $\mathbf{s}_b$. (b) SMSE surface as a function of the location of pixel $\mathbf{s}_b$.
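The exhaustive evaluation of (3.4) can be sketched as follows. With the weights of (2.8), each summand of (3.4) simplifies to $\epsilon\, r_t^2 r_b^2 / (r_t^2 + r_b^2)$, and the constant $\epsilon$ is dropped because it does not affect the minimizer. The function names, the integer candidate grid, and the way the centroid of an inverse-L template is computed are assumptions made for this illustration.

```python
import numpy as np

def inverse_l_centroid(size, width):
    """Centroid (row, col), in block coordinates, of an inverse-L
    template of the given width around a size x size block
    (assumed template geometry)."""
    ys, xs = np.mgrid[-width:size, -width:size]
    in_template = (ys < 0) | (xs < 0)
    return ys[in_template].mean(), xs[in_template].mean()

def smse_surface(size, s_t):
    """SMSE of (3.4), up to the constant factor, for every integer
    candidate position s_b inside the block."""
    rows, cols = np.mgrid[0:size, 0:size].astype(np.float64)
    r2_t = (rows - s_t[0]) ** 2 + (cols - s_t[1]) ** 2
    surface = np.empty((size, size))
    for by in range(size):
        for bx in range(size):
            r2_b = (rows - by) ** 2 + (cols - bx) ** 2
            total = r2_t + r2_b
            # Each summand equals r2_t * r2_b / (r2_t + r2_b);
            # it is taken as 0 where both distances vanish.
            term = np.divide(r2_t * r2_b, total,
                             out=np.zeros_like(total), where=total > 0)
            surface[by, bx] = term.sum()
    return surface

# Usage: locate the best sampling position for a 16 x 16 block.
s_t = inverse_l_centroid(16, 4)
surface = smse_surface(16, s_t)
best_sb = np.unravel_index(np.argmin(surface), surface.shape)
```

The minimizer `best_sb` is pushed away from the template centroid toward the bottom-right part of the block, since pixels near the template are already well served by the template motion.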

3.3 Window Functions

In this thesis, five different template designs are evaluated for 2N × 2N PUs, as described in the first column of Fig. 3.3, while the second and third columns plot the corresponding window functions $w_t^*(\mathbf{s})$ and $w_b^*(\mathbf{s})$ for $\mathbf{v}_t$ and $\mathbf{v}_b$. The window functions of some template shapes, e.g., AL, suggest a special type of geometry motion partitioning [13] with two MVs located on the diagonal running from the above-left to the bottom-right corner of a PU. AR and BL follow the same rationale. Following the same line of derivation, we can obtain the window functions for the rectangular template designs. In particular, asymmetric-like motion partitionings [13][7] result when the template region is located directly above or to the left of a target PU (cf. Fig. 3.3). Two conceptual differences, however, are to be noted. First, unlike explicit geometry or asymmetric partitions, these implicit "soft" partitions incur less motion cost (only one MV is to be signaled). Second, there is a strong interdependency between the transmitted and inferred MVs because of OBMC (cf. (1.1)).

Figure 3.3: Window functions for typical template designs. Rows: Above-Left (AL), Left (L), Above (A), Above-Right (AR). Columns: template shape, $w_t^*(\mathbf{s})$, $w_b^*(\mathbf{s})$.

CHAPTER 4 Experiments

4.1 Experimental Conditions

4.1.1 Common Test Conditions

In this chapter, the experiments are conducted with the HEVC reference software HM-3.0 under the HEVC common test conditions (JCTVC-E700 [14]). The common test conditions allow experiments to be configured in a well-defined environment and ease the comparison of their outcomes. JCTVC-E700 defines eight different test conditions, but only four of them are related to bi-directional inter-frame coding:

• Random access, high efficiency (RAHE).

• Random access, low complexity (RALC).

• Low delay, high efficiency (LDHE).

• Low delay, low complexity (LDLC).

Each test condition has a specific configuration that switches coding tools on or off, as summarized in Tab. 4.1. Our proposed scheme is tested under these conditions, and its BD-rate savings [6] are measured against the HM-3.0 anchor.
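For readers unfamiliar with the BD-rate metric [6], the sketch below shows the commonly used calculation: a cubic polynomial is fitted to the logarithm of the bit rate as a function of PSNR for each codec, both fits are integrated over the common PSNR interval, and the average difference is converted to a percentage. With four QP values per sequence, each curve has four rate-distortion points. The function name and the numbers in the usage note are illustrative assumptions, not measurements from this thesis.

```python
import numpy as np

def bd_rate(rate_anchor, psnr_anchor, rate_test, psnr_test):
    """Average bit-rate difference in percent at equal PSNR.

    Negative values mean the tested codec needs fewer bits than
    the anchor for the same quality.
    """
    log_a = np.log10(np.asarray(rate_anchor, dtype=np.float64))
    log_t = np.log10(np.asarray(rate_test, dtype=np.float64))
    psnr_a = np.asarray(psnr_anchor, dtype=np.float64)
    psnr_t = np.asarray(psnr_test, dtype=np.float64)
    # Cubic fit of log-rate as a function of PSNR for each curve.
    fit_a = np.polyfit(psnr_a, log_a, 3)
    fit_t = np.polyfit(psnr_t, log_t, 3)
    # Integrate both fits over the overlapping PSNR interval.
    lo = max(psnr_a.min(), psnr_t.min())
    hi = min(psnr_a.max(), psnr_t.max())
    int_a = np.polyint(fit_a)
    int_t = np.polyint(fit_t)
    avg_a = (np.polyval(int_a, hi) - np.polyval(int_a, lo)) / (hi - lo)
    avg_t = (np.polyval(int_t, hi) - np.polyval(int_t, lo)) / (hi - lo)
    return (10.0 ** (avg_t - avg_a) - 1.0) * 100.0
```

As a sanity check, a test codec that reaches the same PSNR values as the anchor with uniformly 10% fewer bits yields a BD-rate of about -10%, and two identical curves yield 0.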

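| Encoder Configurations | RAHE | RALC | LDHE | LDLC |
|---|---|---|---|---|
| GOP Size | 8 | 8 | 1 | 1 |
| NumOfReference | L0:2, L1:2 | L0:2, L1:2 | L0:4 | L0:4 |
| Entropy Coder | CABAC | CAVLC | CABAC | CAVLC |
| Adaptive Loop Filter (ALF) | Y | N | Y | N |
| Internal Bit Depth (IBDI) | 10 | 8 | 10 | 8 |
| QP | 22, 27, 32, 37 (all configurations) | | | |
| Sequences | 1080p, 832 × 480, 416 × 240, 720p (all configurations) | | | |
| CU Sizes | 8 × 8 to 64 × 64 (all configurations) | | | |
| Search Range | ±64 (all configurations) | | | |
| Bi-Prediction Search Range | ±4 (all configurations) | | | |
| Interpolation Filter | 8-tap DCT-IF (all configurations) | | | |

Table 4.1: Common test conditions.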

Rough estimations of complexity are performed by showing the encoding time ratio and decoding time ratio relative to HM-3.0 anchor.

4.1.2 TB-mode

Our proposed template-based bi-prediction scheme (referred to hereafter as TB-mode) is applied only to 2N × 2N PUs. Three (AL, L, and A) or five (AL, L, A, AR, and BL) template shapes are fetched from the reconstructed frame with a template width of 4. For each 2N × 2N PU, one flag switches adaptively between TB-mode and the usual inter mode. When TB-mode is chosen, at most two (three templates) or three (five templates) extra bits are coded to specify the template shape.

Moreover, in this chapter, we have two types of window functions to be evaluated on TB-mode. One is formed by the theoretical window functions that have been mentioned in Section 3.3 and the other is formed by a heuristic design of window functions. For theoretical window functions, the weighting coefficients are rounded offline into 16-bit integers. On the other hand, the weighting coefficients of the heuristic window functions are represented in 3-bit integers. To verify the performance of TB-mode, several experiments featuring different performance and complexity trade-offs are summarized in Tab. 4.2 and will be discussed in the following sections.

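Table 4.2: Experimental settings of TB-mode.

| Algo. | Template | Window Functions | TMP SrchRng | TMP-Bi SrchRng | Hypothesis | Fast Algo. |
|---|---|---|---|---|---|---|
| T3-C-UU | 3 shapes | Theoretical | ±4 | N/A | 2 | N |
| T5-C-UU | 5 shapes | Theoretical | ±4 | N/A | 2 | N |
| T5-S-UU | 5 shapes | Heuristic | ±4 | N/A | 2 | N |
| T5-S-UB | 5 shapes | Heuristic | ±4 | N/A | 3 | N |
| T5-S-BU | 5 shapes | Heuristic | ±4 | ±1 | 3 | N |
| T5-S-BB | 5 shapes | Heuristic | ±4 | ±1 | 4 | N |
| T3-S-F | 3 shapes | Heuristic | ±4 | ±1 | 4 | Y |

| PU Size | Pos | w1,1 | w1,2 | w1,3 | w1,4 | w1,5 | w2,1 | w2,2 | w2,3 | w2,4 | w2,5 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 64x64 | a | 47 | 15 | 15 | 15 | 15 | 62 | 31 | 31 | 31 | 31 |
| 64x64 | b | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 |
| 32x32 | a | 23 | 7 | 7 | 7 | 7 | 30 | 15 | 15 | 15 | 15 |
| 32x32 | b | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 |
| 16x16 | a | 11 | 3 | 3 | 3 | 3 | 14 | 7 | 7 | 7 | 7 |
| 16x16 | b | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 |
| 8x8 | a | 5 | 1 | 1 | 1 | 1 | 6 | 3 | 3 | 3 | 3 |
| 8x8 | b | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 |

Table 4.3: The start position a and b for various 2Nx2N PU sizes.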

4.2 Heuristic Window Functions

Each PU, when coded in the proposed scheme, has multiple window functions, denoted by $w_{n,m}$ with $n = 1, 2$ and $m = 1, \ldots, 5$. The parameter $n$ is explicitly signaled with one extra flag, and the value of $m$ is inferred from the choice of the template shape. The coefficients of each $w_{n,m}$ take values from either the set {0, 1, 4, 7} or the set {0, 1, 4, 6}, so that the multiplication by a floating-point number can easily be replaced by integer arithmetic. Their waveforms, illustrated in Fig. 4.1, partition a PU into four non-overlapping regions, each of which corresponds to a specific coefficient. It should be noted that the zero-valued coefficients cover over half or three-fourths of the region of a window function. Pixels in that region are not compensated by OBMC, which can effectively halve the memory bandwidth.
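The following sketch illustrates how small integer weights allow the blending to be done with integer multiplications and a shift. It assumes that the 3-bit weights are normalized by 8, that the stored coefficient applies to the template predictor, and that the block predictor receives the complement; these conventions and the function name are assumptions for illustration and may differ from the actual implementation.

```python
import numpy as np

SHIFT = 3  # assumption: 3-bit weights are normalized by 1 << 3 = 8

def blend_integer(pred_t, pred_b, w_t):
    """Blend the template and block predictors with integer weights.

    w_t holds per-pixel template weights in 0..7; the block predictor
    receives the complement 8 - w_t, so the two weights sum to 8.
    """
    pred_t = np.asarray(pred_t, dtype=np.int32)
    pred_b = np.asarray(pred_b, dtype=np.int32)
    w_t = np.asarray(w_t, dtype=np.int32)
    w_b = (1 << SHIFT) - w_t
    rounding = 1 << (SHIFT - 1)
    return (w_t * pred_t + w_b * pred_b + rounding) >> SHIFT
```

Wherever `w_t` is zero the output equals the block predictor, so the template predictor need not be fetched for those pixels, which is consistent with the bandwidth saving noted above.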

To resize a window function according to the size of the considered PU, the start point a and the width b are recorded. Tab. 4.3 lists the values of a and b for every possible 2N × 2N PU size. In view of this resizing criterion, the storage requirements for weighting coefficients are thus conspicuously reduced.


[Figure: waveforms of the window functions wn,m. Panels: wn,1 (AL), wn,2 (L), wn,3 (A), each shown for w1,m and w2,m.]


Table 4.4: BD-rate savings and processing time ratios of TB-mode with 3- and 5-shape-adaptive configurations.

Random Access      RAHE                  RALC
Algo.              T3-C-UU   T5-C-UU     T3-C-UU   T5-C-UU
S03/S05/S06        −1.2      −1.3        −1.4      −1.5
Class C            −2.0      −2.1        −1.7      −1.8
Class D            −1.9      −2.0        −1.6      −1.7
All                −1.7      −1.9        −1.6      −1.7
Enc. Time [%]      176       184         175       182
Dec. Time [%]      172       175         194       197

Low Delay          LDHE                  LDLC
Algo.              T3-C-UU   T5-C-UU     T3-C-UU   T5-C-UU
S03/S05/S06        −1.9      −2.0        −2.4      −2.6
Class C            −2.3      −2.4        −2.4      −2.5
Class D            −2.1      −2.2        −2.2      −2.4
Class E            −3.3      −3.6        −3.4      −3.6
All                −2.4      −2.5        −2.6      −2.7
Enc. Time [%]      157       166         155       163
Dec. Time [%]      220       225         269       273

4.3 Compression Performance of TB-mode

This section illustrates the compression performance and complexity of TB-mode when the MVs of both TMP and the target PU are restricted to uni-directional prediction. Both theoretical and heuristic window functions, as well as 3- and 5-shape-adaptive implementations, are evaluated.

4.3.1 Coding Efficiency versus Number of Templates

We first compare the coding efficiency of the 3- and 5-shape-adaptive implementations. Tab. 4.4 presents the average BD-rate savings of T3-C-UU and T5-C-UU. The former is the typical TB-mode with 3-shape-adaptive theoretical window functions, while the latter is the 5-shape-adaptive TB-mode with theoretical window functions. T5-C-UU evaluates two additional templates in each 2N × 2N PU at the encoder; these additional RD comparisons consistently deliver about 0.1% additional coding gain at the cost of increased encoding complexity. Since the two additional templates are small, the encoding time ratio increases by only 14.3%. Moreover, a 4% increase in decoding time is observed because of the two additional template searches performed at the decoder side.
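For readers unfamiliar with the BD-rate figures reported in the tables, the following is a minimal sketch of the commonly used Bjøntegaard calculation: fit log-rate as a cubic polynomial of PSNR for each codec, integrate both fits over the shared PSNR interval, and convert the mean difference into a percentage. The thesis does not state which implementation it used, so the function `bd_rate` and the sample rate-distortion points are assumptions for illustration only.

```python
import numpy as np

def bd_rate(rates_ref, psnr_ref, rates_test, psnr_test):
    """Average bitrate difference [%] of the test codec at equal PSNR.

    Negative values mean the test codec needs fewer bits (a saving).
    """
    log_ref = np.log(np.asarray(rates_ref, dtype=float))
    log_test = np.log(np.asarray(rates_test, dtype=float))
    # Fit log-rate as a cubic polynomial of PSNR for each codec.
    poly_ref = np.polyfit(psnr_ref, log_ref, 3)
    poly_test = np.polyfit(psnr_test, log_test, 3)
    # Integrate both fits over the overlapping PSNR interval.
    lo = max(min(psnr_ref), min(psnr_test))
    hi = min(max(psnr_ref), max(psnr_test))
    int_ref = np.polyint(poly_ref)
    int_test = np.polyint(poly_test)
    area_ref = np.polyval(int_ref, hi) - np.polyval(int_ref, lo)
    area_test = np.polyval(int_test, hi) - np.polyval(int_test, lo)
    avg_log_diff = (area_test - area_ref) / (hi - lo)
    return (np.exp(avg_log_diff) - 1.0) * 100.0

# Hypothetical RD points: the test codec halves the rate at every PSNR.
psnr = [32.0, 35.0, 38.0, 41.0]
ref_rates = [1000.0, 2000.0, 4000.0, 8000.0]
saving = bd_rate(ref_rates, psnr, [r / 2 for r in ref_rates], psnr)
```

In this constructed example the result is close to −50, since the test rates are exactly half of the reference rates.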


Table 4.5: BD-rate savings and processing time ratios of TB-mode with theoretical and heuristic window functions.

Random Access      RAHE                  RALC
Algo.              T5-C-UU   T5-S-UU     T5-C-UU   T5-S-UU
S03/S05/S06        −1.3      −1.3        −1.5      −1.4
Class C            −2.1      −2.2        −1.8      −2.0
Class D            −2.0      −2.3        −1.7      −2.0
All                −1.9      −2.0        −1.7      −1.8
Enc. Time [%]      184       207         182       210
Dec. Time [%]      175       166         197       186

Low Delay          LDHE                  LDLC
Algo.              T5-C-UU   T5-S-UU     T5-C-UU   T5-S-UU
S03/S05/S06        −2.0      −2.0        −2.6      −2.1
Class C            −2.4      −2.6        −2.5      −2.7
Class D            −2.2      −2.6        −2.4      −2.7
Class E            −3.6      −3.5        −3.6      −3.2
All                −2.5      −2.6        −2.7      −2.6
Enc. Time [%]      166       186         163       186
Dec. Time [%]      225       216         273       262

4.3.2 Theoretical versus Heuristic Window Functions

Here we compare the theoretical and heuristic window function designs, denoted by T5-C-UU and T5-S-UU in Tab. 4.5. The results reveal that the additional set of heuristic window functions not only compensates for the coding loss caused by simplifying the weighting coefficients, but also slightly increases the coding gain, by 0.1% on average. As for encoding time, although each set of heuristic window functions requires less computation than a theoretical one, the extra set of RD comparisons still increases the encoding time ratio by about 24%. With regard to decoding complexity, since zero weighting coefficients reduce the computation needed for OBMC, the decoding time drops by about 10%.
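The time figures quoted above can be reproduced from Tab. 4.5 with simple arithmetic. The sketch below assumes that they are differences of time ratios in percentage points, averaged over the four configurations RAHE, RALC, LDHE, and LDLC; this reading is an interpretation, and the helper `mean_difference` is hypothetical.

```python
# Time ratios [%] from Tab. 4.5, ordered RAHE, RALC, LDHE, LDLC.
enc_time = {"T5-C-UU": (184, 182, 166, 163), "T5-S-UU": (207, 210, 186, 186)}
dec_time = {"T5-C-UU": (175, 197, 225, 273), "T5-S-UU": (166, 186, 216, 262)}

def mean_difference(table, test, ref):
    """Mean of (test - ref) over the four configurations, in percentage points."""
    diffs = [t - r for t, r in zip(table[test], table[ref])]
    return sum(diffs) / len(diffs)

enc_delta = mean_difference(enc_time, "T5-S-UU", "T5-C-UU")  # 23.5
dec_delta = mean_difference(dec_time, "T5-S-UU", "T5-C-UU")  # -10.0
```

Under this reading the encoder ratio rises by 23.5 points and the decoder ratio falls by 10 points, in line with the rounded values in the text.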

4.4 Multiple Hypotheses

In this section, we discuss the effect on coding efficiency when multiple hypotheses are enabled for template and block motions. The following multiple-hypothesis experiments are tested:


Table 4.6: BD-rate savings and processing time ratios of enabling multiple-hypotheses.

Random Access      RAHE                                      RALC
Algo.              T5-S-UU   T5-S-UB   T5-S-BU   T5-S-BB     T5-S-UU   T5-S-UB   T5-S-BU   T5-S-BB
S03/S05/S06        −1.3      −1.4      −1.4      −1.9        −1.4      −1.7      −1.6      −2.2
Class C            −2.2      −2.4      −2.5      −2.8        −2.0      −2.3      −2.2      −2.6
Class D            −2.3      −2.4      −2.5      −2.8        −2.0      −2.3      −2.2      −2.6
All                −2.0      −2.1      −2.2      −2.5        −1.8      −2.1      −2.0      −2.5
Enc. Time [%]      207       285       250       301         210       297       259       302
Dec. Time [%]      166       169       239       262         186       197       281       327

Low Delay          LDHE                                      LDLC
Algo.              T5-S-UU   T5-S-UB   T5-S-BU   T5-S-BB     T5-S-UU   T5-S-UB   T5-S-BU   T5-S-BB
S03/S05/S06        −2.0      −2.0      −2.2      −2.3        −2.1      −2.3      −2.4      −2.6
Class C            −2.6      −2.7      −2.8      −3.0        −2.7      −2.8      −2.9      −3.1
Class D            −2.6      −2.9      −2.9      −3.2        −2.7      −3.0      −2.9      −3.2
Class E            −3.5      −3.4      −3.7      −3.8        −3.2      −3.5      −3.6      −4.0
All                −2.6      −2.8      −2.9      −3.1        −2.6      −2.9      −2.9      −3.2
Enc. Time [%]      186       269       250       253         186       280       257       263
Dec. Time [%]      216       216       368       375         262       269       464       485

• T5-S-UB: Experiment of enabling bi-prediction for block motions.

• T5-S-BU: Experiment of enabling bi-prediction for template motions.

• T5-S-BB: Experiment of enabling bi-prediction for both template and block motions.

On average, T5-S-UB improves the BD-rate saving by 0.2% over T5-S-UU with a 5% increase in decoding time; enabling bi-prediction in the block motion search of the target PU has almost no effect on decoding complexity. T5-S-BU shows a benefit similar to that of T5-S-UB, with a 0.3% improvement in BD-rate saving. Nevertheless, since the bi-prediction of TMP is also performed at the decoder side, the decoding time increases dramatically, by about 136% on average. On the other hand, it is interesting that the encoding time increment of T5-S-UB is 30% higher than that of T5-S-BU. The reason is that the template sizes and the template matching range are generally smaller than the target PU sizes and the block motion search range.
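To make the decoder-side cost of TMP concrete, the following is a minimal sketch of an integer-pel template search over the ±4 range listed in Tab. 4.2. The L-shaped template of thickness 2, the SAD criterion, and the names `template_match`, `thick`, and `search_range` are assumptions for illustration; the actual template shapes and matching criterion are defined elsewhere in the thesis.

```python
import numpy as np

def template_match(cur, ref, x, y, size, thick=2, search_range=4):
    """Return the (dx, dy) that minimizes the SAD of an L-shaped template.

    cur: reconstructed current frame; ref: reference frame (2-D arrays).
    (x, y) is the top-left corner of a size x size PU. The template is
    the strip of `thick` pixels above and to the left of the PU (assumed).
    """
    def template(frame, px, py):
        above = frame[py - thick:py, px - thick:px + size]
        left = frame[py:py + size, px - thick:px]
        return np.concatenate([above.ravel(), left.ravel()]).astype(np.int64)

    target = template(cur, x, y)
    height, width = ref.shape
    best, best_sad = (0, 0), None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            px, py = x + dx, y + dy
            # Skip candidates whose template or block leaves the frame.
            if px - thick < 0 or py - thick < 0:
                continue
            if px + size > width or py + size > height:
                continue
            sad = int(np.abs(template(ref, px, py) - target).sum())
            if best_sad is None or sad < best_sad:
                best, best_sad = (dx, dy), sad
    return best

# Synthetic check: the current frame is the reference shifted by a known offset.
gen = np.random.default_rng(0)
ref_frame = gen.integers(0, 256, size=(32, 32))
cur_frame = np.roll(ref_frame, shift=(2, -1), axis=(0, 1))
motion = template_match(cur_frame, ref_frame, x=12, y=12, size=8)
```

Every candidate position requires a full template comparison, and the decoder must repeat the whole search for each TB-mode PU; a second hypothesis adds a further search, which is consistent with the decoding time increase reported above.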

T5-S-BB enables bi-prediction for both template and block motion search, and reaches an average BD-rate saving of 2.9% and a maximum of 5.2%. Although T5-S-BB achieves the best performance among all the TB-mode configurations, the significantly increased encoding and decoding times make this scheme less practical. As a result, a further reduction in TB-mode complexity is necessary.


4.5 Fast Algorithm

As concluded in the previous section, the complexity of TB-mode must be reduced. Several enhancements are therefore introduced to decrease the time complexity while retaining moderate coding gains. We start from T5-S-BB, which has the most promising coding gains but also the longest runtimes among all the TB-mode experiments. As summarized below, four major enhancements are applied to speed up TB-mode:

• Reduce the number of template shapes moderately.

• Apply a fast mode decision that skips TB-mode when SKIP mode has the lowest RD cost among all the other modes.

• Limit the number of reference frames to be searched.

• Use bilinear filter for sub-pel interpolation during the TMP process.

4.5.1 Enhancements for Encoder Only

The major contributor to encoding time is the extra mode decision process, that is, the R-D comparisons of TB-mode. Since TB-mode is an additional prediction mode, decreasing the number of TB-mode evaluations is one way to reduce the encoding time. According to our observation, the area coded in SKIP mode changes only slightly before and after TB-mode is applied. This observation implies that the encoder is unlikely to choose TB-mode when SKIP mode is the best candidate. As a result, if the best mode is SKIP, we bypass the TB-mode evaluations. Moreover, since the 3-shape-adaptive TB-mode loses only negligible coding gain compared with the 5-shape-adaptive one, we reduce the number of template shapes to be tested to further reduce the encoding time.
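The SKIP-based early termination can be sketched as follows. The mode names, the dictionary of RD costs, and the callable `evaluate_tb_mode` are hypothetical stand-ins for the encoder's internal mode decision, which is not described at this level of detail in the text.

```python
def choose_mode(rd_costs, evaluate_tb_mode):
    """Pick the best mode, evaluating TB-mode only when SKIP is not the best.

    rd_costs: dict mapping conventional mode names to their RD costs.
    evaluate_tb_mode: callable returning the RD cost of TB-mode; it is
    the expensive step that the fast decision tries to avoid.
    """
    best_mode = min(rd_costs, key=rd_costs.get)
    if best_mode == "SKIP":
        # SKIP already wins among the conventional modes: bypass TB-mode.
        return best_mode, rd_costs[best_mode]
    tb_cost = evaluate_tb_mode()
    if tb_cost < rd_costs[best_mode]:
        return "TB", tb_cost
    return best_mode, rd_costs[best_mode]

# Usage: count how often the costly TB-mode evaluation is actually invoked.
calls = []

def fake_tb_cost():
    calls.append(1)
    return 5.0

skip_case = choose_mode({"SKIP": 10.0, "INTER_2Nx2N": 12.0}, fake_tb_cost)
inter_case = choose_mode({"SKIP": 10.0, "INTER_2Nx2N": 8.0}, fake_tb_cost)
```

In the first call the TB-mode evaluation is never triggered, which is where the encoding time saving comes from.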

4.5.2 Enhancements for Encoder and Decoder

TMP performs its motion search on both the encoder and decoder sides, which has a great impact on TB-mode complexity. To diminish the motion search cost caused by TMP, two approaches are taken. The first is to reduce the number of reference frames


Table 4.7: BD-rate savings and processing time ratios of TB-mode after applying fast algorithms.

Random Access      RAHE                  RALC
Algo.              T3-S-F    T5-S-BB     T3-S-F    T5-S-BB
S03/S05/S06        −1.6      −1.9        −1.8      −2.2
Class C            −2.3      −2.8        −2.1      −2.6
Class D            −2.1      −2.8        −2.0      −2.6
All                −2.0      −2.5        −2.0      −2.5
Enc. Time [%]      149       301         153       302
Dec. Time [%]      124       262         136       327

Low Delay          LDHE                  LDLC
Algo.              T3-S-F    T5-S-BB     T3-S-F    T5-S-BB
S03/S05/S06        −1.7      −2.3        −1.9      −2.6
Class C            −2.3      −3.0        −2.3      −3.1
Class D            −2.5      −3.2        −2.5      −3.2
Class E            −2.9      −3.8        −3.1      −4.0
All                −2.5      −3.1        −2.5      −3.2
Enc. Time [%]      142       253         144       263
Dec. Time [%]      128       375         144       485

TB-mode. These reference frames are derived from the reference indices used in the 2N × 2N MRG mode during the mode decision process. If only one reference index is available, or if the reference indices are duplicated, an additional reference frame with the lowest QP in the GOP structure is considered as another candidate to be evaluated in TB-mode.
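The candidate selection described above might look like the sketch below. It assumes that at most two reference frames are kept and that the lowest-QP frame is chosen among frames not already selected; both points, along with the names `select_tmp_reference_frames`, `merge_ref_indices`, and `frame_qps`, are assumptions rather than details given in the text.

```python
def select_tmp_reference_frames(merge_ref_indices, frame_qps, max_refs=2):
    """Choose reference frames to be searched by TMP in TB-mode.

    merge_ref_indices: reference indices used by the 2Nx2N MRG candidates.
    frame_qps: dict mapping each available reference index to its QP.
    max_refs: assumed cap on the number of frames searched.
    """
    selected = []
    for idx in merge_ref_indices:
        if idx not in selected:  # drop duplicated reference indices
            selected.append(idx)
    selected = selected[:max_refs]
    if len(selected) < max_refs:
        # Only one distinct index: add the lowest-QP frame not yet chosen.
        remaining = [idx for idx in frame_qps if idx not in selected]
        if remaining:
            selected.append(min(remaining, key=frame_qps.get))
    return selected

# Usage with hypothetical QPs for three reference frames.
qps = {0: 30, 1: 27, 2: 32}
duplicated = select_tmp_reference_frames([0, 0], qps)  # [0, 1]
distinct = select_tmp_reference_frames([1, 2], qps)    # [1, 2]
```

When the merge candidates already point to two different frames, no extra frame is added.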

For the second approach, we revise the interpolation filter of the TMP fractional-pel motion search by interpolating the reference PU with a bilinear filter. The bilinear filter brings a conspicuous complexity reduction; however, it also generates poorer template motions, resulting in a coding loss. Fortunately, this inefficiency can be partially compensated by the other MVs (that is, the block motions) in TB-mode.
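A bilinear filter needs only the four integer samples surrounding a fractional position, which is why it is much cheaper than a longer separable interpolation filter. The sketch below samples a frame at a quarter-pel position; the quarter-pel precision, the rounding, the border clamping, and the name `bilinear_sample` are assumptions for illustration.

```python
import numpy as np

def bilinear_sample(frame, x_q, y_q, frac_bits=2):
    """Sample `frame` at a fractional position given in 1/4-pel units.

    x_q, y_q: non-negative coordinates in units of 1 / 2**frac_bits pixel.
    Uses only the four neighbouring integer samples.
    """
    scale = 1 << frac_bits
    x0, fx = x_q >> frac_bits, x_q & (scale - 1)
    y0, fy = y_q >> frac_bits, y_q & (scale - 1)
    height, width = frame.shape
    # Clamp the right and bottom neighbours at the frame border.
    x1, y1 = min(x0 + 1, width - 1), min(y0 + 1, height - 1)
    a, b = int(frame[y0, x0]), int(frame[y0, x1])
    c, d = int(frame[y1, x0]), int(frame[y1, x1])
    top = (scale - fx) * a + fx * b
    bottom = (scale - fx) * c + fx * d
    total = scale * scale
    return ((scale - fy) * top + fy * bottom + total // 2) // total

# Usage on a tiny 2 x 2 frame.
tiny = np.array([[10, 20], [30, 40]])
integer_pel = bilinear_sample(tiny, 0, 0)   # top-left sample itself
half_pel_h = bilinear_sample(tiny, 2, 0)    # halfway between 10 and 20
half_pel_hv = bilinear_sample(tiny, 2, 2)   # centre of the four samples
```

The short filter support is also what makes the resulting template motions less accurate, as noted above.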

4.5.3 Summary

Experiment T3-S-F shows the performance of TB-mode after applying the enhancements introduced in this section. As shown in Tab. 4.7, T3-S-F has a moderate-to-significant average BD-rate saving of 2.2%, with a minimum of 1.1% and a maximum of 4.1% over all test cases. Although T3-S-F has an average coding loss of 0.7% compared with T5-S-BB, its reductions of 131% in encoding time and 237% in decoding time are still impressive.


CHAPTER 5 Conclusion

In this thesis, we propose a bi-prediction scheme that combines predictors found by template and block motions using parametric OBMC window functions. Since the template motion is inferred on the decoder side, the scheme requires only the motion overhead of uni-directional prediction. To optimize the motion parameters to be signaled, the motion search criterion is modified to reflect the interdependency between vb and vt.

The choice of window function is based on the inferred MV constellation, which brings better adaptation and prediction efficiency. According to the experimental results, a promising coding gain (2.9%) comes at the cost of a significant increase in both the encoding and decoding times. As a result, several modifications are made to strike a better balance between performance and complexity. After applying those modifications, the best scheme shows moderate-to-significant coding gains (2.2%) with reasonable complexity increments (46% and 33%). This result shows that it is possible to keep


decoded MVs from neighboring PUs. In this manner, the need to perform TMP is waived at the cost of extra bits. We shall continue these investigations in our future work.


Bibliography

[1] K. Sugimoto et al., “Inter Frame Coding with Template Matching Spatio-Temporal Prediction,” Proc. Int. Conf. Image Processing, 2004.

[2] Y. Suzuki et al., “Inter Frame Coding with Template Matching Averaging,” Proc. Int. Conf. Image Processing, 2007.

[3] S. Kamp et al., “Decoder Side Motion Vector Derivation for Inter Frame Video Coding,” Proc. Int. Conf. Image Processing, 2008.

[4] S. Nogaki and M. Ohta, “An Overlapped Block Motion Compensation for High Quality Motion Picture Coding,” Proc. IEEE Int. Symp. Circuits and Systems, pp. 184–187, May 1992.

[5] M. T. Orchard and G. J. Sullivan, “Overlapped Block Motion Compensation: An Estimation-Theoretic Approach,” IEEE Trans. on Image Processing, vol. 3, pp. 693–699, May 1994.


[8] M. Winken et al., “Description of Video Coding Technology Proposal by Fraunhofer HHI,” JCTVC-A116, Apr. 2010.

[9] B. Tao and M. Orchard, “A Parametric Solution for Optimal Overlapped Block Motion Compensation,” IEEE Trans. on Image Processing, vol. 10, pp. 341–350, Mar. 2001.

[10] Y. W. Chen and W. H. Peng, “Parametric OBMC for Pixel-Adaptive Temporal Prediction on Irregular Motion Sampling Grids,” IEEE CSVT, 2011.

[11] W. Zheng et al., “Analysis of Space-dependent Characteristics of Motion-compensated Frame Differences based on a Statistical Motion Distribution Model,” IEEE Trans. on Image Processing, vol. 11, pp. 377–386, Mar. 2002.

[12] T.-W. Wang et al., “Analysis of Template Matching Prediction and its Application to Parametric Overlapped Block Motion Compensation,” IEEE ISCAS, 2010.

[13] M. Karczewicz et al., “Video Coding Technology Proposal by Qualcomm Inc.,” JCTVC-A121, Apr. 2010.

[14] F. Bossen, “Common Test Conditions and Software Reference Configurations,” JCTVC-E700, Mar. 2011.
