
Department of Electrical and Control Engineering

Doctoral Dissertation

以內容特徵為基礎之運動向量估測演算法及架構研究

On Study of Content-Based ME Algorithms and Architectures

Student: Hsien-Wen Cheng (鄭顯文)

Advisor: Dr. Lan-Rong Dung (董蘭榮)


以內容特徵為基礎之運動向量估測演算法及架構研究

On Study of Content-Based ME Algorithms and Architectures

Student: Hsien-Wen Cheng (鄭顯文)

Advisor: Lan-Rong Dung (董蘭榮)

A Thesis
Submitted to the Department of Electrical and Control Engineering,
College of Electrical Engineering and Computer Science,
National Chiao Tung University,
in Partial Fulfillment of the Requirements
for the Degree of Doctor of Philosophy
in Electrical and Control Engineering

July 2005
Hsinchu, Taiwan, Republic of China


Contents

Abstract (Chinese)
Abstract (English)
Acknowledgement

1 Overview
  1.1 Background
    1.1.1 Video Coding System
    1.1.2 Motion Estimation
  1.2 Objectives
  1.3 Organization of this Dissertation

2 Related Works
  2.1 Full Search Block Matching Algorithm
  2.2 Fast Search Algorithm
    2.2.1 Reduce the Searching Steps
    2.2.2 Simplifying the Matching Criterion
    2.2.3 Two-Phase Algorithm

3 Edge-driven Two-Phase Motion Estimation
  3.1 Introduction
  3.2 Algorithm
    3.2.1 Edge-matching phase
    3.2.2 Block-matching phase
  3.3 Architecture
    3.3.1 The First Phase
    3.3.2 The Second Phase
  3.4 Performance Analysis
  3.5 Brief Summary

4 Power-Aware Algorithm and Architecture
  4.1 Motivation
  4.2 Battery Properties
  4.3 Generic Subsample Algorithm
  4.4 Content-Based Subsample Algorithm
    4.4.1 Gradient Filter
    4.4.2 Edge Determination
    4.4.3 Adaptive Control Mechanism
    4.4.4 Matching Step
  4.5 Results
  4.6 Power Aware Architecture
    4.6.1 Architecture
    4.6.2 Implementation Results
  4.7 Performance Analysis
    4.7.2 Results
  4.8 Summary


List of Figures

1-1 Main processing flow in JPEG, MPEG, H.261, and H.263 encoding.
2-1 Block matching motion estimation process.
2-2 The Three Step Search.
2-3 The New Three Step Search.
2-4 The Four Step Search.
2-5 The Diamond Search.
3-1 Flow chart of the EFBLA algorithm.
3-2 The reusability of quantized data in EFBLA.
3-3 Two scan directions employed in EFBLA.
3-4 Block diagram of the edge-driven two-phase motion estimation.
3-5 Architecture of the Edge Generator Unit.
3-6 Architecture of the Shift Register Array and Low-Resolution Quantization.
3-7 Architecture of the Processing Element Array.
3-8 Architecture of a Processing Element.
3-9 Architecture of the UEPC Adder Tree and SMVs Selector, assuming N is 16.
3-10 Architecture of the second phase.
3-11 Execution of the Accumulator Cells Array in the condition of N = 8 and p = 8.
3-12 MAD curves of FS, LRQ and EFBLA for four clips: (a) Akiyo, (b) Children, (c) Stefan, (d) Weather.
4-1 System block diagram of a portable, battery-powered multimedia device.
4-2 Non-linear discharging properties of a battery: (a) rate capacity effect, (b) recovery effect.
4-3 The subsample mask of the generic subsample rate 8-to-6.
4-4 Flow chart of the content-based subsample algorithm.
4-5 The content-based subsample algorithm.
4-6 The components of a content-based subsample mask (CSM).
4-7 Block diagram of the edge-determination unit with adaptive control mechanism.
4-8 Response time of four clips: (a) Dancer, (b) News, (c) Paris, (d) Weather.
4-9 Quality degradation curves of four clips: (a) Dancer, (b) Hall Monitor, (c) News, (d) Paris.
4-10 Average quality degradation curve of 21 test clips.
4-11 The 26th frame of the Dancer clip: (a) GSA with 8-to-3 subsample rate, (b) residual of motion compensation by GSA, (c) CSA with 8-to-3 subsample rate, (d) residual of motion compensation by CSA.
4-12 The 18th frame of the Table-Tennis clip: (a) GSA with 8-to-3 subsample rate, (b) residual of motion compensation by GSA, (c) CSA with 8-to-3 subsample rate, (d) residual of motion compensation by CSA.
4-13 Block diagram of the power-aware ME architecture driven by the content-based subsample algorithm.
4-14 Architecture of the PE array and RMB buffer.
4-15 Structure of a PE.
4-16 Architecture of the high-pass filter.
4-17 Architecture of the Sobel filter.
4-18 Architecture of the morphological gradient filter.
4-19 Structure of the CSM generator.
4-20 Execution phases of the power-aware architecture.
4-21 Power switching curves of four clips: (a) Dancer, (b) News, (c) Paris, (d) Waterfall.


List of Tables

3.I Quality degradation analysis for different video clips.
3.II Computational load analysis for different video clips.
4.I Analysis of the effect of the edge threshold parameter m1 on subsample pixels.
4.II Average stationary error for 21 video clips with Kp = 0.2.
4.III Analysis of the effect of the controlled edge threshold parameter on subsample pixels.
4.IV Quality performance (PSNR) of GSA for different video clips.
4.V Quality performance (PSNR) of ACSA with high-pass filter for different video clips.
4.VI Implementation of the power-aware architecture.
4.VII Power analysis of the power-aware architecture.


Abstract(English)

The major objective of this thesis is to apply content-based approaches to motion estimation algorithms and architectures. Motion estimation (ME) has been proven effective at exploiting the temporal redundancy of video sequences and has therefore become a key component of many multimedia standards, such as the MPEG-X and H.26X standards. In such multimedia systems, motion estimation dominates the computation load and tends to consume much power, which has become a significant problem. To solve it, developing fast searching algorithms and power-aware architectures is a most important issue for such video systems, especially for portable video devices powered by batteries. Although a great deal of effort has been made in this field, work that considers the content of the video source in motion estimation applications still seems to be lacking. In this thesis, we adopt a content-based methodology to meet the requirements of fast-searching ME and power-aware ME for such portable video devices. This thesis proposes an edge-driven two-phase ME algorithm, based on the content of video sources, to reduce the computation load of the matching procedure, and a content-based power-aware algorithm which adaptively subsamples only the background pixels to perform graceful trade-offs between quality degradation and power consumption. By employing the content-based methodology, the proposed algorithms, whether for fast searching or power awareness, achieve better results than their non-content-based counterparts.

In the proposed two-phase motion estimation, matching the low-resolution quantized edge pixels of a macro-block is used in the first phase. According to the edge-pixel span, the algorithm decides the suitable search scan direction to reuse the quantized data more efficiently. It then generates the survived motion vectors for the second phase, which employs the SAD as the error criterion to perform accurate matching. This content-driven algorithm significantly reduces the computational load compared with the full-search algorithm and is still more efficient than the existing two-phase algorithm. The content-based power-aware algorithm performs the power-aware function by disabling/enabling processing elements according to a subsample mask based on the content of the video sources. The power-aware approach extracts the edge pixels of a macro-block and subsamples only the non-edge pixels to maintain the quality performance at an acceptable level. Since the power consumption is proportional to the subsample rate, this content-based algorithm adopts a closed-loop control mechanism to avoid the divergence of the subsample rate over various video sources and hence keeps the subsample rate in a stationary state. Founded on the proposed content-based algorithm, the power-aware architecture can dynamically operate at different power consumption modes, with little quality degradation, according to the remaining capacity of the battery pack, achieving a better battery discharging property.

Motivated by the applications of the content methodology, this thesis proposes a fast algorithm and a power-aware algorithm and implements the corresponding architectures to conquer the drawbacks of techniques that do not employ content-based methods for portable multimedia devices. As the simulation results show, the proposed content-based ME algorithms and architectures achieve better power and quality performance for portable multimedia applications than those without the content-based methodology.



Chapter 1

Overview

1.1 Background

1.1.1 Video Coding System

Video coding systems have been developed for over twenty years to reduce the transmission rate or the number of stored bits, and they have proven to achieve this objective. Many standards have been defined to implement video coding systems, such as ISO/IEC MPEG-1, MPEG-2, MPEG-4 and the CCITT H.261 / ITU-T H.263, etc. [1–6]. The aim of these video standards is to remove the redundancies of the video sources and compress them to meet the constraints of limited transmission rate and storage. To achieve this, transform coding and predictive coding have become important strategies for identifying the large amount of spatial dependency and temporal redundancy in the video sources.

Figure 1-1 illustrates a typical block diagram of a JPEG, MPEG, H.261, or H.263 video coding system. The video encoder contains several major components, including the discrete cosine transform (DCT), inverse DCT (IDCT), motion estimation (ME), motion compensation (MC), quantization (Q), inverse quantization (Q⁻¹), and the variable-length coding (VLC) encoder.

Figure 1-1: Main processing flow in JPEG, MPEG, H.261, and H.263 encoding.

Among the components of such video systems, motion estimation is a key processor employing the predictive coding technique to eliminate temporal redundancy, and it is the most computationally expensive part. An encoder creates a prediction of the current frame based on the reference frame (either a previous or a future frame) and forms a residual between the current frame and the predictive frame. If the prediction is successful, encoding the residual requires fewer bits than encoding the original frame. Although motion estimation based on the predictive technique can achieve a high compression rate, it consumes much computational effort in the matching procedure if it performs an exhaustive searching strategy over the searching area. According to complexity analysis, the motion estimation part accounts for over 50% of the computational load of an MPEG or H.261 coding system [4, 7]. Thus, many motion estimation algorithms, optimizing either full search or fast search block matching, have become an important research field and have been developed to meet various requirements of video applications.

1.1.2 Motion Estimation

As mentioned above, motion estimation (ME) has been proven to effectively eliminate the temporal redundancy of video sequences and has therefore become a central part of the ISO/IEC MPEG-1, MPEG-2, MPEG-4 and the CCITT H.261 / ITU-T H.263 video compression standards. Motion estimation achieves a very high compression rate by identifying and eliminating the temporal redundancy, since there is a large amount of correlation between successive frames in a video sequence. Of the various approaches to motion estimation, the block-matching algorithm is the most widely used in video coding systems because of its regularity. The block-matching approach first divides a frame regularly into non-overlapping blocks of the same size and finds the motion vector of each block by locating the most similar block in the searching area. Given the motion vectors of the macro-blocks in a frame, the coding system encodes the residual between the original frame and the motion-compensated frame to raise the compression ratio.

Among block-matching algorithms, the full-search block-matching (FSBM) algorithm is the most popular approach because of its considerably good quality and regular data path. Many works have addressed the implementation of full-search architectures. Yang et al. presented 1-D array architectures [8] and many researchers addressed 2-D array architectures [9–11]. Lai et al. proposed an architecture with a data reuse scheme which accesses the reference pixels more efficiently but restricts the searching area [12]. Some works focused on low-power design of FSBM architectures [13, 14]. Tuan et al. provided a data reuse analysis of FSBM architectures and proposed a one-access architecture to achieve optimal memory bandwidth [15]. A fast full-search algorithm with adaptive scan direction has been presented to speed up the conventional full-search algorithm [16]. These works achieved significant results in the implementation of full-search block-matching algorithms.

Although the FSBM algorithm has the benefit of considerably good quality, it dominates the computation load and tends to consume significant power because of its exhaustive search scheme. To solve this problem, developing fast searching algorithms has become a most important issue for these video systems, especially for portable video devices powered by batteries. Many fast search algorithms have been proposed to alleviate the heavy computational load of FSBM by reducing the search steps, such as the three-step search (TSS) [17], the new three-step search (NTSS) [18], the one-dimensional full search (1DFS) [19], the four-step search (4SS) [20], and the diamond search (DS) [21–23]. Other researchers developed fast algorithms by simplifying the matching criterion [24–27]. These fast algorithms conquer the drawbacks of the full-search algorithm and have accomplished great achievements in video coding applications.

Chapter 2 illustrates several of these motion estimation algorithms in detail.


1.2 Objectives

Although a lot of effort has been devoted to the research field of motion estimation, employing the content methodology in motion estimation applications still seems to be lacking. The major objective of this thesis is to focus on exploiting content properties in motion estimation. On this theme, we concentrate on two parts: one is the fast searching algorithm and the other is the power-aware application. As mentioned above, a fast algorithm can reduce the computational complexity of the video system, so this topic is still worth developing, especially with the content methodology. The power-aware application has become very important as the demand for portable, battery-powered video devices has risen in recent years. The power-aware mechanism switches power consumption modes with graceful quality degradation, according to the non-ideal battery properties, to extend the battery life of such portable devices. This thesis proposes a content-based algorithm and implements a power-aware architecture to meet this requirement for portable applications.

In the proposed content-based fast algorithm, matching the low-resolution quantized edge pixels of a macro-block is used in the first phase. According to the edge-pixel span, the algorithm decides the suitable search scan direction to reuse the quantized data more efficiently; different video content leads to different scan directions in which the reusability of the quantized data is better. The first phase then removes the most impossible candidates and generates the survived motion vectors for the accurate matching of the second phase, which employs the SAD as the error criterion. This content-driven fast algorithm can significantly reduce the computational load compared with the full-search algorithm and be more efficient than the existing two-phase algorithm.

In the second part, a content-based algorithm is presented that performs the power-aware function by disabling/enabling processing elements according to a content-based subsample mask. In order to avoid the aliasing drawbacks of the generic subsample technique, the proposed content-based approach extracts the edge pixels of a macro-block and subsamples only the non-edge pixels to maintain the quality performance at an acceptable level. Since the power consumption is proportional to the subsample rate, the algorithm adopts a closed-loop control mechanism to keep the subsample rate in a stationary state. Founded on the content-based algorithm, the power-aware architecture can dynamically operate at different power consumption modes, with little quality degradation, according to the remaining capacity of the battery pack, achieving a better battery discharging property.
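To make the idea concrete, here is a minimal sketch of building a content-based subsample mask. The simple horizontal-difference gradient, the fixed threshold, and the checkerboard keep pattern are illustrative assumptions of this sketch, not the actual filters and closed-loop threshold control defined in Chapter 4.

```python
def subsample_mask(block, thresh, keep_every=2):
    """Sketch of a content-based subsample mask: pixels with a strong
    horizontal gradient count as edge pixels and are always kept, while
    non-edge pixels are kept only on a regular checkerboard pattern.
    A mask value of 1 means the pixel participates in matching."""
    n = len(block)
    mask = [[0] * n for _ in range(n)]
    for r in range(n):
        for c in range(n):
            # illustrative gradient: absolute left-neighbor difference
            grad = abs(block[r][c] - block[r][c - 1]) if c > 0 else 0
            is_edge = grad > thresh
            if is_edge or (r + c) % keep_every == 0:
                mask[r][c] = 1
    return mask
```

A flat block keeps only the regular pattern (half the pixels for `keep_every = 2`), while a block containing edges keeps more; in the closed-loop scheme the edge threshold would be adjusted to hold the overall keep rate near the target subsample rate.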

1.3 Organization of this Dissertation

The rest of this dissertation is organized as follows. In Chapter 2, we will present the related works of block-matching motion estimation. Chapter 3 illustrates the algorithm and architecture of edge-driven two-phase motion estimation. Then in Chapter 4, a content-based power-aware algorithm and architecture are addressed. Finally, conclusions and future works are shown in Chapter 5.


Chapter 2

Related Works

In this chapter, the related works on motion estimation are introduced. Section 2.1 illustrates the popular full search block matching (FSBM) algorithm, and fast search algorithms that overcome the drawbacks of the FSBM algorithm by various methods are presented in Section 2.2.

2.1 Full Search Block Matching Algorithm

The FSBM algorithm with the SAD criterion is the most popular approach for motion estimation because of its considerably good quality and regular data path. Figure 2-1 illustrates the block matching motion estimation process. The block matching process first divides a current frame into non-overlapping blocks of the same size N-by-N, each called a current macro-block (CMB). A current macro-block is then exhaustively matched with all the candidate macro-blocks, called reference macro-blocks (RMBs), in the searching area of the reference frame, which is either the previous or the next frame. Finally, the block matching algorithm identifies, among all the reference macro-blocks in the searching area, the macro-block with the minimum distortion relative to the current macro-block. The desired motion vector is the offset from that reference macro-block to the current macro-block.

Figure 2-1: Block matching motion estimation process.

The full search block matching algorithm uses (2-1) and (2-2) to compare each current macro-block with all the reference macro-blocks in the searching area to determine the best match:

$$\mathrm{SAD}(u,v) = \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} \left| R(i+u,\, j+v) - S(i,\, j) \right| \tag{2-1}$$

for $-p \le u, v < p$, and the motion vector is found by (2-2):

$$\overrightarrow{MV} = (u,v)\,\Big|_{\min_{-p \le u,v \le p-1} \mathrm{SAD}(u,v)} \tag{2-2}$$

where the macro-block size is N-by-N and $S(i,j)$ is the luminance value at $(i,j)$ of the current macro-block. $R(i+u,\, j+v)$ is the luminance value at $(i,j)$ of the reference macro-block offset by $(u,v)$ from the current macro-block in the searching range 2p-by-2p.
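As a concrete illustration, the exhaustive matching of (2-1) and (2-2) can be sketched in a few lines. The frame representation as nested lists and the tiny block size used below are assumptions for illustration only.

```python
def sad(cur, ref, N, x, y, u, v):
    """SAD of (2-1): distortion between the current macro-block at
    (x, y) and the reference macro-block displaced by (u, v)."""
    return sum(abs(ref[y + j + v][x + i + u] - cur[y + j][x + i])
               for j in range(N) for i in range(N))

def full_search(cur, ref, N, x, y, p):
    """FSBM of (2-2): test every displacement (u, v) in [-p, p) and
    return the one with minimum SAD (ties go to the first tested)."""
    return min(((u, v) for v in range(-p, p) for u in range(-p, p)),
               key=lambda mv: sad(cur, ref, N, x, y, mv[0], mv[1]))
```

For a 2p-by-2p searching range this evaluates (2p)² candidate blocks of N² pixels each, which is exactly the computational burden the fast algorithms of Section 2.2 try to avoid.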

2.2 Fast Search Algorithm

Although the FSBM algorithm has the benefits of considerably good quality and a regular data path, its huge number of comparison/difference operations results in high computational complexity and power consumption. To meet real-time applications, fast search algorithms have been widely developed and studied. These fast algorithms either reduce the search steps or simplify the calculation of the error criterion, and they can be divided into three main categories (although not limited to them):

1. By reducing the search steps.

2. By simplifying the matching criterion.

3. Two-phase algorithms.


In the following subsections, we will present these prior works.

2.2.1 Reduce the Searching Steps

Three Step Search (TSS)

The Three Step Search (TSS) [17] uses rectangular search patterns with logarithmically decreasing step size to test the checking points. Figure 2-2 illustrates the search patterns of the TSS with the search area from −7 to 7. Each check point marked in black is the local minimum distortion of its search step. In this illustration, the motion vector is (−4, 3). The total number of checking points of the TSS is 25 (= 9 + 8 + 8). Compared with the 225 checking points of the FSBM, the TSS has considerably lower computational complexity with little loss of motion-compensated quality.
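The three-step procedure above can be sketched as follows; the generic `cost` callback stands in for a block-distortion measure such as the SAD and is an assumption of this sketch.

```python
def three_step_search(cost, p=7):
    """Three Step Search: evaluate a 3-by-3 pattern whose step size
    (4, 2, 1 for p = 7) halves each round, re-centering on the best
    point of the round.  `cost(u, v)` is the distortion at (u, v)."""
    center = (0, 0)
    step = (p + 1) // 2                      # 4 when p = 7
    while step >= 1:
        # the 9 points of the current round, including the center
        candidates = [(center[0] + du, center[1] + dv)
                      for du in (-step, 0, step)
                      for dv in (-step, 0, step)]
        center = min(candidates, key=lambda c: cost(*c))
        step //= 2
    return center
```

With p = 7 this visits at most 9 + 8 + 8 = 25 distinct points, matching the count quoted above.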

New Three Step Search (NTSS)

The New Three Step Search (NTSS) algorithm, based on the center-biased distribution of motion vectors, was proposed to improve the performance of the TSS, since the TSS uses a uniform check-point pattern in its first step [18]. Figure 2-3(a) presents the procedure and (b) shows the check points of the first search step of the NTSS. The NTSS checks eight extra points around the search window center and uses a halfway-stop technique to speed up the matching process when the motion vector is stationary or quasi-stationary. The total number of check points of the NTSS ranges from 17 in the best case to 33 in the worst case.

Figure 2-2: The Three Step Search.

Figure 2-3: The New Three Step Search. (a) The NTSS procedure with halfway-stop decisions. (b) The 17 check points of the first step and the 3 or 5 extra check points of the second step.

Four Step Search (4SS)

Similar to the NTSS, the Four Step Search (4SS) uses the center-biased distribution of motion vectors and the halfway-stop approach to save check points [20]. Figure 2-4(a) illustrates the procedure of the 4SS and (b) shows two different search paths as examples. The black mark in each step is the point with the minimum distortion error, which serves as the search window center of the next step. From the first step to the third step the search window size is 5-by-5, and the final step uses 3-by-3. The number of check points of the 4SS varies from 17 to 27. It reduces the worst-case check points from 33 to 27 while retaining a motion-compensated error similar to the NTSS.

Diamond Search

The Diamond Search (DS) employs a diamond-shaped search pattern, which is the square-shaped search pattern of the 4SS rotated by 45° [21–23]. It results in fewer check points with motion-compensated distortion similar to the NTSS and 4SS. The DS uses the two search patterns shown in Fig. 2-5(a): the large diamond search pattern (LDSP) and the small diamond search pattern (SDSP). Figure 2-5(b) illustrates an example which leads to the motion vector (4, −2) in five search steps: four with the LDSP and one with the SDSP. As the experimental results show, the DS significantly improves the performance in terms of the required number of check points.
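A sketch of the diamond search loop follows; again `cost(u, v)` is an assumed block-distortion callback, and search-range clamping is omitted for brevity.

```python
# Large and small diamond search patterns, as offsets from the center.
LDSP = [(0, 0), (2, 0), (-2, 0), (0, 2), (0, -2),
        (1, 1), (1, -1), (-1, 1), (-1, -1)]
SDSP = [(0, 0), (1, 0), (-1, 0), (0, 1), (0, -1)]

def diamond_search(cost):
    """Diamond Search: repeat the LDSP until its minimum stays at the
    center, then refine once with the SDSP."""
    center = (0, 0)
    while True:
        best = min(((center[0] + du, center[1] + dv) for du, dv in LDSP),
                   key=lambda c: cost(*c))
        if best == center:           # minimum at center: switch to SDSP
            break
        center = best
    return min(((center[0] + du, center[1] + dv) for du, dv in SDSP),
               key=lambda c: cost(*c))
```

On a unimodal cost with its minimum at (4, −2), this loop reproduces the motion vector of the example in Fig. 2-5(b).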

2.2.2 Simplifying the Matching Criterion

The matching criterion is employed to identify the error distortion between the current macro-block and a reference macro-block. Equation (2-3) shows the criterion of the mean square error (MSE), which can achieve significant motion-compensated quality:

$$\mathrm{MSE}(u,v) = \frac{1}{N \cdot N} \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} \left( R(i+u,\, j+v) - S(i,\, j) \right)^2 \tag{2-3}$$

Figure 2-4: The Four Step Search. (a) The 4SS procedure with halfway-stop decisions. (b) Two example search paths.

Figure 2-5: The Diamond Search. (a) The large diamond search pattern (LDSP) and the small diamond search pattern (SDSP). (b) An example search path leading to the motion vector (4, −2).

where all the variables are defined as in (2-1) and (2-2). However, the square operation in this error criterion consumes a large computational load. To reduce the computational complexity, the mean absolute difference (MAD), also called the mean absolute error (MAE), is defined as

$$\mathrm{MAD}(u,v) = \frac{1}{N \cdot N} \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} \left| R(i+u,\, j+v) - S(i,\, j) \right| \tag{2-4}$$

In practical applications, the sum of absolute difference (SAD), defined in (2-1), is usually employed instead of the MAD to avoid the mean operation.
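The three criteria differ only in the reduction applied to the pixel differences, as this small sketch shows (nested-list blocks are an assumption of the sketch):

```python
def _diffs(cur, ref):
    """Pairwise pixel differences R - S between two equal-sized blocks."""
    return [r - c for row_c, row_r in zip(cur, ref)
                  for c, r in zip(row_c, row_r)]

def mse(cur, ref):             # (2-3): mean of squared differences
    return sum(d * d for d in _diffs(cur, ref)) / (len(cur) ** 2)

def sad_criterion(cur, ref):   # (2-1): sum of absolute differences
    return sum(abs(d) for d in _diffs(cur, ref))

def mad(cur, ref):             # (2-4): SAD divided by the pixel count
    return sad_criterion(cur, ref) / (len(cur) ** 2)
```

Since MAD = SAD / N², both rank candidate blocks identically; SAD simply drops the division, which is why hardware implementations prefer it.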

In this subsection, some techniques that conquer the drawbacks of the MSE and MAD by simplifying the error criterion in the matching process are presented.

The Pixel Difference Criterion (PDC)

In this technique, the matching criterion employs the pixel difference count [28]. Each pixel in a macro-block is classified as either a matching or a mismatching pixel by

$$T_{u,v}(i,j) = \begin{cases} 1, & \text{if } |R(i+u,\, j+v) - S(i,\, j)| \le t \\ 0, & \text{otherwise} \end{cases} \tag{2-5}$$

for $0 \le i, j < N$, where $t$ is a predefined threshold. The PDC is then defined as

$$PDC(u,v) = \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} T_{u,v}(i,j) \tag{2-6}$$

Since the PDC counts the number of matching pixels between the current macro-block and a reference macro-block, the motion vector is the one that maximizes the PDC:

$$\overrightarrow{MV} = (u,v)\,\Big|_{\max_{-p \le u,v \le p-1} PDC(u,v)} \tag{2-7}$$
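A direct rendering of (2-5) and (2-6) follows; the threshold values in the test are arbitrary illustrations.

```python
def pdc(cur, ref, t):
    """Pixel difference count of (2-5)-(2-6): the number of pixel pairs
    whose absolute difference is within the threshold t.  Unlike SAD,
    a larger PDC means a better match, per (2-7)."""
    return sum(1 for row_c, row_r in zip(cur, ref)
                 for c, r in zip(row_c, row_r)
                 if abs(r - c) <= t)
```

Because each pixel contributes only a one-bit match flag, the accumulation needs a counter rather than a wide adder, which is the hardware appeal of this criterion.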

Integral Projection Matching (IPM)

Integral projection matching (IPM) extracts features of a macro-block to use as the matching criterion instead of the criteria mentioned above [26, 29, 30]. The principle of projection matching is to create cost functions by summing the luminance values of each row and each column. Equations (2-8) and (2-9) show the integral projections of the current macro-block:

$$H_k(m) = \sum_{i=0}^{N-1} S(i,\, m) \tag{2-8}$$

$$V_k(n) = \sum_{j=0}^{N-1} S(n,\, j) \tag{2-9}$$

for $0 \le m, n < N$. In the same manner, equations (2-10) and (2-11) give the projections of the reference macro-block with searching area parameter p:

$$H_{k-1}(m, u, v) = \sum_{i=0}^{N-1} R(i+u,\, m+v) \tag{2-10}$$

$$V_{k-1}(n, u, v) = \sum_{j=0}^{N-1} R(n+u,\, j+v) \tag{2-11}$$

for $0 \le m, n < N$ and $-p \le u, v < p$, where $R(\cdot)$ and $S(\cdot)$ are the luminance values of the reference and current macro-blocks as defined above. After the integral projection cost functions are calculated, IPM performs the matching step by (2-12) to (2-15):

$$D_H(u,v) = \sum_{m=0}^{N-1} \left| H_k(m) - H_{k-1}(m, u, v) \right| \tag{2-12}$$

$$D_V(u,v) = \sum_{n=0}^{N-1} \left| V_k(n) - V_{k-1}(n, u, v) \right| \tag{2-13}$$

$$MV_y = v\,\big|_{\min_{-p \le u,v < p} D_H(u,v)} \tag{2-14}$$

$$MV_x = u\,\big|_{\min_{-p \le u,v < p} D_V(u,v)} \tag{2-15}$$
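The projection cost functions reduce an N² block comparison to two length-N vector comparisons, as sketched below (nested-list blocks are an assumption of the sketch):

```python
def row_proj(block):
    """Projection over one axis, as in (2-8): sum of each row."""
    return [sum(row) for row in block]

def col_proj(block):
    """Projection over the other axis, as in (2-9): sum of each column."""
    return [sum(col) for col in zip(*block)]

def proj_distance(a, b):
    """Absolute projection distance of (2-12)/(2-13)."""
    return sum(abs(x - y) for x, y in zip(a, b))
```

Per (2-14) and (2-15), each component of the motion vector is chosen by minimizing one of the two projection distances over the candidates, so a full N-by-N SAD per candidate is replaced by two length-N comparisons.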

2.2.3 Two-Phase Algorithm

Low Resolution Quantization Method

A two-phase fast algorithm using a low-resolution quantized scheme was presented by Lee et al. [31]. In the first phase, each pixel value of the current macro-block and the reference macro-blocks is quantized to a two-bit low-resolution value by

$$\hat{f}_k(i,j) = Q_2\big(f_k(i,j) - Avg_k\big) \tag{2-16}$$

where $Avg_k$ is the average pixel value of the current macro-block, defined as

$$Avg_k = \frac{1}{N^2} \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} f_k(i,j) \tag{2-17}$$

The first phase then matches the low-resolution quantized values by

$$DPC(u,v) = \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} \delta\big[\hat{f}_k(i,j),\, \hat{f}_{k-1}(u+i,\, v+j)\big] \tag{2-18}$$

where

$$\hat{f}_{k-1}(u+i,\, v+j) = Q_2\big(f_{k-1}(u+i,\, v+j) - Avg_k\big) \tag{2-19}$$

and

$$\delta\big[\hat{f}_k, \hat{f}_{k-1}\big] = \begin{cases} 0, & \text{for } \hat{f}_k = \hat{f}_{k-1} \\ 1, & \text{otherwise} \end{cases} \tag{2-20}$$

After the low-resolution matching, the first phase generates a predefined number of survived motion vectors with minimum DPC in each row for the further accurate matching of the second phase. In the second phase, the algorithm determines the motion vector from the survived motion vectors by matching with the SAD criterion.
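The first-phase screening can be sketched as follows. The two-bit quantizer here uses assumed thresholds around the block mean — the actual thresholds of Q2 in [31] are not reproduced — so this is a sketch of the scheme, not the published implementation.

```python
def q2(d, T=8):
    """Assumed two-bit quantizer for the mean-removed value d: four
    levels split at -T, 0 and +T (thresholds are illustrative only)."""
    if d < -T:
        return 0
    if d < 0:
        return 1
    if d < T:
        return 2
    return 3

def dpc(cur, cand, avg, T=8):
    """Difference pixel count of (2-18): pixels whose two-bit codes,
    per (2-16) and (2-19), differ between current and candidate blocks."""
    return sum(1 for row_c, row_r in zip(cur, cand)
                 for c, r in zip(row_c, row_r)
                 if q2(c - avg, T) != q2(r - avg, T))
```

The first phase keeps only the candidates with the smallest DPC as survived motion vectors; the second phase then re-ranks those few survivors with the full SAD of (2-1).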


Chapter 3

Edge-driven Two-Phase Motion

Estimation

3.1 Introduction

This chapter presents an edge-driven two-phase algorithm and architecture, called the Edge-matching First Block-matching Last Algorithm (EFBLA), for fast motion estimation [32, 33]. In the proposed two-phase motion estimation, the major matching criterion in the first phase is the low-resolution quantized edge pixels of a macro-block. According to the edge-pixel span, the algorithm decides the suitable search scan direction to reuse the quantized data more efficiently. It then generates the survived motion vectors for the second phase, which employs the SAD as the error criterion to perform accurate matching. This content-driven algorithm can significantly reduce the computational load compared with the full-search algorithm and is still more efficient than the existing two-phase algorithm.

Many papers have proposed different ways to reduce the computational complexity of the full search algorithm. Most of them target the elimination of impossible motion vectors, such as SEA [34] and LRQ [31, 35, 36], and only perform complete matching for the possible candidates. They have done great jobs in reducing the number of block-matching evaluations and thereby save computation power and cost. Applying this philosophy, this thesis proposes a two-phase algorithm employing the content methodology to remove the impossible candidates. The edge-driven two-phase algorithm contains two major procedures: edge matching and block matching. Our goal is to decrease the number of block-matching evaluations without degrading the video quality much, such that the computation load can be significantly reduced. Hence, effectively removing the impossible motion vectors is the key to solving the cost-consuming problem of the full search algorithm.

The edge-matching procedure does not require complex computation; it only needs shift, quantization, comparison and threshold operations. The procedure first applies a high-pass filter to a macro-block of the current frame, called the current macro-block, and then determines the edge pixels, those whose value is larger than a threshold. According to the distribution of edge pixels, the procedure determines the scan direction for a high degree of data reusability. Then, the EFBLA starts matching the current macro-block with the reference macro-blocks in the searching area of the reference frame. The matching order is based on the scan direction, and the matching criterion is the unmatched edge-pixel count (UEPC). An unmatched edge pixel is a pixel of the current macro-block whose low-resolution quantized value differs from that of the corresponding edge pixel of the reference macro-block. Obviously, the smaller the UEPC value, the more similar the target block is to the reference block. Thus, the EFBLA only picks the motion vectors with lower UEPCs as the survived motion vectors (SMVs). Following the edge-matching phase, the proposed algorithm performs accurate block matching with the SAD criterion on those SMVs. As the results of simulating MPEG video clips show, the EFBLA requires far fewer addition operations than the full search algorithm.

This chapter is organized as follows. This section introduced the background and motivation of the two-phase algorithm. Section 3.2 presents the EFBLA in detail and Section 3.3 proposes a hardware architecture based on the EFBLA. Section 3.4 shows the experimental results, and a brief summary of this work is given in Section 3.5.

3.2 Algorithm

Figure 3-1 illustrates the flow chart of the Edge-matching First Block-matching Last Algorithm (EFBLA). Assume that the macro-block size is N-by-N and the searching window is 2p-by-2p. The orientation of the current macro-block is (x, y).

3.2.1 Edge-matching phase

The edge-matching phase of the EFBLA contains five steps, which are described below:

Step 1. Perform high-pass filter on the current macro-block.

In the first phase, the proposed algorithm first performs edge extraction using a general high-pass spatial filter mask, as shown in (3-1) [37].

[Figure 3-1: Flow chart of the EFBLA. Phase 1: (Step 1) high-pass filter on the CMB, (Step 2) edge determination, (Step 3) scan-direction determination, (Step 4) quantization of the CMB edge pixels, and (Step 5) a UEPC matching loop, row-by-row or column-by-column, that selects two SMVs per scan line; Phase 2: searching among the SMVs with the SAD criterion to output the MV.]

In (3-1), G(i, j) expresses the gradient of the pixel at (i, j); the larger the value of G(i, j), the more likely the pixel lies on an edge.

G(i, j) = \left| \sum_{\Delta i=-1}^{1} \sum_{\Delta j=-1}^{1} c \cdot S(i + \Delta i, j + \Delta j) \right|, where c = 8 when (\Delta i, \Delta j) = (0, 0) and c = -1 otherwise,   (3-1)

for 0 ≤ i, j < N.
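A minimal Python sketch of the filter in (3-1). Clamping out-of-range neighbours to the border is an assumption here, standing in for the multiplexer-based boundary handling of the hardware.

```python
def gradient(block, i, j):
    """High-pass spatial filter of (3-1): 8 times the centre pixel minus
    its eight neighbours, in absolute value."""
    n = len(block)
    total = 0
    for di in (-1, 0, 1):
        for dj in (-1, 0, 1):
            c = 8 if (di, dj) == (0, 0) else -1
            ii = min(max(i + di, 0), n - 1)  # border clamp (an assumption)
            jj = min(max(j + dj, 0), n - 1)
            total += c * block[ii][jj]
    return abs(total)
```

On a uniform block the gradient is zero; a pixel that differs from its surroundings yields a large G(i, j).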

Step 2. Edge Determination

In the EFBLA, we use a local edge-determination method within the current macro-block. It calculates the edge threshold defined in (3-2) to determine the edge pixels. Basically, the algorithm considers pixels with G(i, j) greater than or equal to E_{th} to be edge pixels, as shown in (3-3). If the pixel at (i, j) is an edge pixel, α(i, j) is set to 1; otherwise, α(i, j) is set to 0.

E_{th} = (max(G(i, j)) + min(G(i, j))) / 2   (3-2)

α(i, j) = 1 if G(i, j) ≥ E_{th}, 0 otherwise   (3-3)

In order to increase the accuracy of the edge matching, the EFBLA also regards the pixels adjacent to pixels with G(i, j) greater than E_{th} as edge pixels. Thus, the EFBLA employs the edge extension shown in (3-4) to mark the edge pixels.

α(i, j) = 1 if G(i ± 1, j ± 1) ≥ E_{th}, 0 otherwise   (3-4)

Step 3. Determine the scan direction.


The data reusability is highly dependent on the scan direction because the first phase employs the criterion of unmatched edge-pixel count (UEPC), which will be illustrated in Step 5. Before computing the UEPC, the EFBLA has to quantize the edge pixels in the macro-block first, and there are highly duplicated data in successive searching steps. Fig. 3-2 shows an example of the impact of the scan direction on the data reusability. If the edge pixels are widely distributed along the y-coordinate, searching along the x-coordinate can reuse the quantized data efficiently. Fig. 3-2(a) shows an 8-by-8 macro-block in which black circles mark the edge pixels, that is, the positions where α(i, j) is equal to 1. In Fig. 3-2(b) and (c), we assume that the searching position shifts from A to B. The gray and black marks represent the edge pixels when the reference macro-block is at position A. The black and white marks represent the edge pixels when the target block is at position B. Therefore, the quantized data at the black marks can be reused in the matching step that uses the UEPC criterion. Obviously, it only needs to calculate the quantized edge pixels at the white marks, then subtract the unmatched edge-pixel counts at the gray marks and add those at the white marks. So the scan direction in Fig. 3-2(b) has a higher degree of data reusability than that in Fig. 3-2(c).

The EFBLA has two scan directions: column-by-column and row-by-row, as illustrated in Fig. 3-3. To decide the scan direction, this step first determines the span width of the edge pixels along the x-coordinate, named the x-span, and the span width along the y-coordinate, named the y-span. If the x-span is smaller than the y-span, the step selects the column-by-column scan direction; otherwise, the scan direction is row-by-row. In the example shown in Fig. 3-2, the x-span is four and the y-span is eight, and therefore the efficient scan direction is column-by-column.
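Steps 2 and 3 can be sketched together in pure Python; `grads` is the N-by-N gradient array from Step 1, and all names are illustrative.

```python
def edge_mask_and_direction(grads):
    """Threshold (3-2), edge mask with one-pixel extension (3-4),
    then pick the scan direction from the x-span and y-span."""
    n = len(grads)
    flat = [g for row in grads for g in row]
    e_th = (max(flat) + min(flat)) / 2.0            # (3-2)
    core = [[1 if grads[i][j] >= e_th else 0
             for j in range(n)] for i in range(n)]  # (3-3)
    alpha = [[0] * n for _ in range(n)]
    for i in range(n):                              # (3-4) edge extension
        for j in range(n):
            if core[i][j]:
                for di in (-1, 0, 1):
                    for dj in (-1, 0, 1):
                        if 0 <= i + di < n and 0 <= j + dj < n:
                            alpha[i + di][j + dj] = 1
    rows = [i for i in range(n) if any(alpha[i])]
    cols = [j for j in range(n) if any(alpha[i][j] for i in range(n))]
    x_span = (cols[-1] - cols[0] + 1) if cols else 0
    y_span = (rows[-1] - rows[0] + 1) if rows else 0
    direction = "column-by-column" if x_span < y_span else "row-by-row"
    return alpha, direction
```

For a vertical edge (narrow x-span, wide y-span), the sketch selects column-by-column scanning, matching the example of Fig. 3-2.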

Step 4. Quantize the edge pixels of the macro-block.

This step quantizes the pixel values at the edge pixels for low-resolution computation. The philosophy of two-phase motion estimation is to eliminate impossible motion vectors at the lowest computation cost. Hence, the EFBLA utilizes low-resolution computation to perform the edge matching.

Equation (3-5) represents the quantization of the current macro-block, where \hat{S}(i, j) is the value of the two most significant bits (MSBs) of (S(i, j) - Avg_k). The reason that the step quantizes (S(i, j) - Avg_k) instead of S(i, j) is that the former has a higher variance than the latter; the higher variance leads to a higher degree of accuracy for the edge matching.

\hat{S}(i, j) = Q_2(S(i, j) - Avg_k), \forall \alpha(i, j) = 1,   (3-5)

where Avg_k is the average of all pixels of the current macro-block, defined as

Avg_k = \frac{1}{N^2} \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} S(i, j)   (3-6)
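A sketch of Step 4, assuming a simple threshold-based stand-in for the two-MSB quantizer Q_2 (the exact bit slicing is not specified here).

```python
def quantize_block(block, alpha):
    """Step 4 sketch: compute Avg_k (3-6), then keep a 2-bit code of
    (pixel - Avg_k) at the edge pixels only (3-5).  Non-edge positions
    are left as None since they never enter the UEPC matching."""
    n = len(block)
    avg_k = sum(sum(row) for row in block) / float(n * n)  # (3-6)

    def q2(diff):  # hypothetical 2-MSB quantizer
        if diff < -64:
            return 0
        if diff < 0:
            return 1
        if diff < 64:
            return 2
        return 3

    codes = [[q2(block[i][j] - avg_k) if alpha[i][j] else None
              for j in range(n)] for i in range(n)]
    return codes, avg_k
```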

Step 5. Perform edge matching and generate SMVs.

Upon the completion of Steps 3 and 4, the first phase starts to perform edge matching. First, the EFBLA matches the motion vectors along the scan direction obtained in Step 3. The edge matching employs the criterion of unmatched edge-pixel count (UEPC), as shown in (3-7). In (3-7), \hat{R}(u + i, v + j) is the quantization result of the reference macro-block with the motion vector (u, v).

[Figure 3-2: Impact of the scan direction on data reusability: (a) edge pixels of an 8-by-8 macro-block with their x-span and y-span; (b), (c) the searching position shifting from A to B under the two scan directions.]

Figure 3-3: Two scan directions employed in EFBLA: row-by-row and column-by-column.

UEPC(u, v) = \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} \alpha(i, j) \cdot \delta[\hat{S}(i, j), \hat{R}(u + i, v + j)],   (3-7)

where

\hat{R}(u + i, v + j) = Q_2(R(u + i, v + j) - Avg_k), \forall \alpha(i, j) = 1,   (3-8)

and the delta function is defined as

\delta[\hat{S}, \hat{R}] = 0 for \hat{S} = \hat{R}, 1 otherwise.   (3-9)

Next, this step generates a pair of SMVs for each scan line, either row or column. The motion vectors with high UEPCs on a scan line are most likely impossible ones. Thus, the EFBLA only picks the motion vectors of the reference macro-blocks with the two lowest UEPCs as the survived motion vectors (SMVs).
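The UEPC matching of (3-7) and the per-line SMV selection can be sketched as follows, assuming a row-by-row scan and that both the current block and the whole reference search area have already been quantized to low-resolution codes (indexing conventions here are illustrative).

```python
def uepc(alpha, q_cur, q_ref, u, v):
    """Unmatched edge-pixel count (3-7) for the candidate at array
    offset (u, v) into the quantized reference search area."""
    n = len(alpha)
    return sum(1
               for i in range(n) for j in range(n)
               if alpha[i][j] and q_cur[i][j] != q_ref[u + i][v + j])

def survived_mvs(alpha, q_cur, q_ref, p):
    """Step 5 sketch: keep the two candidates with the lowest UEPC in
    every row of the 2p-by-2p search window."""
    smvs = []
    for u in range(2 * p):                       # one search row at a time
        scores = [(uepc(alpha, q_cur, q_ref, u, v), (u - p, v - p))
                  for v in range(2 * p)]
        scores.sort()                            # lowest UEPC first
        smvs.extend(mv for _, mv in scores[:2])  # two SMVs per row
    return smvs
```

The first phase thus hands 2-by-2p candidates to the second phase instead of the full (2p)^2 search positions.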

3.2.2 Block-matching phase

Following the edge-matching phase, the second phase of the EFBLA performs block matching with the SAD criterion on the SMVs. Note that this block matching requires far fewer evaluations than the traditional full-search block matching because the first phase has eliminated a large number of impossible motion vectors.
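A sketch of the second phase; motion vectors in [-p, p) are mapped back to array offsets into the search area by adding p, and all names are illustrative.

```python
def sad(cur, ref, u, v):
    """Sum of absolute differences between the current macro-block and
    the reference block at array offset (u, v)."""
    n = len(cur)
    return sum(abs(cur[i][j] - ref[u + i][v + j])
               for i in range(n) for j in range(n))

def second_phase(cur, ref, smvs, p):
    """Evaluate the SAD only at the survived motion vectors and
    return the best one together with its distortion."""
    best_mv, best_sad = None, None
    for (mu, mv) in smvs:
        s = sad(cur, ref, mu + p, mv + p)  # MV -> array offset
        if best_sad is None or s < best_sad:
            best_mv, best_sad = (mu, mv), s
    return best_mv, best_sad
```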

3.3 Architecture

According to the Edge-matching First Block-matching Last Algorithm depicted in the previous section, this thesis proposes a two-phase VLSI architecture whose block diagram is shown in Fig. 3-4. In order to achieve the goal of parallel processing and avoid multiple data accesses to the off-chip frame memory, the proposed architecture is based on a two-dimensional systolic array in both phases and saves the data of the current/reference macro-block in the CMB/RMB buffer. In the following subsections, the architecture and behavior of each block are illustrated.

3.3.1 The First Phase

The architecture of the first phase contains a Current Macro-Block Buffer, an Edge Generator Unit, a UEPC PEs Array, a Reference Macro-Block Buffer, a Quantization Unit, an Adder Tree and a Survived Motion Vectors Selector. After the edge-matching process, the first phase generates two survived motion vectors in each searching row/column for the second phase to perform more accurate matching.


[Figure 3-4: Block diagram of the proposed two-phase architecture: CMB/RMB buffers, Edge Generator Unit, Quantization Unit, UEPC PEs Array, and UEPC Accumulator & SMVs Selector in the first phase; Accumulator Cells Array and Motion Vector Selector in the second phase.]

Edge Generator Unit

Fig. 3-5 presents the architecture of the Edge Generator Unit, which produces the edge mask and decides the search scan direction as described in Steps 1 to 3 of Section 3.2. This unit contains two main blocks: the high-pass filter block and the edge determination block. The former calculates the gradient of each pixel in the current macro-block, as shown in equation (3-1). The latter determines the edge mask and the x/y-span depicted in Step 3 of the EFBLA.

According to (3-1), the high-pass filter calculates the gradient of a target pixel from the eight neighboring pixels around it. The data paths CMB1, CMB2, and CMB3 are the input interfaces of the previous line, the current line and the next line from the CMB buffer. The left and right pixels can be preserved by simple delay elements. In order to avoid boundary errors when the target pixel is on the border, the proposed architecture uses multiplexers to substitute existing pixels for the null data outside the current macro-block. The black dot in each multiplexer indicates the switching path when the filter unit is processing a border pixel. Calculating the gradient value of a target pixel requires a total of six equivalent adder operations, namely five adder operations and one absolute operation; we treat the computational load of an absolute operation as an adder operation. The computational load of the ×8 is ignored since it can be implemented with a simple shift operation.

The edge determination unit, whose structure is illustrated in the right part of Fig. 3-5, implements two main functions. The first is to find the maximum and minimum gradient values of the current macro-block and then determine the threshold value according to equations (3-2) to (3-4). The second is to decide the search scan direction depicted in Step 3 of the EFBLA.



Figure 3-5: Architecture of Edge Generator Unit.

The determination of the scan direction uses a simple logic OR gate and a look-up table (LUT) to figure out the x/y-span. The edge determination unit generates the edge mask and scan direction for the UEPC matching in the first phase.

RMB Buffer and Quantization Unit

Figure 3-6 illustrates the architecture of the RMB buffer and Quantization Unit. The reference macro-block (RMB) buffer has two major functions: one is to provide the parallel data for the UEPC PEs array in the first phase; the other is to buffer the data of the reference macro-block for the second phase, which ensures that the data are accessed from the reference frame memory only once. In each clock period, the RMB buffer provides N pixels simultaneously to the Quantization Unit, which converts them into low-resolution data for the UEPC matching procedure.

In order to save hardware resources, the quantization procedure for the current macro-block shares the same quantization cell with the reference macro-block. At the initial time, (N + 2p - 1) × (N - 1) cycles are needed to store the reference macro-block so that it is ready to provide the parallel data for the PEs array. In this period, the Quantization Unit is idle and can be switched to quantize the current macro-block.

Figure 3-6: Architecture of the Shift Register Array and Low-Resolution Quantization.

Processing Elements Array

The architecture of the Processing Elements Array is illustrated in Fig. 3-7. The array is composed of N-by-N processing elements to calculate the unmatched edge-pixel count criterion shown in (3-7). The data path of the CMB at the tail of a row is linked to the head of the next row, and thus it takes N² cycles to shift all the quantized data of the current macro-block into the UEPC PEs array. With this linked data path, quantizing the current macro-block requires only one active Quantization Unit.

Since the first phase uses the unmatched edge-pixel count criterion, the PEs array activates a processing element only when the corresponding pixel is an edge pixel, that is, when the edge mask α(u, v) equals 1 as defined in (3-4). The turn-on/off signal comes from the Edge Mask generated by the Edge Generator Unit.

[Figure 3-7: Architecture of the UEPC PEs Array, with the quantized CMB and RMB data feeding the Adder Tree.]

Figure 3-8: Architecture of a Processing Element.

The processing element, whose architecture is shown in Fig. 3-8, performs the unmatched edge-pixel comparison and produces a 1 if the quantized data of the current macro-block is not identical to that of the reference macro-block. Each processing element contains two two-bit shift registers to store the low-resolution information of the current and reference macro-blocks. The comparison circuit in a PE can be implemented with two exclusive-OR gates and one OR gate. After the matching process, each processing element outputs a one-bit signal to the adder tree and SMVs selector for further evaluation of the correlation between the current and reference macro-blocks.

UEPC Accumulator and SMVs Selector

The UEPC accumulator is used to accumulate the unmatched edge-pixel signals from the processing elements. A look-up table (LUT) in each column converts the unmatched signals into a binary number that counts how many unmatched pixels are in that column. The binary numbers are then summed by a parallel adder tree to measure the total number of unmatched edge pixels in the macro-block. The SMVs selector uses these unmatched edge-pixel counts to pick two survived motion vectors in each column for the further detailed matching of the second phase. Thus the first phase delivers 2-by-2p survived motion vectors, the most probable motion vectors, to the second phase. The architecture of the UEPC accumulator and SMVs selector is shown in Fig. 3-9.

Figure 3-9: Architecture of the UEPC Adder Tree and SMVs Selector, assuming N is 16.

3.3.2 The Second Phase

The second phase consists of an Accumulator Cells Array and a Motion Vector Selector. The former accumulates the SADs at the positions of the survived motion vectors and is composed of 2-by-N accumulator cells. The latter compares the SADs calculated by the Accumulator Cells Array and picks the best motion vector, the one with the minimum distortion among the SMVs.

The Accumulator Cells Array

The second phase performs further matching with the SAD criterion among the SMVs generated by the first phase. Figure 3-10 shows the block diagram of this phase. It consists of an accumulator cells array with dimension N-by-2 and a controller. Each accumulator cell calculates the SAD value of one row of a macro-block. The architecture of the accumulator cell is shown in the dashed circle on the right-hand side. The enable signal from the controller activates the accumulator cell when the data on the RMB bus are in the range of the searching position of the corresponding survived motion vector generated by the first phase.



Figure 3-10: Architecture of the second phase.

When the index counter is in the range of the SMVs, the controller generates the enable signal for the corresponding accumulator cell to calculate the SAD value at this searching position. The control signal named Sel switches the multiplexer to receive the partial SAD value from the previous accumulator cell.

Figure 3-11 shows the execution of the accumulator cells array in the second phase. The diagram assumes that the macro-block size N is 8 and the searching window ranges from -8 to 7, with a row-direction searching scan. In this illustration, the SMVs are (-8, -2) and (-8, 4) in the first row, (-7, -7) and (-7, 7) in the second row, and so on. To keep the graph concise, the diagram does not show every box of all SMVs.



Figure 3-11: Execution of the Accumulator Cells Array for N = 8 and p = 8.

The Motion Vector Selector

In the final step, the Motion Vector Selector receives the matching results from the accumulator cells array and compares the SADs step by step to figure out the motion vector with the minimum distortion among the survived motion vectors.

3.4 Performance Analysis

The proposed algorithm significantly reduces the number of motion vectors that require costly evaluations. To compare with other motion estimation algorithms, this chapter uses two metrics: computation cost and the mean absolute difference (MAD). Since the major operation of motion estimation algorithms is addition, we approximately consider the total number of equivalent additions, denoted as ε_adder, required for each macro-block as the computation cost. In this chapter, 21 MPEG video clips in CIF format serve as the test bench [38]. Each frame has 352-by-288 pixels and each pixel has 8-bit gray resolution. The macro-block size N is 16-by-16, and the search window range is from (-16, -16) to (15, 15). The full search algorithm (FS) and a two-phase algorithm, the low-resolution quantization algorithm (LRQ), serve as comparisons with the proposed EFBLA.

Tables 3.I and 3.II show the quality performance and computational load for these test clips. The results shown in the two tables are averages over 100 frames for each test clip. Obviously, the EFBLA saves a significant 17.47% of the computation cost while the average MAD degradation is only 0.065 per pixel compared with LRQ. Fig. 3-12 (a) to (d) show the MAD curves of four typical clips and demonstrate that the quality of the EFBLA is very close to that of the others. The Akiyo and Weather clips are slow motion, the Children clip belongs to the middle-motion type, and the test sequence Stefan is fast motion. These results show that the EFBLA achieves a lower computational load while maintaining good quality.
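The percentage savings reported in the tables follow directly from the equivalent-addition counts per macro-block; a quick check of the averages:

```python
def savings(base_adders, new_adders):
    """Relative change in equivalent additions per macro-block
    (a negative value means a saving), as reported in Table 3.II."""
    return 100.0 * (new_adders - base_adders) / base_adders

# Average figures from Table 3.II: EFBLA needs 57035 equivalent
# additions per macro-block versus 711341 for FS and 69105 for LRQ.
vs_fs = round(savings(711341, 57035), 2)   # about -91.98
vs_lrq = round(savings(69105, 57035), 2)   # about -17.47
```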

3.5 Brief Summary

This chapter proposes a two-phase algorithm and architecture that significantly reduce the computational load of motion estimation by removing unlikely motion vectors in the first phase. As the results of simulating video clips show, the quality degradation compared with FS is very small, only 0.435 per pixel in MAD on average. In addition, the algorithm features adaptive selection of the scan direction, which yields a high degree of data reusability and a low memory requirement.


Table 3.I.: Quality degradation analysis for different video clips.

Clips            FS     LRQ    EFBLA  vs. FS  vs. LRQ
akiyo            0.605  0.645  0.652  0.047    0.007
children         2.572  2.882  2.930  0.358    0.048
coastguard       5.341  6.309  6.390  1.049    0.081
container        1.564  1.578  1.591  0.027    0.013
dancer           2.696  3.963  3.974  1.278    0.011
destruct         4.022  4.439  4.475  0.454    0.036
flower           6.000  6.367  6.491  0.491    0.124
foreman          2.838  3.614  3.684  0.846    0.070
hall monitor     2.543  2.678  2.686  0.143    0.008
mobile           8.837  9.053  9.445  0.608    0.392
mother daughter  1.496  1.646  1.645  0.148   -0.001
news             1.197  1.336  1.349  0.151    0.013
paris            2.500  2.732  2.782  0.282    0.049
sean             1.647  1.713  1.725  0.078    0.012
silent           1.723  1.923  1.930  0.207    0.007
singer           0.821  0.885  0.885  0.064   -0.000
stefan           6.615  7.429  7.715  1.099    0.286
table tennis     4.388  5.262  5.298  0.910    0.036
tempete          5.685  6.181  6.336  0.651    0.155
waterfall        2.948  3.152  3.150  0.202   -0.002
weather          0.797  0.830  0.847  0.050    0.017
Average          3.183  3.553  3.618  0.435    0.065


Table 3.II.: Computational load analysis for different video clips.

Clips            FS      LRQ    EFBLA  vs. FS   vs. LRQ
akiyo            711341  69105  57138  -91.97%  -17.32%
children         711341  69105  55823  -92.15%  -19.22%
coastguard       711341  69105  58278  -91.81%  -15.67%
container        711341  69105  57416  -91.93%  -16.92%
dancer           711341  69105  58929  -91.72%  -14.72%
destruct         711341  69105  56073  -92.12%  -18.86%
flower           711341  69105  57918  -91.86%  -16.19%
foreman          711341  69105  56121  -92.11%  -18.79%
hall monitor     711341  69105  56671  -92.03%  -17.99%
mobile           711341  69105  56742  -92.02%  -17.89%
mother daughter  711341  69105  57223  -91.96%  -17.19%
news             711341  69105  56064  -92.12%  -18.87%
paris            711341  69105  56042  -92.12%  -18.90%
sean             711341  69105  56911  -92.00%  -17.65%
silent           711341  69105  56887  -92.00%  -17.68%
singer           711341  69105  56734  -92.02%  -17.90%
stefan           711341  69105  57402  -91.93%  -16.94%
table tennis     711341  69105  57469  -91.92%  -16.84%
tempete          711341  69105  56905  -92.00%  -17.65%
waterfall        711341  69105  58352  -91.80%  -15.56%
weather          711341  69105  56631  -92.04%  -18.05%
Average          711341  69105  57035  -91.98%  -17.47%



Figure 3-12: MAD curves of FS, LRQ and EFBLA for four clips. (a) The Akiyo Clip. (b) The Children Clip. (c) The Stefan Clip. (d) The Weather Clip.


Chapter 4

Power-Aware Algorithm and Architecture

This chapter presents a power-aware architecture based on subsample algorithms to perform graceful tradeoffs between power consumption and compression quality while the battery status changes [39–41]. As the available energy decreases, the algorithm raises the subsample rate to maximize battery lifetime. As shown in the experimental results, the proposed algorithm and architecture can dynamically operate at different power consumption modes with little quality degradation according to the remaining capacity of the battery pack.

This chapter is organized as follows. Sections 4.1 and 4.2 introduce the motivation and background of the power-aware paradigm. Sections 4.3 and 4.4 present the generic and content-based subsample algorithms in detail. Section 4.6 describes the proposed power-aware architecture and Section 4.7 shows the performance analysis. Finally, Section 4.8 concludes this work.


4.1 Motivation

Motion estimation (ME) has been widely recognized as the most critical part of many video compression applications, such as the MPEG standards and H.26x; it tends to dominate the computational load and hence the power requirements. With the increasing demand for battery-powered multimedia devices, an ME architecture that is flexible in both power consumption and compression quality is highly desirable. The requirement is driven by a user-centric perspective [42]. Basically, users have two thoughts on using portable devices. Sometimes, users might want extremely high video quality at the cost of reduced battery lifetime. At other times, users might want acceptable quality in exchange for extended battery lifetime.

This chapter, therefore, presents a novel power-aware ME architecture using a content-based subsample algorithm, which can adaptively perform tradeoffs between power consumption and compression quality as the battery status changes. The proposed architecture is driven by a content-based subsample algorithm that allows the architecture to work at different power consumption modes with acceptable quality degradation. Since the control mechanism and data sequences at different power consumption modes are the same in the architecture, the power-aware algorithm can switch power consumption modes very smoothly on the fly. The block diagram shown in Fig. 4-1 illustrates a typical application of the proposed power-aware ME architecture. The host processor monitors the remaining capacity of the battery pack and switches the power consumption modes. According to the power mode, the power-aware architecture sets the subsample rate and calculates the motion vector (MV) for motion compensation. Note that most portable multimedia devices, in practice, have a battery monitor unit and power management subroutines. Besides the power-aware motion estimation unit,


all the units marked with a gray background can also be designed with power-aware capability to make this portable system friendlier to battery usage. In this chapter, the thesis focuses on power-aware motion estimation based on content properties.

Many published papers have presented efficient algorithms for VLSI implementation of motion estimation, targeting either high performance or low power design. Yet, most of them cannot dynamically adapt the compression quality to different power consumption modes. Among these proposed algorithms, the Full-Search Block-Matching (FSBM) algorithm with the Sum of Absolute Differences (SAD) criterion is the most popular approach for motion estimation because of its considerably good quality. It is particularly attractive to those who require extremely high quality. Many types of architectures have been proposed for the implementation of FSBM algorithms [8, 11, 12, 15]. However, they require a huge number of comparison/difference operations and result in high computation load and power consumption. To reduce the computational complexity of FSBM, researchers have proposed various fast algorithms. They either reduce the search steps [17–19, 21, 43, 44] or simplify the calculation of the error criterion [13, 29, 34, 45]. By combining step reduction and criterion simplification, some researchers have proposed two-phase algorithms that balance complexity against quality [31, 32, 46]. They first use FSBM with a simplified matching criterion to generate candidate vectors and then select the best motion vector from these candidates with the SAD criterion. These fast-search algorithms have successfully improved the block-matching speed with little quality degradation and thus lead to low-power implementations. However, a low-power implementation is not necessarily a power-aware system, in that a power-aware system should adaptively modify its



Figure 4-1: The system block diagram of a portable, battery-powered multimedia device.


behavior with the change of power/energy status and balance the performance between quality and battery life [47]. The requirement for ME algorithms to be suitable for power-aware design is a high degree of scalability in performance tradeoffs. Unfortunately, the fast algorithms mentioned above do not meet this requirement.

The articles in [24, 48] present subsample algorithms that significantly reduce the computation cost with low quality degradation. The reduction of computation cost implies a saving in power consumption. Since the power consumption can be reduced by simply increasing the subsample rate, subsample algorithms have a high degree of scalability and are very suitable for a power-aware ME architecture. However, applying subsample algorithms in a power-aware architecture may suffer from the aliasing problem in the high-frequency band. The aliasing problem degrades the compression quality rapidly as the subsample rate increases. To alleviate the problem, we extend traditional subsample algorithms to a content-based algorithm, called the content-based subsample algorithm (CSA). In the algorithm, we first use edge extraction techniques to separate the high-frequency band from a macro-block and then subsample the low-frequency band only. Combining the edge pixels and subsample pixels, the algorithm generates a turn-on mask for the architecture to limit the switching activities of the processing elements (PEs) in a semi-systolic array. By doing so, we achieve significant power savings while keeping the quality degradation small as the subsample rate increases. Because the number of high-frequency pixels varies with different video clips, we use an adaptive control mechanism to set the threshold value for edge determination and keep the number of masked pixels stationary for a given power mode.
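As an illustration of how the CSA combines edge pixels and subsample pixels into a turn-on mask, a minimal sketch follows. The regular-grid subsampling of the low-frequency band is an assumption here; the thesis's actual subsample pattern may differ.

```python
def turnon_mask(edge_mask, subsample_rate):
    """CSA sketch: keep every edge pixel (high-frequency band), and
    subsample the remaining low-frequency pixels on a regular grid.
    A PE is enabled only where the mask is 1, so raising the
    subsample rate turns off more PEs and saves power."""
    n = len(edge_mask)
    mask = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if edge_mask[i][j]:
                mask[i][j] = 1          # edge pixel: always matched
            elif i % subsample_rate == 0 and j % subsample_rate == 0:
                mask[i][j] = 1          # subsampled low-frequency pixel
    return mask
```

Raising the rate shrinks only the low-frequency contribution, which is why the quality degrades little even at high subsample rates.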

The CSA can be used in most existing ME architectures by turning off PEs in accordance with the subsample rate. In this chapter, we will present a semi-systolic
