
Chapter 1 Introduction

1.5. Organization of this Dissertation

This dissertation is organized as follows. Chapter 2 proposes a rate-distortion bandwidth efficient motion estimation algorithm to address the intensive data access of motion estimation. The same chapter also proposes a bandwidth aware motion estimation algorithm that allocates memory bandwidth for motion estimation according to the available bandwidth constraints. For SVC, Chapter 3 proposes data efficient inter-layer prediction algorithms that increase data reuse between spatial layers and thus improve SVC system performance. Chapter 4 proposes a mode pre-selection algorithm that skips potentially ignorable prediction modes before fractional motion estimation, relieving its heavy computational load. To further improve motion estimation, Chapter 5 proposes a search range adjustment algorithm and a search range aspect ratio decision algorithm that consider both IME and FME information to improve coding performance. Finally, Chapter 6 gives conclusions and future work.


Chapter 2 RD Bandwidth Efficient Motion Estimation and Its Hardware Design with On-Demand Data Access


2.1. Introduction

Motion estimation (ME) contributes most of the coding efficiency in modern video encoders, but it is also the most computation-intensive and data-bandwidth-intensive component [9]. ME finds the best matching block by comparing the current coding macroblock (MB), a block of 16×16 pixels, with the search candidates within the search range.

A direct approach called full search, which checks all candidates, has been widely used in hardware implementations [17] because of its simplicity and regularity. This regularity makes it easy to fully reuse the overlapped search range data between different searches [16] and so reduce the required accesses, and several hardware architectures [18]-[20] take advantage of this overlap. However, the bandwidth requirement remains high because data loading is content independent. Other approaches, such as fast ME algorithms [21]-[27], reduce the computational complexity but do little to reduce the required bandwidth because of their irregular search patterns.

To deal with the data bandwidth problem, the works in [28]-[33] proposed various efficient ME architectures that reduce data bandwidth requirements. In [28], the authors proposed a modified MB processing order to further increase the data reuse of the Level C data reuse scheme [30]. For multiple reference frame motion estimation in H.264, [29] proposed a single reference frame, multiple current macroblocks (MBs) scheme to reuse the reference data. An alternative direction, introduced in [30], [31], reduces the data bandwidth requirement by decreasing the size of the search range centered at the motion vector predictor (MVP). In [30], the authors selected the search area by tracing the motion vectors. [31] used a two-step windowing approach to decide dynamically whether to load more reference data for motion estimation. In hardware design, [32], [33] proposed architectures that reduce the memory bandwidth overhead by adopting binary motion estimation, which executes the matching process on a generated bit map instead of on the pixels. However, these works still require large and constant bandwidth support from the system because they adopt the full search algorithm. A large bandwidth requirement increases both cost and power consumption. Besides, assuming constant bandwidth support is not practical for a modern complex system-on-a-chip, where processing tasks vary widely. In addition, these works do not consider how to use the available bandwidth efficiently or how to adapt the search process to it, which is vital to portable video applications such as video phones because of costly DRAM power consumption. As a result, the above bandwidth policy leads to one of two consequences: either the system is over-designed to meet the demand, or the design is left unchanged with insufficient bandwidth supply, which results in quality degradation or increased coding time.

To solve the above problems without these side effects, this paper presents an on-demand, data access efficient ME and its hardware realization under the rate-distortion (RD) framework.

Based on the introduced bandwidth-rate-distortion model, the proposed approach can predict and minimize the data bandwidth requirement and access the required data on demand while maximizing the rate-distortion performance within the available bandwidth constraint. On-demand data access reduces unnecessary data access and thus minimizes the search range buffer. The simulation results show that the proposed algorithm can effectively allocate and reduce the required data bandwidth with negligible quality loss. This efficiency also lowers the cost of the resulting hardware implementation, since its search range buffer is smaller than that of other designs.

The rest of this chapter is organized as follows. Section II first briefly reviews previous work and models the memory bandwidth of ME. Section III presents the proposed bandwidth aware ME algorithm, and Section IV shows the simulation results. Section V presents the hardware design and its implementation results to demonstrate the efficiency of our proposal. Finally, Section VI gives a conclusion.


2.2. Data Bandwidth Modeling for ME

2.2.1. Search Algorithms and Their Memory Bandwidth Modeling

In ME search algorithms, full search ME loads all pixels inside the search area from external memory and finds the best matching MB by exhaustively checking every candidate position. The data bandwidth per frame for full search ME (DBFS) can be calculated as follows.

(2-1)

where SRV and SRH indicate the search range in the vertical and horizontal directions, respectively, and #MBsframe refers to the number of MBs per frame.

Because of the high computational complexity and data access of full search ME, several fast algorithms [21]-[24] check a reduced set of candidate positions instead of checking exhaustively. This decreases the computational complexity and, with less data to load, lessens the bandwidth requirement. However, these fast algorithms suffer from irregular data access and complex data access control, which makes hardware realization difficult and data reuse poor.

In addition to reducing the number of check points, search range data can be reused to decrease the bandwidth requirement, since adjacent MBs share a large portion of overlapped search range, as shown in Fig. 2-1. A systematic analysis of data reuse in the full search algorithm was proposed in [28]. In that analysis, the Level C data reuse scheme shown in Fig. 2-1 is widely used in ME design because of its high data reuse. Hence, the data bandwidth per MB for full search with Level C reuse (DBFS_LevelC) can be computed as follows.

(2-2)

For example, with the QCIF image format and a ±8 search range, the data bandwidth requirements are 33 Kbytes and 99 Kbytes per frame for full search with and without Level C data reuse, respectively. Although full search with a data reuse scheme can greatly reduce the bandwidth requirement of ME, the requirement is still high for a large search window, which can occur with large video sizes. Besides, the search range for full search in hardware implementations is usually larger than that of other algorithms to maintain quality, since hardware regularity forces the search center to (0, 0) instead of the motion vector predictor [34]. For example, for a 480p frame size, the bandwidth is 3849.19 Kbytes per frame if a search range of ±64 is used.
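As a quick arithmetic check of the full search figure quoted above, the following sketch assumes that every MB loads one complete search window of (2·SR_H + 16) × (2·SR_V + 16) one-byte pixels with no reuse, where SR is the one-sided range (8 for ±8). The function name and this window formula are assumptions made for illustration. Under them the QCIF case gives 99 Kbytes per frame, matching the figure without data reuse. The Level C figure is not modeled here.

```python
MB_SIZE = 16  # macroblock width and height in pixels

def full_search_bytes_per_frame(width, height, sr_h, sr_v):
    """Bytes loaded per frame when every MB fetches its whole search window.

    Assumes one byte per pixel and no reuse between neighbouring MBs.
    sr_h and sr_v are one-sided search ranges (8 means +/-8).
    """
    mbs_per_frame = (width // MB_SIZE) * (height // MB_SIZE)
    window = (2 * sr_h + MB_SIZE) * (2 * sr_v + MB_SIZE)
    return window * mbs_per_frame

qcif = full_search_bytes_per_frame(176, 144, 8, 8)
print(qcif, qcif / 1024)  # 101376 bytes, i.e. 99.0 Kbytes
```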

Fig. 2-1 Illustration of data reuse scheme for ME (labels: search area of MB1; search area of MB2; non-overlapped area; one MB wide; data reused area)

2.2.2. Nearest Neighbours Search Algorithm and Its Bandwidth Modeling

To benefit from both the data reuse scheme and the search range reduction offered by the MVP, this paper adopts nearest neighbours (NN) ME [35], which has high data reuse between consecutive search steps and a small search range owing to the MVP.

Fig. 2-2 shows the concept of the nearest neighbours algorithm. It first calculates the MVP for the search process and the Sum of Absolute Differences (SAD) of five positions centered at the MVP (labeled 1). If the minimum SAD is located at the center position, the search is finished and the coordinate of the center position is set as the motion vector of the current coding MB. Otherwise, the position with the minimum SAD becomes the new search center and another three positions (labeled 2) are checked. This operation is repeated until the minimum SAD is located at the center or the search boundary is reached.
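The search loop described above can be sketched in a few lines. This is a minimal illustration rather than the implementation used in this work: frames are plain nested lists, the function names are hypothetical, SAD is used as the cost, and for brevity every step re-evaluates all four neighbours instead of only the three new positions. Candidates outside the search range or the reference frame are simply skipped.

```python
def sad(cur, ref, x, y, n=16):
    """Sum of absolute differences between cur (n x n) and the ref block at (x, y)."""
    return sum(abs(cur[j][i] - ref[y + j][x + i])
               for j in range(n) for i in range(n))

def nearest_neighbours_search(cur, ref, mb_x, mb_y, mvp, sr, n=16):
    """Return (mv, steps): the chosen motion vector and the number of center moves.

    mvp is the starting vector (dx, dy); sr is the one-sided search range.
    """
    h, w = len(ref), len(ref[0])

    def cost(mv):
        x, y = mb_x + mv[0], mb_y + mv[1]
        in_range = max(abs(mv[0]), abs(mv[1])) <= sr
        in_frame = 0 <= x <= w - n and 0 <= y <= h - n
        return sad(cur, ref, x, y, n) if in_range and in_frame else None

    center, best = mvp, cost(mvp)
    if best is None:
        raise ValueError("the starting vector must lie inside the search area")
    steps = 0
    while True:
        next_center = center
        for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            cand = (center[0] + dx, center[1] + dy)
            c = cost(cand)
            if c is not None and c < best:
                best, next_center = c, cand
        if next_center == center:      # minimum is at the center: stop
            return center, steps
        center = next_center
        steps += 1
```

Each center move shifts the window by one pixel, which is why only one new row or column of reference pixels is needed per step.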


With the above operations, Fig. 2-3 shows the data overlapping cases for different nearest neighbours search results. In this figure, the lighter pixels indicate the overlapped area and the dark pixel row or column indicates the additional pixels to be loaded. The figure shows that any two adjacent search positions share a large portion of data, which leads to highly regular data reuse in hardware implementation. Therefore, with the nearest neighbours search pattern, only one extra column or row of pixels has to be loaded for each further search step.

Fig. 2-2 Example of nearest neighbours search algorithm

Fig. 2-3 Data overlapping for different nearest neighbours search results

The bandwidth requirements of the nearest neighbours pattern are analyzed as follows. In the first step, 18×18 reference pixels for the first five search points are loaded from external memory to evaluate the SADs. If the minimum SAD is located at the center position, no more pixels are needed from external memory. Otherwise, if the minimum SAD is located at any one of the four corner positions, 18 additional pixels must be loaded from external memory for evaluation. Therefore, the data bandwidth per MB of the nearest neighbours search pattern (DBNN) can be formulated as follows.

$$DB_{NN} = (18 \times 18) + n \times 18 \qquad \text{(2-3)}$$

where 18×18 is the number of pixels required to compute the SADs of the first five positions and n is the number of remaining steps needed to find the best result. This models the data requirement of the nearest neighbours search pattern as a function of the step count n, so the data requirement of ME can be adjusted freely for different quality by changing n.
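The linear relation between steps and loaded bytes can be tabulated directly. The sketch below uses the (18×18)+n×18 form that the text applies later when computing DBBasic, and assumes one byte per pixel. The function names and the inverse helper are illustrative additions.

```python
FIRST_STEP_BYTES = 18 * 18   # window covering the five initial positions
EXTRA_STEP_BYTES = 18        # one extra row or column per later step

def nn_bytes_per_mb(n):
    """Bytes loaded for one MB: the first step plus n further steps."""
    if n < 0:
        raise ValueError("n must be non-negative")
    return FIRST_STEP_BYTES + n * EXTRA_STEP_BYTES

def nn_steps_within(budget_bytes):
    """Largest n whose load fits in budget_bytes, or -1 if the first step does not fit."""
    if budget_bytes < FIRST_STEP_BYTES:
        return -1
    return (budget_bytes - FIRST_STEP_BYTES) // EXTRA_STEP_BYTES

for n in (0, 2, 8):
    print(n, nn_bytes_per_mb(n))   # 324, 360 and 468 bytes
```

The inverse helper shows how a byte budget translates back into a step count, which is the direction the bandwidth allocation needs.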

2.3. Proposed Framework

2.3.1. RD Bandwidth Modeling for Video Contents and Search Steps

The SAD is commonly used as the similarity measure in ME because of its computational simplicity, but it fails to consider the coding rate of the motion vector. Therefore, the commonly used Rate-Distortion cost (RDCost) is adopted in this paper and is calculated as follows.

$$RDCost = SAD + \lambda_{Motion} \times R(MV - MVP) \qquad \text{(2-4)}$$

where λMotion is the Lagrange multiplier and the term R(MV-MVP) represents the number of bits for coding the difference between the motion vector (MV) and the predicted motion vector. Adopting the RDCost allows the search to account for both the distortion and the motion vector rate.
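To make the cost concrete, the sketch below evaluates the usual Lagrangian form SAD + λMotion × R(MV-MVP). Two things here are assumptions for illustration: the rate R is approximated by the length of the signed Exp-Golomb code of each vector difference component, and the λ value in the example is arbitrary rather than taken from JM.

```python
def se_golomb_bits(v):
    """Length in bits of the signed Exp-Golomb code for integer v."""
    code_num = 2 * v - 1 if v > 0 else -2 * v   # signed-to-unsigned mapping
    return 2 * (code_num + 1).bit_length() - 1

def mv_rate_bits(mv, mvp):
    """Bits for the horizontal and vertical motion vector differences."""
    return se_golomb_bits(mv[0] - mvp[0]) + se_golomb_bits(mv[1] - mvp[1])

def rd_cost(sad, mv, mvp, lambda_motion):
    """Distortion plus lambda-weighted motion vector rate."""
    return sad + lambda_motion * mv_rate_bits(mv, mvp)

# A candidate two units right and one unit down from the predictor:
print(mv_rate_bits((2, 1), (0, 0)))        # 5 + 3 = 8 bits
print(rd_cost(1000, (2, 1), (0, 0), 4.0))  # 1000 + 4.0 * 8 = 1032.0
```

A candidate far from the MVP has to reduce the SAD by more than λ times its extra bits to win, which is what pulls the search toward the predictor.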

Fig. 2-4 shows the relationship between the rate-distortion performance improvement and the search steps of the nearest neighbours search. The vertical axis is the percentage of RDCost improvement and the horizontal axis is the number of steps n in nearest neighbours pattern ME. This simulation uses the JM11 reference software [36] with a quantization parameter (QP) of 28, a ±16 search range, and only the 16x16 block size. Variable block sizes are omitted for simplicity because, in a SAD tree based hardware realization, the SADs of the smaller block sizes can be obtained from the data of the 16x16 block size, so they do not affect the data bandwidth. The RDCost improvement is derived as follows.

(2-5)

where RDCost0 and RDCostn indicate the RDCosts after the first step and after n+1 steps of the nearest neighbours search, respectively. These figures show that ∆RDCost increases significantly in the first few steps but only slightly after more steps. For example, for the Akiyo sequence in Fig. 2-4(a), ∆RDCost is stable after three steps, whereas for the Football sequence in Fig. 2-4(b), sixteen steps are required for ∆RDCost to stabilize. Therefore, it is unnecessary to search too many steps, since the improvement is negligible after a certain number of steps. This also saves data bandwidth if the required number of steps can be properly predicted.

From Fig. 2-4, we can obtain three properties. First, the convergence speed is content dependent. For example, a high motion sequence such as Football converges more slowly than a slow motion sequence such as Akiyo, because a high motion sequence needs more search steps to find the best result. Second, the magnitude of ∆RDCost is also content dependent. For instance, the ∆RDCost of the Akiyo sequence is smaller than that of the Football sequence, since most MBs in a low motion sequence have their best MV close to the MVP. Third, the magnitude of RDCost significantly influences the ∆RDCost convergence speed. That is, sequences with smaller RDCost, like Akiyo, have faster ∆RDCost convergence than sequences with larger RDCost, like Football.

By combining the three properties above, we can use the RDCost and ∆RDCost of the first few steps to predict the data bandwidth requirement of the ME process while maintaining the best rate-distortion performance. The convergence speed influenced by ∆RDCost is therefore modeled by the following equation.

(2-6)

where α is the RDCost controlling factor, which is related to the initial RDCost and defined by Eq.(2-7), and β is the bandwidth adjustment factor that trades off rate-distortion performance against data bandwidth requirements. The larger β is, the less data bandwidth is needed, but the more the rate-distortion performance degrades. In this paper β is empirically set to 0.1 to achieve the best trade-off.

Furthermore, since different RDCost magnitudes may result in different convergence speeds, as mentioned in the last property, RDCost0 is used to derive the factor α by the following equation.

(2-7)

γ is empirically set to 0.001. Through Eq.(2-6), the relationship between RDCost improvement and search steps can be described; its fitting results are shown as the curves labeled “Model” in Fig. 2-4. Finally, the number of steps needed for the nearest neighbours search while keeping the rate-distortion performance degradation acceptable is decided by the following equation, which also considers the maximum allowed search range size.

(2-8)

where SRMax indicates the maximum size of the search range. After the step number n has been estimated by Eq.(2-8), the allocated data bandwidth for the current MB (DBAllocated_MB_i) can be obtained as follows.

(2-9)

In summary, we can model the relationship between rate-distortion performance and bandwidth by the search steps, the initial RDCost, and the subsequent RDCost improvement according to Eq.(2-6). These two RDCost terms faithfully reflect the characteristics of the video content. With this model, we can further determine the optimal number of steps to maximize the rate-distortion performance under bandwidth constraints.

It is worth mentioning that any other search algorithm can also be used in the optimization framework of this paper, provided that it has high data reuse between consecutive search steps; examples include full search, logarithmic search [37], and one-at-a-time search [38]. To fit a different search algorithm, Eq.(2-3) and Eq.(2-6)-(2-9) should be remodeled. For example, the one-at-a-time search algorithm checks three candidates in the first step, so 16×18 pixels are loaded from external memory. For each further search candidate, 16 pixels on average must be loaded for cost evaluation. Therefore, Eq.(2-3) can be rewritten as (16×18)+n×16, and Eq.(2-6)-(2-9) should be remodeled accordingly. Eq.(2-6)-(2-9) can be remodeled with a curve fitting tool once the relationship between rate-distortion cost and search step is derived.

2.3.2. Proposed Bandwidth Aware ME Algorithm

Fig. 2-5 shows the flowchart of the proposed bandwidth aware ME algorithm. It operates as follows. First, the available bandwidth for the current frame is initialized. Then, the nearest neighbours search is executed for step0, step1, and step2 to obtain ∆RDCost1 and ∆RDCost2. If the minimum RDCost is located at the center position, the search is finished. Otherwise, ∆RDCost1 and ∆RDCost2 from the previous steps are used to calculate the data bandwidth requirement of the current MB. Afterwards, more steps are applied to search for the best result according to the allocated data bandwidth. After the best result is found, the available data bandwidth pool is updated for further use. The details of the proposed algorithm are described as follows.


Fig. 2-4 The relationship between the steps n of nearest neighbours pattern and RDCost and modeled results for sequences of (a) Akiyo, (b) Football, and (c) Foreman

Fig. 2-5 Flowchart of proposed bandwidth aware ME algorithm

2.3.2.1. Bandwidth Initialization

In the first step, the total data bandwidth (DBTotal) for a certain encoding period is calculated from the available system bandwidth as follows.

(2-10)

(2-11)

where BWBus is the data transmission rate of the bus (Bytes/Second), FR is the frame rate, and DBUsed is the used data bandwidth. In this paper the encoding period is defined as the group of pictures (GOP), and GOP refers to the number of frames per coding group. The total data bandwidth, DBTotal, thus stands for the available bandwidth per GOP.

However, the bandwidth requirement for basic quality coding must be subtracted to obtain the bandwidth available for ME. Thus, the available data bandwidth (DBAvailable) is as follows.

(2-12)

where DBBasic is the data bandwidth required for step0, step1, and step2 of the nearest neighbours search algorithm. It can be calculated by Eq.(2-3), so that DBBasic = (18×18)+2×18 = 360 bytes.
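The initialization can be illustrated with a small calculation. The exact forms used below are assumptions, not the equations of this section: the GOP budget is taken as the bytes the bus delivers during one GOP (BWBus / FR × GOP) minus the bytes already used, and the basic steps are reserved once per MB for every frame of the GOP. The bus rate, frame rate, and GOP length in the example are hypothetical.

```python
def gop_budget_bytes(bw_bus, frame_rate, gop_frames, db_used=0):
    """Bytes the bus can deliver during one GOP, minus bytes already spent."""
    return bw_bus * gop_frames // frame_rate - db_used

def available_for_extra_steps(db_total, mbs_per_frame, gop_frames, db_basic=360):
    """Budget left after reserving the basic steps (step0 to step2) for every MB."""
    return db_total - db_basic * mbs_per_frame * gop_frames

# Hypothetical setting: 3,000,000 bytes/s bus, 30 frames/s, 15-frame GOP, QCIF (99 MBs).
total = gop_budget_bytes(3_000_000, 30, 15)
print(total, available_for_extra_steps(total, 99, 15))  # 1500000 965400
```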

2.3.2.2. Bandwidth Allocation for Current MB

After the initialization step, the data bandwidth usage, DBMB_i, for the current MB i is allocated as follows.

(2-13)

where DBAllocated_MB_i stands for the data bandwidth allocated to the current i-th MB by Eq.(2-6)-(2-9) and MBRemained is the number of un-encoded MBs. In this allocation, the bandwidth allocated by Eq.(2-9) is adopted only if it is smaller than the average remaining data bandwidth per MB. Otherwise, the data bandwidth usage is restricted to the average remaining data bandwidth per MB. This restriction avoids over-allocation. After bandwidth allocation for the current MB, the allocated bandwidth is converted to the corresponding number of steps in the nearest neighbours ME algorithm. The allocated steps are calculated as follows.

(2-14)

In the following step, the nearest neighbours search is applied according to the allocated steps.
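The capping rule described above amounts to taking the smaller of the model's request and the average budget left per un-encoded MB. The sketch below illustrates it under two assumptions: the average is the available pool divided by the number of remaining MBs, and the allocation counts only bytes beyond the basic steps, so that each 18 bytes pays for one further step. The function names are hypothetical.

```python
STEP_BYTES = 18  # one extra row or column of reference pixels per search step

def allocate_for_mb(db_allocated_mb, db_available, mbs_remaining):
    """Cap the model's request at the average budget left per un-encoded MB."""
    if mbs_remaining <= 0:
        raise ValueError("mbs_remaining must be positive")
    average_left = db_available // mbs_remaining
    return min(db_allocated_mb, average_left)

def steps_from_bytes(db_mb):
    """Whole extra search steps payable with db_mb bytes."""
    return max(db_mb, 0) // STEP_BYTES

# With 9900 bytes left for 99 MBs, the average is 100 bytes per MB.
granted = allocate_for_mb(180, 9900, 99)
print(granted, steps_from_bytes(granted))  # 100 5
granted = allocate_for_mb(54, 9900, 99)
print(granted, steps_from_bytes(granted))  # 54 3
```

Rounding the step count down keeps the search within the granted bytes.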


2.3.2.3. Data Bandwidth Update

The allocated data bandwidth might not be fully used because of the early termination condition of the search algorithm, so the unused bandwidth can be recycled for further use. An additional data bandwidth update stage is therefore added to the system. The data bandwidth update operations are as follows.

(2-15)

(2-16)

(2-17)

where DBRemained_MB_i refers to the remaining data bandwidth for the current MB i and DBUsed_MB_i is the data bandwidth used after executing the nearest neighbours search.
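The effect of recycling can be seen in a toy run over a few MBs. This sketch assumes that only the bytes actually loaded leave the pool, so whatever an early-terminating MB does not use is shared among the MBs still to be coded. The function name, the (requested, needed) input format, and the numbers are illustrative.

```python
def encode_frame(db_available, mbs):
    """Simulate allocation and pool update for one frame.

    mbs is a list of (requested, needed) byte counts in coding order:
    requested is the model's allocation, needed is what the search would
    consume before terminating early. Returns (grants, bytes left in pool).
    """
    grants = []
    for i, (requested, needed) in enumerate(mbs):
        remaining = len(mbs) - i
        granted = min(requested, db_available // remaining)
        used = min(granted, needed)
        db_available -= used          # only bytes actually loaded leave the pool
        grants.append(granted)
    return grants, db_available

# The first MB stops early after 36 bytes, so later MBs get a larger cap.
print(encode_frame(300, [(180, 36), (180, 180), (180, 180)]))
# ([100, 132, 132], 0)
```

Without recycling, every MB in this example would be capped at 100 bytes. With it, the 64 bytes the first MB leaves unused raise the cap of the following MBs to 132 bytes.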


2.4. Simulation Results

Table 2-I and Table 2-II show the simulation environment settings and test sequences. The simulation uses two scenarios to demonstrate the efficiency of the proposed algorithm. The first scenario shows the data bandwidth savings without a data bandwidth constraint. In this scenario, we use the search range to represent BWBus for simplicity and compare against full search with the Level C data reuse scheme, since it achieves the highest data reuse [27],[39],[40]. The second scenario demonstrates the rate-distortion performance under a data bandwidth constraint. This scenario compares three algorithms including full search (FS),
