
Chapter 2 Background

2.3 Rate Control for H.264

2.3.2 Terminology

A. Definition of Basic Unit

Suppose that a frame is composed of $N_{mbpic}$ macroblocks. A basic unit is defined to be a group of contiguous macroblocks composed of $N_{mbunit}$ macroblocks, where $N_{mbunit}$ is a fraction of $N_{mbpic}$. Denote the total number of basic units in a frame by $N_{unit}$, which is computed by:

$N_{unit} = N_{mbpic} / N_{mbunit}$ (6)

Examples of a basic unit include a macroblock, a slice, a field, or a frame.

B. A Fluid Flow Traffic Model

Fig. 2-3 Fluid Flow Traffic Model

We shall now present a fluid flow traffic model to compute the target bits for the current coding frame. Let $N_{gop}$ denote the total number of frames in a group of pictures (GOP), and let $n_{i,j}$ ($i = 1, 2, \ldots$; $j = 1, 2, \ldots, N_{gop}$) denote the jth frame in the ith GOP.
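The equation of the model is garbled in this copy. As a minimal sketch of how such a fluid flow buffer update is commonly written, assuming the standard form in which the bits of each coded frame flow into the buffer while the channel drains it at $u/F_r$ bits per frame (the variable names below are illustrative):

```python
def update_buffer_occupancy(occupancy, actual_bits, bandwidth, frame_rate):
    """Fluid-flow model: bits flow in as each frame is coded and drain
    out at the channel rate (bandwidth / frame_rate bits per frame)."""
    return occupancy + actual_bits - bandwidth / frame_rate

# Example: a 512 kbps channel at 30 fps drains about 17067 bits per frame.
occ = 0.0
for bits in (24000, 15000, 12000):   # actual bits of three coded frames
    occ = update_buffer_occupancy(occ, bits, 512_000, 30.0)
    print(f"buffer occupancy: {occ:.0f} bits")
```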

C. A Linear Model for MAD Prediction

We now introduce a linear model to predict the MAD of the current basic unit in the current frame from the actual MAD of the basic unit in the same position of the previous frame. Suppose that the predicted MAD of the current basic unit in the current frame and the actual MAD of the basic unit in the same position of the previous frame are denoted by

$MAD_{cb}$ and $MAD_{pb}$, respectively. The linear prediction model is then given by

$MAD_{cb} = a_1 \times MAD_{pb} + a_2$ (8)

where $a_1$ and $a_2$ are the two coefficients of the prediction model. The initial values of $a_1$ and $a_2$ are set to 1 and 0, respectively, and they are updated after coding each basic unit.

The linear model (8) is proposed to resolve the chicken-and-egg dilemma: the quantization parameter must be chosen before the actual MAD of the current basic unit can be measured, yet a complexity measure such as the MAD is needed to choose the quantization parameter.
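A minimal sketch of model (8) and its per-basic-unit update, assuming an ordinary least-squares refit over a sliding window of recent (previous MAD, actual MAD) pairs; the exact update rule of the reference scheme may differ:

```python
class MADPredictor:
    """Linear MAD predictor of model (8): MAD_cb = a1 * MAD_pb + a2.
    a1 and a2 start at 1 and 0 and are refit after each coded basic unit."""

    def __init__(self, window=20):
        self.a1, self.a2 = 1.0, 0.0
        self.window = window
        self.samples = []        # (MAD of co-located BU, actual MAD of current BU)

    def predict(self, mad_prev):
        return self.a1 * mad_prev + self.a2

    def update(self, mad_prev, mad_actual):
        self.samples.append((mad_prev, mad_actual))
        self.samples = self.samples[-self.window:]
        n = len(self.samples)
        sx = sum(x for x, _ in self.samples)
        sy = sum(y for _, y in self.samples)
        sxx = sum(x * x for x, _ in self.samples)
        sxy = sum(x * y for x, y in self.samples)
        denom = n * sxx - sx * sx
        if denom != 0:           # refit a1, a2 by ordinary least squares
            self.a1 = (n * sxy - sx * sy) / denom
            self.a2 = (sy - self.a1 * sx) / n
```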

D. HRD Consideration

In order to place a practical limit on the size of the decoder buffer, a lower bound and an upper bound for the target bits of each frame are determined by considering the hypothetical reference decoder (HRD) [26]. Compliant encoders must generate bitstreams that meet the requirements of the HRD. The lower and upper bounds for the frame $n_{i,j}$ are denoted by $L(n_{i,j})$ and $U(n_{i,j})$, respectively. HRD conformance is ensured if the actual frame size always lies within the range $[L(n_{i,j}), U(n_{i,j})]$.

Let $t_r(n_{i,j})$ denote the removal time of the jth frame in the ith GOP. Also let $b(t)$ be the number of bits equivalent to a time $t$, with the conversion factor being the buffer arrival rate [40]. The initial values of the upper and lower bounds are given first, and the bounds for subsequent frames are then updated iteratively.

2.3.3 Overview of the original H.264 Rate Control Scheme

With the concept of a basic unit and models (7) and (8), the steps of the H.264 rate control scheme are as follows:

1. Compute a target bit budget for the current frame by using the fluid traffic model (7) and bound it by the HRD constraints.

2. Predict the MAD of the current basic unit by the linear model (8), using the actual MAD of the basic unit in the co-located position of the previous frame.

3. Allocate the remaining bits to all non-coded basic units in the current frame by function (11), where T is the number of bits allocated to the current frame, $MAD_i$ is the predicted MAD of the ith basic unit of the frame, MINVALUE is a constant, and K is the total number of basic units.

4. Compute the quantization parameter by using the quadratic R-D model (5).

5. Perform RDO for each macroblock in the current basic unit using the quantization parameter derived in step 4.
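As a worked illustration of step 4, assuming the familiar quadratic form $T = X_1 \cdot MAD/Q + X_2 \cdot MAD/Q^2$ for model (5) (the model itself is not reproduced in this excerpt), the quantization step Q is obtained by solving a quadratic in 1/Q:

```python
import math

def qstep_from_quadratic_model(target_bits, mad, x1, x2):
    """Solve T = X1*MAD/Q + X2*MAD/Q^2 for the quantization step Q,
    assuming the familiar quadratic R-D model form for model (5)."""
    if x2 == 0.0:                 # degenerate case: linear model T = X1*MAD/Q
        return x1 * mad / target_bits
    # Quadratic in (1/Q): X2*MAD*(1/Q)^2 + X1*MAD*(1/Q) - T = 0
    a, b, c = x2 * mad, x1 * mad, -target_bits
    inv_q = (-b + math.sqrt(b * b - 4 * a * c)) / (2 * a)
    return 1.0 / inv_q

print(qstep_from_quadratic_model(target_bits=8000, mad=4.0, x1=1500.0, x2=2500.0))
```

In H.264 the continuous quantization step obtained this way is then mapped to the nearest integer quantization parameter.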

2.3.4 GOP Layer Rate Control

In this layer, we need to compute the total number of remaining bits for all non-coded frames in each GOP and to determine the starting quantization parameter of each GOP. At the beginning of the GOP, the total number of bits allocated for the ith GOP is computed as follows:

$B_i = \frac{u}{F_r} \times N_{gop} - B_c(n_{i-1,N_{gop}})$

where $u$ is the available channel bandwidth, $F_r$ is the frame rate, and $B_c(n_{i-1,N_{gop}})$ is the actual buffer occupancy after coding the last frame of the previous GOP.

The starting quantization parameter of the first GOP is a predefined quantization parameter QP0. The I-frame and the first P-frame of the GOP are coded by QP0.

QP0 is predefined based on the available channel bandwidth and the GOP length.

Normally, a small QP0 should be chosen if the available channel bandwidth is high and a large QP0 should be used if it is low.

The starting quantization parameter of the other GOPs, $QP_{st}$, is computed from the sum of the quantization parameters of all P frames in the previous GOP. As with QP0, $QP_{st}$ is adaptive to the GOP length and the available channel bandwidth.

2.3.5 Frame Layer Rate Control

The frame layer rate control scheme consists of two stages: pre-encoding and post-encoding.

2.3.5.1. Pre-Encoding Stage

A. Quantization parameters of B frames

Since B frames are not used to predict any other frames, their quantization parameters can be greater than those of the adjacent P or I frames, so that bits can be saved for the I and P frames. On the other hand, to maintain smooth visual quality, the difference between the quantization parameters of two adjacent frames should not be greater than 2.

Suppose that the number of successive B frames between two P frames is L and the quantization parameters of the two P frames are QP1 and QP2, respectively. The quantization parameter of the ith B frame is calculated according to the following two cases:

Case 1: L = 1. In other words, there is only one B frame between the two P frames, and its quantization parameter is computed directly from QP1 and QP2.

Case 2: L > 1. In other words, there is more than one B frame between the two P frames. The quantization parameter of the ith B frame is computed from QP1 and an offset that grows with i, where α is the difference between the quantization parameter of the first B frame and QP1. The value of α depends on the difference between QP2 and QP1 and is adjusted where the video sequence switches from one GOP to another.
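The case formulas themselves are lost in this copy. Purely as an illustration of the two cases and the smoothness rule above (not the exact formulas of the reference scheme), a linear interpolation with a clipped step could look like:

```python
def b_frame_qp(i, L, qp1, qp2):
    """Illustrative B-frame QP between two P frames with QPs qp1 and qp2:
    interpolate across the L B frames, then clip so the QP of the ith
    B frame never drifts more than 2 per frame away from qp1.
    Not the exact case formulas of the reference scheme."""
    if L == 1:
        return (qp1 + qp2 + 2) // 2      # single B frame: rounded-up average
    alpha = (qp2 - qp1) / (L - 1)        # per-frame step between the two P QPs
    qp = round(qp1 + alpha * i)
    low, high = qp1 - 2 * i, qp1 + 2 * i # at most 2 per adjacent frame
    return max(low, min(high, qp))
```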

B. Quantization parameters of P frames

The quantization parameters of P frames are computed via the following two steps:

Step 1 Determine a target bit for each P frame.

Step 1.1 Determination of target buffer occupancy.

We predefine a target buffer level for each frame according to the frame sizes of the first I frame and the first P frame, and the average complexity of previously coded frames. The function of the target buffer level is to compute a target bit budget for each P frame, which is then used to compute the quantization parameter. Since the quantization parameter of the first P frame is given at the GOP layer, we only need to

predefine target buffer levels for other P frames in each GOP.

After coding the first P frame in the ith GOP, we reset the initial value of the target buffer level to the current actual buffer fullness. The target buffer level for the subsequent P frames is determined iteratively by Equation (19), where $\bar{W}_p$ and $\bar{W}_b$ are the average complexity weights of P pictures and B pictures, respectively.

In the case that there is no B frame between two P frames, Equation (19) can be simplified. If the actual buffer fullness always matches the predefined target buffer level exactly, it is ensured that each GOP uses its own bit budget. However, since the rate-distortion (R-D) model and the MAD prediction model are not accurate [18][19], there usually exists a difference between the actual buffer fullness and the target buffer level. We therefore compute a target bit budget for each frame to reduce this difference.

Step 1.2 Microscopic Control (target bit rate computation).

The target bits allocated to the jth frame in the ith GOP are determined from the target buffer level, the frame rate, the available channel bandwidth, and the actual buffer occupancy:

$\tilde{T}(n_{i,j}) = \frac{u(n_{i,j})}{F_r} + \gamma \left( T_{bl}(n_{i,j}) - B_c(n_{i,j}) \right)$

where $u(n_{i,j})$ is the available channel bandwidth, $F_r$ is the frame rate, $T_{bl}(n_{i,j})$ is the target buffer level, $B_c(n_{i,j})$ is the actual buffer occupancy, and $\gamma$ is a constant weighting factor.

The number of remaining bits should also be considered when the target bits are computed. If a frame is complex and requires excessive bits, more bits should be assigned to it; a second estimate $\hat{T}(n_{i,j})$ is therefore derived from the remaining bits. The final target is a weighted combination of $\tilde{T}(n_{i,j})$ and $\hat{T}(n_{i,j})$.
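A minimal sketch of this microscopic control, assuming the common form in which a buffer-driven estimate and a remaining-bits estimate are blended; `gamma` and `beta` are assumed weighting constants, not values taken from the text:

```python
def frame_target_bits(bandwidth, frame_rate, target_level, occupancy,
                      remaining_bits, remaining_frames, gamma=0.5, beta=0.5):
    """Illustrative target-bit computation in the spirit of step 1.2:
    t_buffer steers the buffer toward the target level, t_remaining
    spreads the remaining budget over the non-coded frames, and the
    final target is their weighted combination."""
    t_buffer = bandwidth / frame_rate + gamma * (target_level - occupancy)
    t_remaining = remaining_bits / max(remaining_frames, 1)
    return beta * t_remaining + (1.0 - beta) * t_buffer
```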

Step 2 Compute the quantization parameter and perform RDO.

The MAD of the current P frame is predicted by the linear model (8) using the actual MAD of the previous P frame. Then, the quantization parameter $\hat{Q}_{pc}$ corresponding to the target bits is computed by using the quadratic model (5).

The quantization parameter is then used to perform RDO for each macroblock in the current frame.

2.3.5.2. Post-Encoding Stage

There are three major tasks in this stage: updating the parameters $a_1$ and $a_2$ of the linear model (8), updating the parameters $X_1$ and $X_2$ of the quadratic R-D model (5), and determining the number of frames that need to be skipped.

2.3.6 Basic Unit Layer Rate Control

When the basic unit is smaller than a frame (e.g., a slice or a group of macroblocks), an additional basic unit layer rate control should be added to the scheme.

As at the frame layer, we first determine the target bits for each P frame; the process is the same as that at the frame layer. The bits are then allocated to each basic unit. First, the MADs of all non-coded basic units in the current frame are predicted by the linear model (8) using the actual MAD of the basic unit in the same position of the previous frame, and the remaining bits are allocated to all non-coded basic units in the current frame by function (11) using these predicted MADs.

Then, we compute the quantization parameter of the current basic unit by using the quadratic R-D model (5), considering the following three cases (a sketch of this decision follows the list):

Case 1: The quantization parameter of the first basic unit in the current frame is set to the average of the quantization parameters of all basic units in the previous frame.

Case 2: If the number of remaining bits for all non-coded basic units in the current frame is less than zero, the quantization parameter should be greater than that of the previous basic unit.

Case 3: Otherwise, we compute the quantization parameter by using the quadratic model.
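The three cases map directly to a small decision routine. A sketch, where the +1 increment in Case 2 is an assumed value (the text only requires a QP greater than that of the previous basic unit):

```python
def basic_unit_qp(bu_index, prev_frame_avg_qp, prev_bu_qp, remaining_bits,
                  qp_from_quadratic_model):
    """Decision routine for the three cases above; qp_from_quadratic_model
    is a callable wrapping the quadratic R-D model (5)."""
    if bu_index == 0:                 # Case 1: first basic unit of the frame
        return prev_frame_avg_qp
    if remaining_bits < 0:            # Case 2: budget overspent, coarsen
        return prev_bu_qp + 1         # assumed increment; text only says "greater"
    return qp_from_quadratic_model()  # Case 3: use the quadratic model
```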

Finally, the RDO process and the updating of the linear model and quadratic model parameters are performed in the same way as at the frame layer.

2.4 Bit Allocation Strategy

In the previous section, we introduced the rate control strategy in H.264. Many other schemes have been proposed to improve it.

Pan et al. [28] proposed a new scheme for the bit allocation of each P frame to further improve the perceptual quality of the reconstructed video. A new least-mean-square estimation method for the R-D model parameters was developed by Ngan et al. [29]. However, these target bit estimation schemes, an important factor in determining the quantization parameter (QP), distribute bits to every basic unit equally without considering the complexity of the frame, which results in poor target bit estimation across different frames.

In [30][31], Ling et al. proposed a modified algorithm that uses a more accurate frame complexity measure to allocate bits. Since the MAD predicted by the linear model (8) is not very accurate, Yu et al. [32] used a measure called the motion complexity of the frame to distribute more bits to high-motion scenes. However, these methods only try to allocate more bits to complex frames, which merely yields a generally better quality for the whole frame.

Since the human visual system (HVS) is more sensitive to moving regions, it is worthwhile to sacrifice the quality of the background regions while enhancing that of the moving regions. Some research works on region/content-based rate control have been reported [33][34]. They adopt a heuristic approach to decide the quantization parameters for different regions in a frame: the region of interest (ROI) obtains a finer quantizer, and a coarser quantizer is used for non-ROI regions. These methods [33][34] simply set the quantizers to constants and do not take the content of each region into consideration, which may cause improper QPs and an unreasonable number of bits for different regions. Some improved algorithms therefore try to adjust these factors adaptively. Lai et al. [35] proposed a scheme that uses a region-weighted rate-distortion model to calculate different QPs for different regions. Sun et al. [36] also proposed a scheme that allocates bits to the foreground and background by utilizing a weighting function for the different regions. However, these algorithms [33]-[36] only use fixed values or a simple region-based weighting scheme to assign quantization parameters to these regions.

In [37][38], algorithms that take account of the size, motion, and priority of the foreground and background regions have been proposed. However, these methods adjust the quality of the foreground and background by treating the whole foreground as one part. Since there may be multiple objects in the foreground region, we propose an algorithm that utilizes the features of the different objects to further adjust the quality of these object regions.

Chapter 3 Motion-based Object Segmentation and Feature-based Bit Allocation Scheme

In this chapter, we present our methods for video object segmentation and rate control. In section 3.1, we first go through the whole scheme and give a quick overview. In section 3.2, we present the object segmentation algorithm, and in section 3.3, the bit allocation strategy for the background and foreground objects.

3.1 Overview

Our proposed scheme contains two parts: a video object segmentation part and a bit allocation part. Since we focus on uncompressed video input sources, the object segmentation algorithm is only used for inter-coded frames. In the beginning, we use a multi-resolution algorithm to find the motion vectors. At the coarsest level, we establish an object mask and an object set using the coarse motion vectors generated by the motion estimation module. As the multi-resolution algorithm refines the motion vectors at each finer level, we also use these finer motion vectors to update the object mask and object set. The object set is then used by the bit allocation module. The bit allocation strategy uses the information about the objects to judge the importance of the foreground objects and the background, and different coding bit budgets are then allocated to these regions to maintain the visual quality of the foreground objects. The flow of the whole system is illustrated in Fig. 3-1.

Fig. 3-1 System Overview

3.2 Motion-based Video Segmentation Algorithm

The video segmentation algorithm directly takes the raw video data as input to segment the object regions and extracts the object mask for subsequent processing. A multi-resolution pyramid structure is adopted to find motion vectors and to segment objects by utilizing the motion vectors iteratively. In section 3.2.1, we present the multi-resolution motion estimation algorithm, and in section 3.2.2, the object localization algorithm. The algorithms for updating the object regions and for the morphological operation are presented in sections 3.2.3 and 3.2.4, respectively.

3.2.1. Multi-Resolution Motion Estimation

To reduce the computational load of segmentation, a multi-resolution motion estimation algorithm is applied. The multi-resolution algorithm is chosen for its pyramid structure, robustness, and improvements over one-level schemes. Since motion clustering is time-consuming, we can utilize the iterative pyramid structure to decrease the complexity by generating a rough mask at the coarsest level and refining it at each finer level.

In the following, we will present the details of the multi-resolution motion estimation scheme that has been used in our system.

Fig. 3-2 Multi-Resolution frame structure

3.2.1.1 Multi-Resolution Frame Structure

The multi-resolution motion estimation we apply is a simple method. First we decompose the input frame into a three-layer pyramid by the following sub-sampling function:

$I_k^{l+1}(i,j) = \frac{1}{4} \sum_{m=0}^{1} \sum_{n=0}^{1} I_k^{l}(2i+m,\, 2j+n)$ (24)

where $I_k^{l+1}(i,j)$ represents the intensity value at position $(i,j)$ of the kth frame at level $l+1$. The number of pixels at the next upper level is reduced to one fourth of that at the lower level. The multi-resolution frame structure is illustrated in Fig. 3-2.

The MB size becomes 16 × 16, 8 × 8, and 4 × 4 at levels 0, 1, and 2, respectively.

The sum of absolute differences (SAD) is widely used as the matching criterion for finding the best motion vector within a given search range.
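A short sketch of the pyramid of equation (24) together with the SAD criterion, assuming grayscale frames stored as NumPy arrays:

```python
import numpy as np

def build_pyramid(frame, levels=3):
    """Three-level pyramid of equation (24): each upper level averages
    2x2 blocks, so the pixel count drops to one fourth per level."""
    pyramid = [frame.astype(np.float64)]
    for _ in range(levels - 1):
        f = pyramid[-1]
        h, w = (f.shape[0] // 2) * 2, (f.shape[1] // 2) * 2
        f = f[:h, :w]
        pyramid.append((f[0::2, 0::2] + f[0::2, 1::2] +
                        f[1::2, 0::2] + f[1::2, 1::2]) / 4.0)
    return pyramid                      # level 0 (full) .. level 2 (coarsest)

def sad(block_a, block_b):
    """Sum of absolute differences, the matching criterion."""
    return float(np.abs(block_a - block_b).sum())
```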

3.2.1.2 Motion Search Framework

1) Search at Level 2: We choose two candidates, $\{MV_1^{(1)}, MV_2^{(1)}\}$, based on the spatial correlation in the motion vector fields as well as the minimum SAD, and employ them as initial search centers at level 1. $MV_1^{(1)}$, the vector having the minimum SAD, is found by a full search within a search range $SR_2$ that is derived from the search range $w$ predefined by the encoder. $MV_2^{(1)}$ is predicted from adjacent motion vectors at level 0 via a component-based median predictor.

2) Search at Level 1: Local searches are performed around the two candidates in order to find a motion vector candidate $MV^{(0)}$ for the search at level 0.

3) Search at Level 0: The final motion vector is found by a local search around $MV^{(0)}$.
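A compact sketch of this coarse-to-fine search, reusing `build_pyramid` and `sad` from the previous sketch; the local-search radii and the omission of the median-predictor candidate $MV_2^{(1)}$ are simplifications, not the exact parameters of the scheme:

```python
def local_search(ref, cur, cy, cx, center, radius, bs):
    """Best (dy, dx) around `center` for the bs-by-bs block at (cy, cx)."""
    best, best_mv = float("inf"), center
    block = cur[cy:cy + bs, cx:cx + bs]
    for dy in range(center[0] - radius, center[0] + radius + 1):
        for dx in range(center[1] - radius, center[1] + radius + 1):
            y, x = cy + dy, cx + dx
            if 0 <= y and y + bs <= ref.shape[0] and 0 <= x and x + bs <= ref.shape[1]:
                cost = sad(block, ref[y:y + bs, x:x + bs])
                if cost < best:
                    best, best_mv = cost, (dy, dx)
    return best_mv

def multires_search(ref_pyr, cur_pyr, mb_y, mb_x, w=16):
    """Full search at level 2 (range w//4 on the quarter-size frame),
    then refine the doubled vector with small local searches at levels
    1 and 0; block sizes 4/8/16 match the MB size at each level."""
    mv = local_search(ref_pyr[2], cur_pyr[2], mb_y * 4, mb_x * 4, (0, 0), w // 4, 4)
    mv = local_search(ref_pyr[1], cur_pyr[1], mb_y * 8, mb_x * 8,
                      (mv[0] * 2, mv[1] * 2), 2, 8)
    return local_search(ref_pyr[0], cur_pyr[0], mb_y * 16, mb_x * 16,
                        (mv[0] * 2, mv[1] * 2), 2, 16)
```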

3.2.2. Object Localization

At the coarsest level, after multi-resolution motion estimation, an object localization algorithm is used to locate potential objects in the video sequence for the subsequent object-based bit allocation. Initially, we check each frame for camera motion and compensate the motion vectors for the global motion if camera motion is detected. Otherwise, noisy motion vectors are eliminated directly without motion compensation. Subsequently, motion vectors that have similar magnitude and direction are clustered together, and each group of associated macroblocks with similar motion vectors is regarded as an object. An overview of the object localization algorithm is shown in Fig. 3-3.

Fig. 3-3 Object localization algorithm

3.2.2.1. Global Motion Estimation

To correctly locate the position of objects, global motion (camera motion), such as panning, zooming, and rotation, should be estimated and compensated. In this section, a fast and simplified global motion detection algorithm is proposed.

Many global motion estimation algorithms have been proposed, based on motion models of two (translational model), four (isotropic model), six (affine model), eight (perspective model), or twelve parameters (parabolic model). They can be classified into three types: frame matching, differential techniques, and feature-point-based algorithms.

Since all the methods based on motion models require heavy computation, we propose a simple algorithm that calculates the global motion using histograms to reduce the complexity.

The histograms of the magnitude and direction of the motion vectors are computed to acquire the dominant motion direction and dominant motion magnitude, which further identify whether global motion (pan and tilt) happens or not. Using this histogram-based dominant motion computation, we avoid the matrix multiplications that make fitting motion vectors to a motion model computationally expensive.

The magnitude and direction of the camera motion are obtained using the equations below, where DMH and DAH are the dominant magnitude and dominant direction of the motion vector histograms, respectively; $SDMH_i$ is the summation of the three bins ($Bin_{DMH-1,i}$, $Bin_{DMH,i}$, $Bin_{DMH+1,i}$) of the magnitude histogram of the ith frame; $SDAH_i$ is the summation of the three bins ($Bin_{DAH-1,i}$, $Bin_{DAH,i}$, $Bin_{DAH+1,i}$) of the direction histogram of the ith frame; and $N(Bin_{j,i})$ denotes the value of the jth bin in the ith frame.

In the ideal situation, all macroblocks in an object would have the same motion magnitude and direction. However, even though an entire object moves in the same direction, some regions of the object might have different but similar motion magnitudes and directions, because objects in the real world are not rigid in shape and size. Consequently, to tolerate motion estimation errors, the values of $Bin_{DMH-1,i}$, $Bin_{DMH,i}$, and $Bin_{DMH+1,i}$ of the magnitude histogram are summed to examine whether the summation $SDMH_i$ is larger than the threshold, and the values of $Bin_{DAH-1,i}$, $Bin_{DAH,i}$, and $Bin_{DAH+1,i}$ of the direction histogram are summed to obtain $SDAH_i$. If $SDMH_i$ and $SDAH_i$ are both larger than the threshold $T_{global}$, global motion has happened, and DMH and DAH are identified as the magnitude and direction of the camera motion in the ith frame. The motion vectors are then compensated with the magnitude and direction of the global motion for further processing.
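A sketch of this histogram test, where the bin counts and the threshold $T_{global} = 0.5$ (half of all motion vectors) are assumed values, not ones taken from the thesis:

```python
import numpy as np

def detect_global_motion(magnitudes, angles, mag_bins=16, ang_bins=16,
                         t_global=0.5):
    """Histogram-based global motion check described above: find the
    dominant magnitude/direction bins (DMH, DAH), sum each with its two
    neighbours (SDMH, SDAH), and declare camera motion when both sums
    exceed the threshold."""
    n = max(len(magnitudes), 1)
    mag_hist, mag_edges = np.histogram(magnitudes, bins=mag_bins)
    ang_hist, ang_edges = np.histogram(angles, bins=ang_bins)
    dmh, dah = int(mag_hist.argmax()), int(ang_hist.argmax())
    sdmh = mag_hist[max(dmh - 1, 0):dmh + 2].sum() / n
    sdah = ang_hist[max(dah - 1, 0):dah + 2].sum() / n
    if sdmh > t_global and sdah > t_global:
        # report the dominant bin centres as the camera-motion estimate
        mag = 0.5 * (mag_edges[dmh] + mag_edges[dmh + 1])
        ang = 0.5 * (ang_edges[dah] + ang_edges[dah + 1])
        return mag, ang
    return None                         # no global motion detected
```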

3.2.2.2. Object Clustering

We use a region-growing approach to cluster macroblocks whose motion vectors have similar magnitude and direction, and each group of associated macroblocks with similar motion vectors is regarded as an object. The detailed algorithm is presented in the following.

Object Localization Algorithm

Input: Coarsest layer of the input frame

Output: Object sets {Obj1, Obj2, …, Objn}, where n is the total number of objects in the frame. Each object's size is measured in terms of the number of macroblocks it contains.
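Since the listing is truncated here, the following sketch illustrates the region-growing clustering it describes, with assumed similarity thresholds; each returned object is a list of macroblock coordinates, so its size is simply the list length:

```python
import math
from collections import deque

def cluster_objects(mv_field, mag_thresh=1.0, ang_thresh=0.5):
    """Region growing over a 2-D macroblock motion-vector field:
    mv_field[r][c] is a (dx, dy) tuple, or None for an eliminated noisy
    vector. Neighbouring macroblocks with similar vectors form one object."""
    rows, cols = len(mv_field), len(mv_field[0])
    labels = [[-1] * cols for _ in range(rows)]
    objects = []

    def similar(a, b):
        # compare the magnitudes and directions of two motion vectors
        dm = abs(math.hypot(*a) - math.hypot(*b))
        da = abs(math.atan2(a[1], a[0]) - math.atan2(b[1], b[0]))
        return dm < mag_thresh and da < ang_thresh

    for r in range(rows):
        for c in range(cols):
            if mv_field[r][c] is None or labels[r][c] != -1:
                continue
            obj, queue = [], deque([(r, c)])
            labels[r][c] = len(objects)
            while queue:                      # breadth-first region growing
                y, x = queue.popleft()
                obj.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and labels[ny][nx] == -1
                            and mv_field[ny][nx] is not None
                            and similar(mv_field[y][x], mv_field[ny][nx])):
                        labels[ny][nx] = len(objects)
                        queue.append((ny, nx))
            objects.append(obj)               # object size = len(obj) macroblocks
    return objects
```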
