Our Goal - Problem Statement - Fast Encoding Algorithm Design for the H.264/MPEG-4 AVC Scalable Video Coding Standard

Section 2.4 Problem Statement

2.4.2 Our Goal

In summary, an H.264/AVC-based scalable encoder adopts a layered coding structure and applies a hierarchical prediction structure to produce scalable bit-streams while maintaining high coding efficiency. The price of the scalable features and the greatly improved coding efficiency, however, is a heavy increase in computation. This cost mainly comes from the two nested exhaustive searches (selection of the temporal prediction type and of the mode partition) performed for the hierarchical-B frames at the temporal enhancement layers of each coding layer.

Fast encoding algorithms are therefore desirable for eliminating the unnecessary computational load of evaluating BI and for reducing the enhancement-layer computational complexity while achieving a similar level of coded picture quality. In this dissertation,

Intra-layer correlations in the hierarchical prediction structure

The selection of temporal prediction type in the hierarchical mode partition has a certain consistency in the large block sizes.

The two uni-directional predictions can achieve coding performance similar to that obtained by evaluating all three temporal prediction types. This is particularly true for the small block partitions.

Inter-layer correlations in the mode partition

Between coding layers, the intra predictions are dominated by the IntraBL and Intra4x4. Moreover, the intra prediction direction at the enhancement layer usually selects the one chosen at the base layer or its adjacent predictions.

Between coding layers, the distortions and the rate costs in intra prediction have a log-linear property.

At the enhancement layers, the candidates of block partition can be determined by referring to the selected mode at the reference layer.

The temporal predictions between coding layers have a certain consistency.

The motion vectors found at the base layer can be a good starting point for the enhancement layers.

Chapter 3 Fast Temporal Prediction Selection in H.264/AVC Temporal Scalable Video Coding

In this chapter, we propose a fast algorithm that efficiently selects the temporal prediction type for the dyadic hierarchical-B prediction structure in the H.264/AVC temporal scalable video coding.

Referring to the best temporal prediction type of the 16x16 partition, we exploit the strong correlation of prediction type inheritance to eliminate the unnecessary BI computations in the finer partitions (16x8, 8x16, and 8x8). In addition, we carefully examine the relationship of motion-rate costs and distortions between BI and the uni-directional temporal prediction types, and from it we construct a set of adaptive thresholds that remove further unnecessary BI calculations. Moreover, for block partitions smaller than 8x8, either FW or BW is skipped based on the information of the 8x8 partition. The proposed schemes thus efficiently reduce the extensive computational burden of BI prediction. Compared with JSVM 9.11 [10], our method saves 60% to 66% of the encoding time for a large number of test videos over a typical range of coding bit-rates, with a negligible coding penalty.
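The pruning rules summarized above can be sketched in a few lines. This is only an illustration under stated assumptions, not the algorithm as specified later in the chapter: the function name `candidate_prediction_types`, the string labels, and the fallback used when the 8x8 partition itself chose BI are hypothetical, and the adaptive-threshold test is deliberately left out.

```python
ALL_TYPES = ("FW", "BW", "BI")
UNI_TYPES = ("FW", "BW")


def candidate_prediction_types(partition, best_16x16=None, best_8x8=None):
    """Return the temporal prediction types still worth evaluating.

    partition  -- "16x16", "16x8", "8x16", "8x8", or a sub-8x8 size such as "4x4"
    best_16x16 -- best type already found for the 16x16 partition (if known)
    best_8x8   -- best type already found for the 8x8 partition (if known)
    """
    if partition == "16x16":
        # The largest partition is always searched exhaustively.
        return ALL_TYPES
    if partition in ("16x8", "8x16", "8x8"):
        # Inheritance: a uni-directional 16x16 result suggests BI can be skipped.
        if best_16x16 in UNI_TYPES:
            return UNI_TYPES
        return ALL_TYPES
    # Sub-8x8 partitions: BI is never evaluated, and the direction not chosen
    # by the 8x8 partition is skipped as well.
    if best_8x8 in UNI_TYPES:
        return (best_8x8,)
    # Assumption: if the 8x8 partition chose BI (or is unknown), try both directions.
    return UNI_TYPES
```

For example, `candidate_prediction_types("8x16", best_16x16="FW")` keeps only the two uni-directional searches, and `candidate_prediction_types("4x4", best_8x8="BW")` keeps only the backward search.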

The rest of this chapter is organized as follows. As compared to the hierarchical-B prediction structure, the prior works related to the multiple reference frames and fast algorithms in mode

prediction structure and its decision process for the temporal prediction type in the JSVM encoder [10]. Its dramatically higher encoding complexity, as compared to the IPPP coding structure, is also revealed. Section 3.3 summarizes our observations on the correlations among the three temporal prediction types. Based on these analyses, Section 3.4 presents our fast bi-directional prediction selection algorithm. In Section 3.5, our proposed scheme is compared with JSVM 9.11 [10] and the state-of-the-art algorithms [72][73] in terms of complexity reduction and rate-distortion performance.

Section 3.1 Literature Review

To enhance the compression efficiency of IPPP-coded H.264/AVC [4], a typical approach is long-term prediction [53], also termed multiple reference frames, which can potentially reduce the prediction error. Its complexity, however, is linearly proportional to the number of reference frames used.

Hence, a number of studies have sought to reduce the extra computations of this strategy. In [54], the accuracy of motion estimation (integer-pel, half-pel, or quarter-pel) is used to classify where a moving object is located in each available reference frame. An object labeled with the same accuracy on two or more reference frames implies that the same shifted texture can be found in those references. Therefore, it is sufficient to select the closest one as the candidate.

Furthermore, [55] and [56] study the continuity of moving objects to construct motion maps among different reference frames. Based on the motion trajectory, the initial search point on the farther references is conjectured. Once the motion from the most recent frame is known, its reference area may cover several blocks. The motions from the farther references are estimated by the weighted sum [55] or the median [56] of the motions of the covered blocks. With these temporary predictive motions, the true motions can be searched in a much smaller search range. These two schemes [55][56] make use of the motion information from the first previous frame. In contrast, the approach in [57] obtains composed motions by combining blocks from different references to provide a more accurate initial guess. Moreover, in [58]–[60] the selection among multiple reference frames is terminated early by a set of conditions, such as all-zero block detection, the energy of the prediction error, and the optimal block partition in the first previous frame. However, for practical use in H.264/AVC [4], Huang et al. [59] empirically illustrate that the coding gain from multiple reference frames depends highly on the content of the sequences, not on the number of searched references. This observation is then theoretically supported in [60]. That is, turning on multiple reference frames usually does not bring a noticeable improvement (say, more than 1 dB) in rate-distortion performance, but incurs huge computations in motion search.

On the other hand, a fairly large body of literature addresses the complexity reduction of the H.264/AVC coder [4], based on estimated rate-distortion costs used as thresholds and/or on mode selection. In [61], the candidate modes and the rate-distortion cost threshold are derived from the temporally and spatially neighboring areas. Furthermore, the transformed residuals and the corresponding coding bits are highly linearly correlated [62]. Based on the nonzero quantized transform coefficients, the schemes in [62][63] construct an improved rate-distortion estimator that reduces the burden of the entropy coding and reconstruction operations during the mode decision process. To avoid computing the coding bits of residuals entirely, the distortion, the required motions, and the block-type header are used to develop a new cost function for mode selection [64].

Moreover, so-called early termination is a popular approach to mode selection [65]–[71]. For example, multiple termination criteria, set hierarchically from large partitions to small block sizes, eliminate the sophisticated mode search [65]. In [66] and [67], sufficient conditions for detecting all-zero blocks are theoretically studied in order to skip testing unnecessary small partitions. In addition, spatio-temporal motion characteristics are considered in arranging the mode set [68]–[70]. The motion activity of an object along its moving trajectory determines its dominant modes [68][69], while the spatial motion homogeneity is divided into multiple levels to obtain a subset of partition modes [70][71].

All the above schemes are applicable to the IPPP/IBP/IBBP coding structures, in which the coded frame and its reference are very close, but few focus on the hierarchical prediction in H.264/SVC temporal scalability. Moreover, those H.264/AVC-based fast algorithms may not work well there, because the correlations between the current frame and its references may not be sufficiently strong and reliable at the lower temporal layers. In [72], the characteristics of low/high-motion areas at low temporal layers are employed to select the block mode at high temporal layers. Lee et al. [73] use statistical hypothesis testing to conditionally skip the partitions smaller than 16x16. However, these methods consider only the encoding parameters of the mode partitions.

Up to now, surprisingly few researchers have paid attention to the selection of temporal prediction types (FW, BW, and BI). Although these three temporal prediction types can provide highly efficient compression, they require more than three times the total motion search computation of the IPPP coding structure. Therefore, the aim of this work is to design a fast temporal prediction selection algorithm for the dyadic hierarchical-B prediction structure in H.264/SVC [2].

To achieve this goal, we statistically analyze the correlations of temporal prediction types in large partitions and show that BI has limited coding benefits in small partitions [74]. The correlations of motion-rate costs among the three temporal prediction types are examined and formulated by a first-order regression model. Additionally, the relationships among the distortions are investigated [75], and the prediction errors of the uni-directional temporal predictions are shown to follow a joint Laplacian distribution, as verified by goodness-of-fit tests. Based on these observations, we propose a novel scheme that avoids unnecessary and massive BI evaluations through inheritance of the temporal prediction types and the use of adaptive thresholds in the hierarchical-B prediction structure of H.264/SVC [2]. Our simulations show very promising results. On average, our approach provides up to 66% overall encoder time saving over JSVM 9.11 [10], which is equivalent to an encoding process about three times faster.
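The link between the reported time saving and the quoted speed-up can be checked with a one-line calculation. The sketch below assumes that the time saving $s$ is measured relative to the JSVM encoding time, i.e. $s = (T_{\text{ref}} - T_{\text{fast}})/T_{\text{ref}}$; the symbols $s$, $T_{\text{ref}}$, and $T_{\text{fast}}$ are introduced here only for illustration.

```latex
\[
\text{speed-up} \;=\; \frac{T_{\text{ref}}}{T_{\text{fast}}} \;=\; \frac{1}{1 - s},
\qquad
s = 0.66 \;\Rightarrow\; \frac{1}{0.34} \approx 2.94,
\qquad
s = 0.60 \;\Rightarrow\; \frac{1}{0.40} = 2.5 .
\]
```

Under this assumption, the 60% to 66% range of time savings corresponds to an encoder that runs roughly 2.5 to 3 times faster.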

Section 3.2 Observations and Analysis on Temporal Prediction at Temporal Enhancement Layers

In this section, we investigate the statistical correlations of the three temporal prediction types (FW, BW, and BI) at the H.264/SVC temporal enhancement layers. In Subsection 3.2.1, we examine the prediction type distributions and their inheritance in the hierarchical blocks from large partitions to small ones. Then, Subsection 3.2.2 analyzes the relative coding efficiency contributed by BI. The last subsection explores the correlations between the uni-directional predictions and BI in terms of the rate-distortion costs and the motion-rate costs. These statistical analyses are based on encoding one temporal base layer with four temporal enhancement layers; that is, the GOP size is 16. To evaluate the impact, two values, and , are tested in the experiments. The training set contains three MPEG test videos: FOOTBALL (CIF, fast motion), FOREMAN (CIF, medium motion), and MOBILE (CIF, complicated texture).
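The relation between a GOP size of 16 and the one-plus-four layer configuration can be made concrete with a small sketch of the dyadic hierarchy. It assumes that key frames sit at multiples of the GOP size and that each hierarchical-B frame references the two nearest frames of lower temporal layers; the function names and the frame-index convention are hypothetical.

```python
def temporal_layer(frame_index, gop_size=16):
    """Temporal layer of a frame in a dyadic hierarchical-B GOP (0 = base layer)."""
    num_enh_layers = gop_size.bit_length() - 1      # log2(16) = 4 enhancement layers
    pos = frame_index % gop_size
    if pos == 0:
        return 0                                    # key frame of the base layer
    trailing_zeros = (pos & -pos).bit_length() - 1  # exponent of the largest power of 2 dividing pos
    return num_enh_layers - trailing_zeros


def reference_frames(frame_index, gop_size=16):
    """Forward and backward references of a hierarchical-B frame (None for key frames)."""
    layer = temporal_layer(frame_index, gop_size)
    if layer == 0:
        return None
    distance = gop_size >> layer                    # 8, 4, 2, 1 for layers 1..4
    return frame_index - distance, frame_index + distance
```

With these assumptions, frame 8 belongs to layer 1 and references frames 0 and 16, whereas every odd frame belongs to layer 4 and references its two immediate neighbors. In each case the two references are equally distant from the current frame.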

3.2.1 Inheritance of Temporal Prediction Types

In this subsection, we collect the probability distributions of temporal prediction types.

3.2.1.1 Prediction Type Distributions

We first gather the probability distribution of the temporal prediction types used in several distinct block partitions at various temporal enhancement layers and at different values. In any temporal layer of the hierarchical-B prediction structure, the temporally forward and backward reference frames are at equal distances from the current frame. If the objects under test move at constant velocity, the current MB can find its shifted version in either the forward or the backward reference. This implies that FW and BW are nearly equally likely to be selected, as shown in Fig. 3-1, particularly for the MOBILE sequence.

Fig. 3-1 Distribution of temporal prediction types (FW, BW, and BI) at different temporal enhancement layers

The selection of BI is highly dependent on the video content and the values, especially when the block partitions are larger than 8x8. Also, BI is selected more often at high bit-rates (small ) because the encoder has sufficient bits to reduce the reconstruction distortion. In the MOBILE sequence, the BI probability of the 16x16 partition reaches about 80% at the low temporal enhancement layers, but decreases to 20% or less at the high temporal layers. In general, BI is favored for large partitions (16x16 or 16x8) because it offers more accurate motion compensation at a small motion bit-rate overhead. The distortion reduction obtained from good motion vectors is less significant in low-motion or motionless areas. Thus, the FOREMAN sequence uses fewer BI types. Clearly, distant reference frames reduce the motion compensation effectiveness. Therefore, the BI percentage goes down drastically at higher temporal layers. On the other hand, BI needs to transmit twice as many motion vectors as FW or BW. If the distortion reduction provided by BI cannot compensate for the increased motion-rate cost, the BI type is not chosen. This is particularly true for small blocks (4x4, 8x8). Therefore, at the same temporal enhancement layer, larger partitions prefer BI, especially for the complex-textured sequence MOBILE.
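The trade-off described above, in which BI is chosen only when its distortion reduction outweighs its extra motion bits, can be sketched as a Lagrangian decision. The sketch assumes the usual cost form J = D + λ·R; the function name, the dictionary layout, and all numbers are made up for illustration and are not measurements from the experiments.

```python
def best_prediction_type(costs, lam):
    """Pick the temporal prediction type with the smallest cost J = D + lam * R.

    costs maps a type name to (distortion, motion_rate_in_bits).
    """
    def lagrangian_cost(pred_type):
        distortion, rate = costs[pred_type]
        return distortion + lam * rate

    return min(costs, key=lagrangian_cost)


# Hypothetical block: BI lowers the distortion but needs twice the motion bits.
example = {"FW": (1000.0, 10), "BW": (1020.0, 10), "BI": (900.0, 20)}

choice_small_lambda = best_prediction_type(example, lam=5.0)   # BI: 1000 < 1050 < 1070
choice_large_lambda = best_prediction_type(example, lam=20.0)  # FW: 1200 < 1220 < 1300
```

Because the Lagrange multiplier in H.264/AVC reference encoders grows with the quantization parameter, a small multiplier corresponds to high bit-rates, which is consistent with BI being selected more often there.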

In summary, BI benefits the 16x16 and 16x8/8x16 partitions at the lower temporal enhancement layers, but for the small partitions from 8x8 down to 4x4, BI is seldom selected. This observation does not seem to be strongly affected by the video contents and the coding bit-rates.

3.2.1.2 Elimination of BI for Large Partitions

As discussed in Subsection 3.2.1.1, the BI probability in the 16x8/8x16/8x8 partitions is much smaller than that in the 16x16 block size. That is, for instance, is less than , which implies . In order to find out whether or not these two groups and overlap with each other, we consider three conditional probabilities defined below:

, (3.1)

, and (3.2)

. (3.3)

In Table 3-1, these three conditional probabilities are at least 80% in all cases and about 95% on average. Moreover, they are very close to one at the higher temporal enhancement layers. This strong correlation indicates that the uni-directional prediction types are inheritable from the 16x16 partition to the 16x8/8x16/8x8 partitions. Thus, the result of the prior evaluation of BI on the 16x16 partition provides a very reliable estimate of the use of BI for the 16x8/8x16/8x8 partitions at both low and high bit-rates, and we can quite accurately eliminate the use of BI in those partitions.
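A conditional probability of this kind can be estimated by simple counting over per-macroblock records. Since the definitions (3.1) to (3.3) are not legible in this copy, the sketch assumes the form P(finer partition does not choose BI | 16x16 does not choose BI), which matches the inheritance argument above; the helper name and the toy records are hypothetical and are not experimental data.

```python
def conditional_probability(records, event, given):
    """Estimate P(event | given) by counting over per-macroblock records."""
    given_count = sum(1 for r in records if given(r))
    if given_count == 0:
        return float("nan")  # the condition never occurred
    joint_count = sum(1 for r in records if given(r) and event(r))
    return joint_count / given_count


# Toy records: best temporal prediction type chosen per partition size.
records = [
    {"16x16": "FW", "16x8": "FW"},
    {"16x16": "BW", "16x8": "BW"},
    {"16x16": "FW", "16x8": "BI"},
    {"16x16": "BI", "16x8": "BI"},
    {"16x16": "BW", "16x8": "FW"},
]

p_inherit = conditional_probability(
    records,
    event=lambda r: r["16x8"] != "BI",
    given=lambda r: r["16x16"] != "BI",
)
```

In this toy set, four macroblocks avoid BI at 16x16 and three of them also avoid it at 16x8, so the estimate is 0.75.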

Table 3-1 Conditional probabilities defined in (3.1), (3.2), and (3.3) at temporal enhancement layers 1 to 4 (upper and lower row groups: first and second tested values)

Test sequence    (3.1)                   (3.2)                   (3.3)
                 1     2     3     4     1     2     3     4     1     2     3     4
(first tested value)
FOOTBALL         0.94  0.95  0.96  0.97  0.94  0.95  0.96  0.97  0.96  0.97  0.98  0.98
FOREMAN          0.98  0.98  0.98  0.98  0.98  0.98  0.98  0.98  0.99  0.99  0.99  0.99
MOBILE           0.93  0.96  0.97  0.98  0.93  0.95  0.97  0.97  0.97  0.98  0.99  0.99
(second tested value)
FOOTBALL         0.85  0.89  0.90  0.93  0.84  0.88  0.90  0.92  0.92  0.93  0.94  0.96
FOREMAN          0.94  0.95  0.97  0.97  0.93  0.95  0.96  0.97  0.97  0.98  0.98  0.99
MOBILE           0.82  0.86  0.88  0.91  0.80  0.83  0.88  0.89  0.85  0.90  0.92  0.94
AVG.             0.94                    0.93                    0.96
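As a consistency check, the AVG. row of Table 3-1 can be reproduced from the listed entries by reading each row as three groups of four values, one group per conditional probability. The grouping is an interpretation of the flattened table, and the variable and function names are only illustrative.

```python
# Rows of Table 3-1 in their original order (three sequences per tested value).
table_3_1 = [
    [0.94, 0.95, 0.96, 0.97, 0.94, 0.95, 0.96, 0.97, 0.96, 0.97, 0.98, 0.98],  # FOOTBALL
    [0.98, 0.98, 0.98, 0.98, 0.98, 0.98, 0.98, 0.98, 0.99, 0.99, 0.99, 0.99],  # FOREMAN
    [0.93, 0.96, 0.97, 0.98, 0.93, 0.95, 0.97, 0.97, 0.97, 0.98, 0.99, 0.99],  # MOBILE
    [0.85, 0.89, 0.90, 0.93, 0.84, 0.88, 0.90, 0.92, 0.92, 0.93, 0.94, 0.96],  # FOOTBALL
    [0.94, 0.95, 0.97, 0.97, 0.93, 0.95, 0.96, 0.97, 0.97, 0.98, 0.98, 0.99],  # FOREMAN
    [0.82, 0.86, 0.88, 0.91, 0.80, 0.83, 0.88, 0.89, 0.85, 0.90, 0.92, 0.94],  # MOBILE
]


def group_averages(rows, group_size=4):
    """Average every block of `group_size` consecutive columns over all rows."""
    num_groups = len(rows[0]) // group_size
    averages = []
    for g in range(num_groups):
        values = [v for row in rows for v in row[g * group_size:(g + 1) * group_size]]
        averages.append(round(sum(values) / len(values), 2))
    return averages


averages = group_averages(table_3_1)  # [0.94, 0.93, 0.96]
```

The three averages agree with the AVG. row, which supports reading the twelve columns as three probabilities over four temporal enhancement layers.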

3.2.1.3 Consistency of FW and BW in Small Partitions

We now look into the block partitions smaller than 8x8. We find that only FW and BW need to be evaluated; in addition, the temporal prediction types of the 8x8 partition are strongly correlated with those of the 4x4 partition. As discussed in Subsection 3.2.1.1, the probabilities of using BI for the 8x8 and the smaller partitions are often less than 20% and 10%, respectively. We collect the following conditional probabilities of using the FW and BW types. One is defined by

, (3.4)

which is equivalent to . Typically, the

term is less than 2% in our collected data. The probability

can thus be approximated by . Similarly defined

is close to .

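Table 3-2 Conditional probability defined in (3.4) and its similarly defined counterpart at temporal enhancement layers 1 to 4 (upper and lower row groups: first and second tested values)

Test sequence    (3.4)                   Similarly defined counterpart
                 1     2     3     4     1     2     3     4
(first tested value)
FOOTBALL         0.86  0.88  0.90  0.92  0.83  0.86  0.88  0.90
FOREMAN          0.94  0.95  0.96  0.97  0.91  0.93  0.95  0.96
MOBILE           0.85  0.87  0.89  0.90  0.85  0.88  0.89  0.90
(second tested value)
FOOTBALL         0.83  0.85  0.87  0.89  0.74  0.83  0.86  0.89
FOREMAN          0.89  0.91  0.93  0.93  0.89  0.91  0.92  0.92
MOBILE           0.88  0.88  0.87  0.87  0.88  0.88  0.87  0.87
AVG.             0.90                    0.88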

Experiments show that the approximated values of and are fairly close to the data in Table 3-2. Moreover, these two conditional probabilities increase slightly at higher temporal enhancement layers, except for the MOBILE sequence with , for which the correlations are rather similar at all temporal enhancement layers. On average, the consistency in selecting the same prediction direction can be up to 90%. Thus, the 8x8 prediction mode information serves as a good reference for the smaller partitions.

3.2.2 Rate-Distortion Contribution by BI

In this subsection, we address the relative rate-distortion improvement offered by BI in different block modes. As analyzed before, the hierarchical-B prediction structure takes advantage of the three temporal prediction types to improve the coding efficiency. According to rate-distortion theory, a temporal prediction type with a smaller rate-distortion cost provides better coding efficiency. We adopt the rate-distortion cost function defined by JSVM [10] (which derives essentially from rate-distortion theory) and collect all , , and for three square partitions, 16x16, 8x8, and 4x4. For each sub-block , we define the relative rate-distortion improvement , offered by the best temporal prediction type , as follows:

(3.5)

The overall relative rate-distortion improvement is the sum of of all sub-blocks; that is, , where . Furthermore, in order to quantitatively

performance index by

, (3.6)

which trivially yields

, (3.7)

where is the average operator and is the number of sub-blocks. By Bayes' theorem, this performance index is rewritten as

. (3.8)

In other words, the term , ranging from 0 to 1, indicates the percentage of the relative rate-distortion improvement contributed by BI for the totality of size blocks. Moreover, a large value shows that BI brings a significant relative improvement in and that BI should not be skipped.

Fig. 3-2 The performance index for individual hierarchical-B frame

Table 3-3 Average for 16x16, 8x8, and 4x4 blocks in each temporal enhancement layer (in percentage)

Fig. 3-2 depicts the performance index value for each hierarchical-B frame. As shown, the relative improvement offered by BI at high bit-rates is superior to that at low bit-rates because the encoder has more bits to reduce the distortion. This superiority at different bit-rates is particularly noticeable in the large 16x16 partition. On the other hand, the BI type usually furnishes less than 20% in terms of and , even at high coding rates.

Table 3-3 shows the average benefits offered by the BI type at various temporal enhancement layers. As illustrated, the performance index values decrease as the partition becomes finer. The has an average value of 46.5%; that is, BI plays an important role in improving coding efficiency for large partitions. For the 8x8 partition, the effect of BI plunges to 12.0% on average, which says that the two uni-directional predictions are sufficient to provide good compression.

Furthermore, when the partition becomes as fine as 4x4, the contribution of BI can be ignored because is typically less than 2%. However, some test videos such as MOBILE need BI to achieve better rate-distortion performance for both the 16x16 and 8x8 partitions, since its and values reach 88.6% and 40.0%, respectively. In conclusion, the BI prediction type offers little coding gain for the block partitions smaller than 8x8.

3.2.3 Rate-Distortion Relationships between Uni-directional Predictions and Bi-directional Prediction

In this subsection, we are interested in the relationships between the uni-directional predictions and BI in terms of motion-rate cost and residual distortion. We collect the following information in our experiments: (a) the motion vector difference, (b) the motion-rate cost , and (c) the distortion for the three temporal prediction types. The statistical observations and theoretical analyses of the experimental results are reported below.

3.2.3.1 Motion Vector Difference

In order to find out the correlations of two cost terms and between these temporal prediction types, we examine the motion vector differences after the motion vectors are refined by the BI search process. We look at two square block partitions, 16x16 and 8x8. On the JSVM 9.11 platform [10], we search for the best motion vectors of different prediction types for a specified block partition . Their notations are as follows:

: the FW motion vector of blocks

: the BW motion vector of blocks

: the forward motion vector refined by BI for blocks

: the backward motion vector refined by BI for blocks.

As described earlier, the BI search process takes and as its initial search points for motion estimation. The Euclidean distance to measure the motion vector difference and

. (3.9)

We statistically gather the 16x16 and 8x8 blocks that choose BI as their best temporal prediction type to generate the probability distribution functions (PDF) and cumulative distribution functions (CDF) of and , as shown in Fig. 3-3.
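The measurement described here can be sketched as follows. Equation (3.9) is not legible in this copy, so the sketch assumes the plain Euclidean norm between the initial uni-directional motion vector and the corresponding vector refined by the BI search, in whatever unit the motion vectors are stored. The function names and the sample vectors are hypothetical.

```python
import math


def mv_difference(initial_mv, refined_mv):
    """Euclidean distance between an initial and a BI-refined motion vector."""
    dx = refined_mv[0] - initial_mv[0]
    dy = refined_mv[1] - initial_mv[1]
    return math.hypot(dx, dy)


def empirical_cdf(samples, threshold):
    """Fraction of samples that are less than or equal to the threshold."""
    return sum(1 for s in samples if s <= threshold) / len(samples)


# Hypothetical (initial, refined) motion vector pairs for blocks that chose BI.
pairs = [((4, -2), (4, -2)), ((4, -2), (5, -2)), ((0, 0), (3, 4)), ((1, 1), (1, 0))]
distances = [mv_difference(initial, refined) for initial, refined in pairs]
within_one = empirical_cdf(distances, 1.0)
```

For these four sample pairs the distances are 0, 1, 5, and 1, so three quarters of the refined vectors stay within one unit of their starting points.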

Fig. 3-3 PDFs and CDFs of the motion vector difference for 16x16 and 8x8 blocks with two selected values

Fig. 3-4 Distributions of motion-rate costs and : (a) 16x16 partition size; (b) 8x8 partition size

As shown, the PDFs of the motion vector difference are strongly clustered around the starting search points. Particularly, the one-pixel probability is close to 90% for the
