The remainder of this thesis is organized as follows. Chapter 2 provides a brief review of SHVC and related prior work. Chapter 3 presents our proposed algorithm, including its concept of operation, the weighting schemes for different components, and how their weighting should be adapted in response to changes in coding parameters. Chapter 4 reports experimental results obtained by implementing the proposed scheme in the SHM-1.0 software and conducting tests under the common test conditions [8]. Results for simplified schemes are also presented. Finally, Chapter 5 concludes this work with a summary of our findings and a list of directions for future work.
CHAPTER 2
Background
2.1. H.264/AVC Scalable Video Coding
Scalable video coding (SVC) aims to generate a bit stream from which parts can be removed in such a way that the resulting sub-stream forms another valid bit stream and can still be decoded [10]. With scalability, users experience graceful degradation of video quality rather than visible errors or interruptions. SVC supports three types of scalability: spatial, temporal, and quality.
Temporal scalability is provided by utilizing the hierarchical temporal prediction structure of single-layer coding [12].
Spatial scalability makes use of a multi-layer pyramid approach in which separate encoding loops are employed for different spatial resolution layers and an adaptive inter-layer prediction mechanism is developed to exploit the correlations between layers.
Lastly, quality scalability supports different quality levels by varying the quantization parameters across quality layers at the same spatial resolution.
In the H.264/AVC video coding standard, the coded video data is organized into Network Abstraction Layer (NAL) units, which are packets that each contain an integer number of bytes. A set of NAL units in a specified form is referred to as an access unit, and the decoding of each access unit results in one decoded picture. For an SVC bit stream, the set of corresponding access units can be partitioned into a base layer (BL) and one or more enhancement layers (ELs) with the following property: for each access unit, the coding parameters of the BL are determined first, and the ELs are then coded given these data. In SVC, the BL and EL bit streams are coded by separate encoding loops, which are referred to as coding layers; in each layer, the basic concepts of inter-frame and intra-frame prediction are employed as in H.264/AVC.
The correlation between layers is exploited by the Inter-Layer Prediction (ILP) mechanisms which are considered as additional coding options for the EL in the operational encoder control.
A simplification used throughout this thesis is to primarily consider a bit stream containing only two layers (e.g. a lower-resolution BL and a higher-resolution spatial scalability EL). In fact, the SVC design fully supports multilayer scenarios, including multiple spatial scalability layers and the mixing of spatial scalability layers with other layers that provide temporal or quality scalability.
Figure 1: A high-level block diagram of an HEVC encoder
2.2. HEVC and the Scalable Extension to HEVC
Within the scope of this thesis, scalable video coding refers to the scalable extension of High Efficiency Video Coding (HEVC), known as Scalable High Efficiency Video Coding (SHVC). Therefore, this section briefly reviews the coding structure of HEVC, including its similarities to and differences from its predecessor H.264/AVC, and then examines the current status of the scalable extension to HEVC.
2.2.1. Coding Structure of the HEVC
HEVC is a block-based hybrid video coding method that uses spatial or temporal prediction followed by transformation. In HEVC, an input frame is first divided into multiple coding tree units (CTUs), which are analogous to the macroblocks of H.264/AVC and earlier standards; unlike macroblocks, however, CTUs have a variable size (e.g. 16x16, 32x32, or 64x64 samples). A CTU may contain only one coding unit (CU) or may be split to form multiple CUs, and every sample in a CU is coded using either inter-frame or intra-frame prediction. In addition, each CU is divided into one or more prediction units (PUs) and a tree of transform units (TUs). Inside a PU, the same prediction process (e.g. the same intra prediction mode or the same motion information) is applied to all pixels.
Figure 1 shows a high-level block diagram of an HEVC encoder. For each PU, the residual pixels $r_1, \dots, r_N$ are generated by subtracting the predicted pixels $p_1, \dots, p_N$ (produced by either the ‘Intra Prediction’ or the ‘Inter Prediction’ module) from the input pixels $o_1, \dots, o_N$. These residual values are then fed into the ‘T/Q’ module to form the quantized transform coefficients $x_1, \dots, x_N$. After that, entropy coding is performed on $x_1, \dots, x_N$ and on side information, including the intra mode and motion parameters, to generate the coded output bit stream. To ensure that the encoder and decoder generate identical predictions for subsequent data, the encoder duplicates the decoder processing loop internally. As a result, the reconstructed residual $\tilde{r}_1, \dots, \tilde{r}_N$ is generated by inverse scaling followed by an inverse transformation inside the ‘IQ/IT’ module. The ‘In-loop Filter’ module is similar to the de-blocking filter module in H.264, but it additionally contains a Sample Adaptive Offset module [3].
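The role of the duplicated decoder loop can be illustrated with a minimal sketch. It is not the HEVC design: the transform is omitted and a plain uniform scalar quantizer with step size `step` stands in for the ‘T/Q’ and ‘IQ/IT’ modules, and the function names are hypothetical.

```python
def encode_block(orig, pred, step):
    """Form the residual o - p and quantize it (stand-in for the T/Q module)."""
    residual = [o - p for o, p in zip(orig, pred)]
    # Assumption: uniform scalar quantization, no transform.
    return [round(r / step) for r in residual]


def reconstruct_block(levels, pred, step):
    """Inverse-scale the levels (stand-in for IQ/IT) and add the prediction."""
    recon_residual = [q * step for q in levels]
    return [p + r for p, r in zip(pred, recon_residual)]


orig = [100, 104, 98, 97]
pred = [101, 101, 101, 101]
levels = encode_block(orig, pred, step=4)

# Encoder and decoder run the same reconstruction from the same levels,
# so both hold identical reference samples for later predictions.
recon_at_encoder = reconstruct_block(levels, pred, step=4)
recon_at_decoder = reconstruct_block(levels, pred, step=4)
```

Because both sides reconstruct from the quantized levels rather than from the original samples, their references cannot drift apart, which is the reason the encoder embeds the decoder loop.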
2.2.2. Coding Structure of the SHVC
Because temporal scalability can generally be achieved with a single-layer coding structure, the JCT-VC mainly focuses on developing scalable features to support spatial and coarse-grain SNR scalability. Conventionally, the Inter-Layer Prediction (ILP) mechanism is used to exploit the correlations between layers. ILP employs the coded data of the reference layer, including reconstructed pictures, intra modes, and motion parameters, to code an enhancement layer effectively. A general block diagram of a two-layer SHVC encoder is depicted in Figure 2.
Figure 2: A two-layer spatial scalability encoder using HEVC encoding modules
In [1], two signaling mechanisms, namely the RefIdx and TextureRL approaches, are proposed for ILP. Those tools are described as follows.
ILP in RefIdx-based SHVC: The BL reconstructed picture (referred to as the inter-layer reference picture) and the temporal reference pictures are included in the reference picture lists, so that inter-layer motion or texture prediction can be achieved simply by using the existing inter-frame prediction mechanisms. This approach has the benefit of reusing most of the single-layer HEVC design for encoding and decoding the enhancement layer, except for a few high-level (e.g. slice-level) syntax changes.
ILP in TextureRL-based SHVC: This approach allows changes to the low-level (e.g. CU- and/or PU-level) syntax and decoding process. Specifically, inter-layer motion prediction and texture prediction are achieved as follows.
Inter-layer motion prediction: The merge mode in HEVC is modified. In particular, the motion parameters of the collocated BL block are also used to form the merge candidate list. This merge candidate is derived at the location collocated with the central position of the current PU and is treated as the first candidate in the merge list.
Texture prediction: This mechanism is achieved by introducing a new prediction mode, named Intra-BL mode. This mode has to compete with the other intra-frame and inter-frame prediction modes in the Rate-Distortion Optimization (RDO) process, which selects the prediction mode with the lowest rate-distortion (RD) cost.
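The mode competition in RDO can be sketched as follows. The sketch assumes the commonly used Lagrangian cost J = D + λR, which the text above does not spell out; the mode names, distortion values, and bit counts are made-up illustration values, not measurements.

```python
def rd_cost(distortion, rate_bits, lam):
    """Lagrangian rate-distortion cost J = D + lambda * R (assumed form)."""
    return distortion + lam * rate_bits


def select_mode(candidates, lam):
    """Return the name of the candidate mode with the lowest RD cost.

    candidates maps a mode name to a (distortion, rate_bits) pair.
    """
    return min(candidates, key=lambda name: rd_cost(*candidates[name], lam))


# Hypothetical candidates for one block: (distortion, rate in bits).
candidates = {
    "intra_planar": (520.0, 38),
    "inter_merge": (610.0, 12),
    "intra_bl": (450.0, 30),
}

best_high_lambda = select_mode(candidates, lam=10.0)
best_low_lambda = select_mode(candidates, lam=2.0)
```

With a large λ the cheap-to-signal mode wins, while with a small λ the mode with the lowest distortion wins, which is why a new mode such as Intra-BL is only selected when its prediction quality justifies its signaling cost.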
Because it is not constrained to the inter-frame prediction mechanisms, the TextureRL approach is more flexible in constructing ILP. For example, a new prediction mode may be created at the PU level to better utilize both the BL and EL information for prediction. The price paid for this flexibility, however, is a less compatible codec for the EL. Furthermore, the RefIdx approach can already achieve most of the gain provided by TextureRL [13][14], which makes TextureRL the less favorable approach. Nevertheless, many believe that the flexibility of TextureRL has not yet been exploited to its best advantage. As a result, multiple combined-prediction designs have been proposed in SHVC to explore the potential of the TextureRL approach in forming a better ILP mechanism.
Those proposals use the BL and EL information to improve the coding efficiency of the EL. Specifically, the intra prediction improvements based on the reconstructed BL are described in the following sections.
Figure 3: Combined intra prediction at the enhancement layer in SHVC
2.3. Combined Intra Prediction
In intra prediction, it is difficult to predict the whole block accurately [4]. The bottom-right pixels of the block are usually predicted less accurately because they are far from the reference samples and their correlation with those samples is weaker. This inaccurate prediction results in large residual information, which reduces the coding efficiency, and the inaccuracy grows as the prediction block size increases. The issue deserves particular attention in the new standard, where the coding unit size increases from 16x16 in H.264/AVC to 64x64 in HEVC. Hence, a mechanism is needed that compensates for this loss of prediction accuracy by using information extracted from the BL.
Figure 3 illustrates the combined prediction process introduced in SHVC. The ‘Combined Prediction’ block extracts information from the BL to compensate the EL intra predictor, so that the difference between the final prediction and the input block is minimized. Prior work on improving intra prediction at the EL is reviewed in the following subsections.
Figure 4: Intra DC Correction algorithm, PEL(x,y) and PBL(x,y) are the pixel values of the EL intra predictor and collocated BL reconstructed block, N is the width (or height) of the current block.
2.3.1. Intra DC Correction
The intra DC correction (IDCC) in [7] is based on the observation that the mean value (DC value) of the EL prediction block can sometimes be better estimated from the collocated BL block. As a result, the DC value of the EL intra predictor is replaced by that of the collocated BL block. The DC values of the EL intra predictor and of its collocated BL block are calculated as the mean of all pixels in the respective block, as follows:
\[ DC_{EL} = \frac{\sum P_{EL}(x,y)}{N^{2}} \tag{1} \]
\[ DC_{BL} = \frac{\sum P_{BL}(x,y)}{N^{2}} \tag{2} \]
The final prediction is derived, at the pixel level, as follows:
\[ P_{Final}(x,y) = P_{EL}(x,y) - DC_{EL} + DC_{BL} \tag{3} \]
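Equations (1) to (3) can be sketched in a few lines. This is a minimal illustration rather than the implementation of [7]: blocks are plain N x N lists, the function names are hypothetical, and the rounding and clipping to the valid sample range that a codec would apply are left out.

```python
def block_mean(block):
    """DC value of an N x N block: the mean of all its pixels (Eqs. 1 and 2)."""
    n = len(block)
    return sum(sum(row) for row in block) / (n * n)


def intra_dc_correction(pred_el, rec_bl):
    """Replace the DC value of the EL predictor by that of the BL block (Eq. 3)."""
    dc_el = block_mean(pred_el)
    dc_bl = block_mean(rec_bl)
    return [[p - dc_el + dc_bl for p in row] for row in pred_el]


pred_el = [[10, 12], [14, 16]]   # EL intra predictor, DC = 13
rec_bl = [[20, 20], [22, 22]]    # collocated BL reconstruction, DC = 21
corrected = intra_dc_correction(pred_el, rec_bl)
```

The corrected block keeps the pixel-to-pixel variation of the EL predictor while its mean equals the BL mean.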
Figure 5: Weighted Intra Prediction algorithm, PEL(x,y) and PBL(x,y) are the pixel values of the EL intra predictor and collocated BL pixel, DH and DV are the horizontal and vertical distances of current pixel to its references
2.3.2. Weighted Intra Prediction
The algorithm proposed in [6] is another effort to improve intra prediction in the EL by employing the reconstructed BL texture. In general, as the distance between a pixel position and its reference samples increases, the spatial correlation decreases and the prediction becomes less accurate. To compensate for this, a weighting method is used in which the weight of $P_{BL}(x,y)$ increases as the distance between $P_{EL}(x,y)$ and its references increases. In this way, the loss of accuracy of $P_{EL}(x,y)$ with increasing distance is compensated for by the corresponding value from the BL. The final prediction signal is generated as follows:
\[ P_V(x,y) = (N - D_H) \times P_{EL}(x,y) + D_H \times P_{BL}(x,y) \tag{4} \]
\[ P_H(x,y) = (N - D_V) \times P_{EL}(x,y) + D_V \times P_{BL}(x,y) \tag{5} \]
\[ P_{Final}(x,y) = \left( P_V(x,y) + P_H(x,y) + N \right) \gg \left( \log_2(N) + 1 \right) \tag{6} \]
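A minimal integer-arithmetic sketch of Eqs. (4) to (6) is given below. The text does not state how $D_H$ and $D_V$ are measured, so the block-level helper assumes $D_H = x + 1$ and $D_V = y + 1$ (distance to a reference column and row just outside the block); this convention, like the function names, is an assumption made for illustration. N is assumed to be a power of two.

```python
def weighted_intra_pixel(p_el, p_bl, d_h, d_v, n):
    """Blend one EL predictor sample with the collocated BL sample (Eqs. 4-6)."""
    shift = (n.bit_length() - 1) + 1            # log2(N) + 1 for power-of-two N
    p_v = (n - d_h) * p_el + d_h * p_bl         # Eq. (4)
    p_h = (n - d_v) * p_el + d_v * p_bl         # Eq. (5)
    return (p_v + p_h + n) >> shift             # Eq. (6), N is the rounding offset


def weighted_intra_block(pred_el, rec_bl):
    """Apply the pixel rule to an N x N block.

    Assumption: D_H = x + 1 and D_V = y + 1.
    """
    n = len(pred_el)
    return [
        [weighted_intra_pixel(pred_el[y][x], rec_bl[y][x], x + 1, y + 1, n)
         for x in range(n)]
        for y in range(n)
    ]


flat_el = [[100] * 4 for _ in range(4)]
flat_bl = [[120] * 4 for _ in range(4)]
blended = weighted_intra_block(flat_el, flat_bl)
```

Under the assumed distance convention, the sample nearest the references stays close to the EL value and the bottom-right sample takes the BL value, matching the intent described above.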
CHAPTER 3
Mode-Dependent Pixel-Based Weighted Intra Prediction (MPWIP)
This chapter presents the proposed mode-dependent pixel-based weighted intra prediction. We start by introducing its concept of operations, showing first the decomposition of the EL intra predictor and the BL reconstructed block into their respective AC and DC components and then the combination of these components based on a pixel-based weighting scheme. As we shall see, the weight value associated with each component is designed to be a function of the prediction pixel’s position within the block, which we hereafter refer to as the weighting function. The weighting function represents the weighting scheme from the perspective of a single component (e.g. the AC or the DC component at the BL or at the EL) and characterizes the contribution of that component to estimating a set of sample intensities. It is optimized based on a least-squares fit to a set of training data. Lastly, a thorough analysis is conducted on how coding parameters, such as the EL intra prediction mode, QP setting, and prediction block size, affect the weighting functions of different components.
Figure 6: Proposed Pixel-based Weighted Intra Prediction scheme
3.1. Pixel-based Weighted Intra Prediction
3.1.1. Concept of Operations
The proposed scheme essentially combines the ideas of the two prior works [6] and [7]. As depicted in Figure 6, the texture information of the EL intra predictor and of the BL reconstructed block is first decomposed into DC and AC components. Each DC component is a prediction block whose pixels all take the average value of the input block (i.e., the EL intra predictor or the BL reconstructed block), and each AC component is the residual signal produced by subtracting the DC component from its respective input. Each of these components is then weighted by a separate pixel-based weighting scheme, and the results are summed to form the final predictor.
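The decomposition and weighted recombination described above can be sketched as follows. The data layout (a dictionary of four N x N weight grids keyed by hypothetical component names) is an assumption made for illustration and is not taken from the thesis software.

```python
def split_dc_ac(block):
    """Split an N x N block into its DC value and its AC (mean-removed) block."""
    n = len(block)
    dc = sum(sum(row) for row in block) / (n * n)
    ac = [[v - dc for v in row] for row in block]
    return dc, ac


def combine_components(pred_el, rec_bl, weights):
    """Sum the four components, each scaled by its own per-pixel weight.

    weights maps "ac_el", "dc_el", "ac_bl", "dc_bl" to N x N weight grids
    (hypothetical layout).
    """
    dc_el, ac_el = split_dc_ac(pred_el)
    dc_bl, ac_bl = split_dc_ac(rec_bl)
    n = len(pred_el)
    return [
        [weights["ac_el"][y][x] * ac_el[y][x]
         + weights["dc_el"][y][x] * dc_el
         + weights["ac_bl"][y][x] * ac_bl[y][x]
         + weights["dc_bl"][y][x] * dc_bl
         for x in range(n)]
        for y in range(n)
    ]


def constant_grid(value, n):
    return [[value] * n for _ in range(n)]


pred_el = [[10, 12], [14, 16]]
rec_bl = [[20, 20], [22, 22]]
# Weights (1, 0, 0, 1) keep the EL AC part and the BL DC part only.
idcc_like = {
    "ac_el": constant_grid(1.0, 2), "dc_el": constant_grid(0.0, 2),
    "ac_bl": constant_grid(0.0, 2), "dc_bl": constant_grid(1.0, 2),
}
final_pred = combine_components(pred_el, rec_bl, idcc_like)
```

With the constant weights chosen here the scheme reduces to the DC replacement of Eq. (3), since $P_{EL} - DC_{EL}$ is exactly the EL AC component; position-dependent weights generalize this special case.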
3.1.2. Least-Squares Solution
Obviously, how different components are weighted has a crucial effect on the resulting prediction performance. We wish to find a set of weighting functions so that the resulting prediction residual can be minimized. This problem can be solved using the well-known Least-Squares (LS) method.
To ease the subsequent discussion, we adopt the following notation: bold lower-case letters represent vectors, bold upper-case letters denote matrices, and italicized lower-case letters are scalars. Moreover, we use $\mathbf{a}_k = [a_k(1)\ a_k(2)\ \dots\ a_k(n)]^{T}$ and $\mathbf{b}_k = [b_k(1)\ b_k(2)\ \dots\ b_k(n)]^{T}$ to represent the predictor values at pixel $k$ that are extracted from the EL intra predictors and the BL reconstructed blocks, respectively, in the $n$ data samples collected during the training process.
Similarly, $\mathbf{ac}_k = [ac_k(1)\ ac_k(2)\ \dots\ ac_k(n)]^{T}$ and $\mathbf{dc}_k = [dc_k(1)\ dc_k(2)\ \dots\ dc_k(n)]^{T}$ denote the corresponding values from the AC and DC components, respectively. Thus, we have a weight vector $\mathbf{w}_k$ whose elements are the weight values to associate with the four components for prediction at pixel $k$. Specifically, the weight vector represents the weighting scheme from the perspective of a single pixel, describing how the corresponding samples from different components contribute to estimating the current pixel’s intensity.
With reference to the notation above, we further denote by $\mathbf{o}_k = [o_k(1)\ o_k(2)\ \dots\ o_k(n)]^{T}$ the target pixels at position $k$ in the $n$ collected blocks, whose intensity values are to be estimated. The problem of determining the optimal weight vector $\mathbf{w}_k^{*}$ in the least-squares-error sense can then be formulated as follows:
\[ \mathbf{w}_k^{*} = \arg\min_{\mathbf{w}_k} \left\| \mathbf{o}_k - \mathbf{X}_k \mathbf{w}_k \right\|^{2} \tag{10} \]
From linear algebra, this problem has the closed-form solution
\[ \mathbf{w}_k^{*} = \left( \mathbf{X}_k^{T} \mathbf{X}_k \right)^{-1} \mathbf{X}_k^{T} \mathbf{o}_k \tag{11} \]
By varying the index $k$ and repeating the same process, we can obtain the weight vectors for different pixel positions and thus the weighting functions for all four components.
3.1.3. Training Process
To collect the data for the basis functions used to compute the optimal weighting functions, a training process is introduced. First, to ensure that the collected data is as appropriate as possible, the proposed algorithm is applied as a new prediction mode, and this mode has to compete with all conventional modes in the rate-distortion optimization (RDO) process at the EL, which selects the mode with the lowest rate-distortion (RD) cost. Then, the blocks coded with the proposed algorithm are used to compute the optimal weighting functions according to Eq. (11).
According to Eq. (10), initial values of the weighting functions must be assigned for each iteration of the training process. In our training process, the first iteration uses weight values that correspond to averaging the texture information of the EL and the BL. The updated weighting functions then serve as the input weighting functions for the next iteration, and the process is repeated for all the sequences in the training set. Finally, the criteria for terminating the training process rest on the general expectation that, once the weighting functions are optimized, the mean square error (MSE) of all the coded pixels in the current iteration should be smaller than that of other iterations. In particular, the training process is terminated according to two criteria: 1) the weighting functions are stable (that is, they do not vary considerably compared to the previous iteration), and 2) the absolute difference in MSE between the current iteration and the immediately preceding iteration is below 1% of the MSE of the preceding iteration.
However, the resulting weighting functions are optimized only for a specific iteration of the training process. In addition, the sequences of the training set differ from the sequences of the test set (because it is important to test a model on data different from that used to develop it); therefore, the weighting functions used to measure the bit-rate savings on the test sequences are those that result from the training process.
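The closed-form solution of Eq. (11) and the two termination criteria can be sketched as follows. Several points are assumptions made for illustration: $\mathbf{X}_k$ is taken to be the n x 4 matrix whose columns hold the four component values at pixel $k$, the stability threshold `weight_tol` is a hypothetical number (the text only says the weights should not vary considerably), and the two criteria are read as both being required. All function names are hypothetical.

```python
def solve_linear(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    m = len(b)
    aug = [list(a[i]) + [b[i]] for i in range(m)]
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(m):
            if r != col:
                f = aug[r][col] / aug[col][col]
                aug[r] = [u - f * v for u, v in zip(aug[r], aug[col])]
    return [aug[i][m] / aug[i][i] for i in range(m)]


def least_squares_weights(x_rows, targets):
    """Eq. (11): w = (X^T X)^{-1} X^T o, via the normal equations.

    x_rows holds one row of component values per training sample.
    Fails with a division by zero if X^T X is singular, e.g. when one
    component is identically zero.
    """
    cols = len(x_rows[0])
    xtx = [[sum(r[i] * r[j] for r in x_rows) for j in range(cols)]
           for i in range(cols)]
    xto = [sum(r[i] * o for r, o in zip(x_rows, targets)) for i in range(cols)]
    return solve_linear(xtx, xto)


def should_stop(mse_prev, mse_curr, w_prev, w_curr, weight_tol=1e-2):
    """Termination test: stable weights and an MSE change below 1%."""
    stable = max(abs(p - c) for p, c in zip(w_prev, w_curr)) < weight_tol
    small_change = abs(mse_curr - mse_prev) < 0.01 * mse_prev
    return stable and small_change


# Synthetic check: targets generated from known weights are recovered.
true_w = [0.5, 0.2, 0.3, 0.8]
x_rows = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0],
          [0, 0, 0, 1], [1, 2, 3, 4], [2, -1, 0, 1]]
targets = [sum(w * v for w, v in zip(true_w, row)) for row in x_rows]
fitted_w = least_squares_weights(x_rows, targets)
```

Repeating `least_squares_weights` for every pixel position $k$ yields one weight per component and position, i.e. the four weighting functions; in practice a library least-squares routine would usually be preferred over explicit normal equations for numerical robustness.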
3.2. Weighting Function
To gain a better understanding of how different components should be weighted in forming a better predictor, this section analyzes in depth how the weighting functions of the different components depend on 1) the prediction mode taken by the EL, 2) the QP settings of the BL and EL, and 3) the prediction block size.
Figure 7: Pixel coordinate system showing the pixel at (0,0)
For notation, the QP values of the BL and EL are specified by the two-tuple QP(QPBL, QPEL), and the coordinate system shown in Figure 7 is used throughout the discussion that follows.
3.2.1. Effect of Intra Prediction Mode
This section investigates the effect of the intra prediction mode on the weighting function. Here the prediction mode refers to the intra prediction direction used to generate the EL predictor. Currently, our weighting scheme is restricted to the cases where the EL predictor is produced with the Horizontal, Vertical, DC, or Planar mode.
Figure 8 shows the weighting functions for the Vertical mode. It can be seen that those associated with the components from the same layer have a similar waveform, although their magnitudes differ considerably. Moreover, the weight value for the DC component of the EL is seen to be mostly lower than that of the BL, which justifies the IDCC algorithm’s substitution of the BL’s DC value for the EL’s. By comparing Figure 9, Figure 10, and Figure 11, we can further observe that the weighting functions vary with the prediction mode with which the EL predictor is formed; that is, they are mode dependent.
Figure 8: Vertical mode, block size of 16x16, QP(30,30). Each figure corresponds to the weighting function of (a) ACEL, (b) DCEL, (c) ACBL, (d) DCBL
Figure 9: Horizontal mode, block size of 16x16, QP(30,30). Each figure corresponds to the weighting function of (a) ACEL, (b) DCEL, (c) ACBL, (d) DCBL
Figure 10: DC mode, block size of 16x16, QP(30,30). Each figure corresponds to the weighting function of (a) ACEL, (b) DCEL, (c) ACBL, (d) DCBL. The ACEL component of the DC mode is not available and it is not weighted in the experiments.
Figure 11: Planar mode, block size of 16x16, QP(30,30). Each figure corresponds to the weighting function of (a) ACEL, (b) DCEL, (c) ACBL, (d) DCBL.
Another interesting point to be noted in Figure 10 is that, when the EL is coded in DC mode, in which case the EL predictor contains no AC component, the BL’s texture information dominates the creation of the final predictor, but the contribution from the EL is not insignificant.
Finally shown in Figure 11 are the weighting functions that resulted from the