Chapter 2 Background
2.1 Scalable Video Coding
The scalable video coding (SVC) standard [1] is an extension of the H.264/AVC standard [2] developed by the Joint Video Team (JVT) that uses a single bit-stream to provide multiple frame rates, frame sizes and quality levels while achieving a reasonable coding efficiency.
SVC provides three types of scalability: (1) Spatial scalability allows multiple frame sizes. (2) Temporal scalability allows multiple frame rates. (3) SNR scalability allows multiple quality levels. SNR scalability consists of coarse grain scalability (CGS) and fine grain scalability (FGS); FGS additionally allows flexible truncation of the coded data. SVC also supports combined scalability: the three types can be used together in a single bit-stream to cover a wide range of frame sizes, frame rates and quality levels.
The coded data of SVC bit-streams are organized as multiple layers. The base layer provides the basic video quality at the minimum supported bit rate. The enhancement layers successively refine the video quality. SVC provides flexible bit-stream extraction to obtain the desired resolutions or bit-rates on-the-fly.
2.1.1 Encoder Overview
This section presents an overview of the SVC encoder. The encoding is based on a layered approach that uses separate encoder loops for each layer and uses adaptive inter-layer prediction techniques to exploit correlations among the layers.
Spatial scalability and CGS are achieved by multiple layers with a pyramid structure. Temporal scalability is achieved by a temporal decomposition using hierarchical B pictures. FGS is achieved by encoding successive refinements of the transform coefficients.
Figure 2 [3] depicts an example of an SVC encoder with three spatial layers. Each layer is encoded with a separate encoder loop, as shown in the dotted boxes.
Figure 2: SVC encoder structure with three spatial layers [3]
The input video is spatially scaled to support multiple spatial resolutions. For each spatial layer, the prediction comes from either temporally neighboring pictures at the same layer or spatially up-sampled pictures from lower layers. The inter-layer prediction scheme can reuse the texture, motion and residual information of the lower layers to improve the coding efficiency. After prediction, the transform coefficients at each spatial layer are encoded with either a scalable entropy coder for FGS or a non-scalable entropy coder for CGS.
2.1.2 Hierarchical-B Prediction Structure
SVC uses a hierarchical-B prediction structure to support multilevel temporal scalability. Figure 3 depicts a hierarchical-B prediction structure with 4 temporal levels and a GOP size of 8. Each key picture (black) is either an intra-coded I-frame or a P-frame that uses the previous key picture as its reference picture. Each B-frame is bi-directionally predicted from two reference pictures at lower temporal levels, one preceding and one following it in display order. The pictures are hierarchically predicted as illustrated.
Display order    0      1   2   3   4   5   6   7   8      9   10  11  12  13  14  15  16
Picture type     I0/P0  B3  B2  B3  B1  B3  B2  B3  I0/P0  B3  B2  B3  B1  B3  B2  B3  I0/P0
Temporal level   T0     T3  T2  T3  T1  T3  T2  T3  T0     T3  T2  T3  T1  T3  T2  T3  T0
(Two groups of pictures (GOPs) are shown.)
Figure 3: Hierarchical-B prediction structure with a GOP size of 8
The pictures of lower temporal levels are encoded first, so that the pictures of higher temporal levels can refer to the reconstructed pictures at lower temporal levels. The higher temporal levels are not required for the decoding of the lower temporal levels.
Each GOP can be independently decoded if the preceding key picture is available.
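The temporal levels and the coding order described above can be reproduced with a few lines of Python. This is only a minimal sketch: it assumes a dyadic structure (the GOP size is a power of two), it assumes that each GOP ends with its key picture, and the function names are hypothetical rather than taken from any SVC reference software.

```python
def temporal_level(display_idx: int, gop_size: int = 8) -> int:
    """Temporal level of a picture in a dyadic hierarchical-B structure."""
    if gop_size < 1 or gop_size & (gop_size - 1):
        raise ValueError("gop_size must be a power of two")
    num_b_levels = gop_size.bit_length() - 1  # log2(gop_size)
    pos = display_idx % gop_size
    if pos == 0:
        return 0  # key picture (I-frame or P-frame)
    # Count the trailing zero bits of pos: more zeros means a lower level.
    trailing_zeros = (pos & -pos).bit_length() - 1
    return num_b_levels - trailing_zeros


def reference_pictures(display_idx: int, gop_size: int = 8) -> tuple:
    """Display indices of the reference pictures of one picture."""
    level = temporal_level(display_idx, gop_size)
    if level == 0:
        # Key picture: previous key picture when coded as a P-frame
        # (an I-frame uses no reference at all).
        return (display_idx - gop_size,) if display_idx >= gop_size else ()
    step = gop_size >> level
    # B-frame: nearest pictures of a lower temporal level on both sides.
    return (display_idx - step, display_idx + step)


def coding_order(num_gops: int, gop_size: int = 8) -> list:
    """Coding order in display indices: lower temporal levels first."""
    order = [0]  # first key picture
    for g in range(num_gops):
        start = g * gop_size
        # Assumption: a GOP covers the pictures after one key picture
        # up to and including the next key picture.
        gop = range(start + 1, start + gop_size + 1)
        order += sorted(gop, key=lambda i: (temporal_level(i, gop_size), i))
    return order
```

For a GOP size of 8, `temporal_level` yields the level pattern shown in Figure 3, and `coding_order(1)` places the key picture 8 before picture 4, followed by pictures 2 and 6 and then the odd-numbered pictures.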
2.1.3 Inter-layer Prediction Structure
The inter-layer prediction structure is configured according to the types of layers used. The spatial and CGS layers can flexibly select their reference layer from any lower layer, while the FGS layers must be predicted from the previous SNR layer at the same resolution.
Figure 4 [3] depicts an example of inter-layer prediction with three spatial layers. Each spatial resolution contains several SNR layers. In the first column, BASE_0_0 is the base layer of spatial layer 0. On top, CGS_1_0 and CGS_2_0 are encoded as CGS layers, which are predicted from BASE_0_0 and CGS_1_0, respectively.
Figure 4: Inter-layer prediction structure with three spatial layers [3]
Note that SVC allows flexible selection of reference layers, such that decoding a certain layer may not need all of its lower layers. As shown, CGS_4_0 refers to CGS_2_0 instead of BASE_3_0, while BASE_3_0 refers to CGS_1_0 instead of CGS_2_0. Therefore, CGS_2_0 is not necessary for decoding BASE_3_0, while BASE_3_0 is not necessary for decoding CGS_4_0. Such flexibility leaves room for further optimizations on performance or error resilience.
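The set of layers needed to decode a given layer can be found by following the reference-layer links. The sketch below encodes only the references stated in the text for Figure 4; the dictionary and the function name are hypothetical, and layers of the figure that the text does not describe are left out.

```python
# Reference layer of each layer, as stated in the text for Figure 4.
# None marks a layer that has no reference layer.
REFERENCE_LAYER = {
    "BASE_0_0": None,
    "CGS_1_0": "BASE_0_0",
    "CGS_2_0": "CGS_1_0",
    "BASE_3_0": "CGS_1_0",
    "CGS_4_0": "CGS_2_0",
}


def required_layers(target, reference_layer=REFERENCE_LAYER):
    """Layers needed to decode `target`, listed from the base layer upwards."""
    chain = []
    layer = target
    while layer is not None:
        if layer in chain:
            raise ValueError("cyclic reference at " + layer)
        chain.append(layer)
        layer = reference_layer[layer]
    return chain[::-1]
```

With this table, `required_layers("BASE_3_0")` does not contain CGS_2_0, and `required_layers("CGS_4_0")` does not contain BASE_3_0, which matches the observation above.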
2.1.4 Network Abstraction Layer
The elementary unit for the output of an SVC encoder and the input of an SVC decoder is a network abstraction layer (NAL) unit. The NAL unit structure is designed to provide convenient packetization of coded video data for different transport layers or storage media. NAL units can be categorized into two types:
Video coding layer (VCL) NAL
VCL NAL units carry the coded data of video pictures. They include coded slice, coded slice data partition and suffix NAL units. A coded slice NAL unit contains data of one or more coded macroblocks. A coded slice data partition NAL unit contains partitioned data of a coded slice. A suffix NAL unit contains descriptive information of the preceding NAL unit.
Non-VCL NAL
Non-VCL NAL units contain associated information such as parameter sets and supplemental enhancement information (SEI). Parameter sets contain infrequently changing information that is essential for decoding sequences of video pictures. SEI messages contain additional information that is not required for decoding but assists in related processes such as frame output timing, error concealment, and resource reservation. Non-VCL NAL units can be sent out-of-band using a more reliable transport mechanism.
The NAL unit consists of a header followed by a byte string of payload data.
Figure 5 [4] depicts the SVC NAL header, which consists of the one-byte H.264/AVC header and the three-byte SVC extension header.
Figure 5: SVC NAL header structure [4]
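As an illustration, the four header bytes can be unpacked as sketched below. The layout of the first byte (forbidden_zero_bit, nal_ref_idc, nal_unit_type) is that of H.264/AVC. The bit positions used for the three-byte extension are an assumption: they follow the final H.264 Annex G syntax, in which the fields are named quality_id and temporal_id, and the layout shown in Figure 5 may differ if it was taken from an earlier draft. The class and function names are hypothetical.

```python
from typing import NamedTuple


class SvcNalHeader(NamedTuple):
    forbidden_zero_bit: int
    nal_ref_idc: int
    nal_unit_type: int
    dependency_id: int
    quality_level: int
    temporal_level: int


def parse_svc_nal_header(data: bytes) -> SvcNalHeader:
    """Unpack the one-byte AVC header and the three-byte SVC extension."""
    if len(data) < 4:
        raise ValueError("need at least 4 header bytes")
    b0, _b1, b2, b3 = data[:4]
    return SvcNalHeader(
        # Byte 0 (H.264/AVC): 1 + 2 + 5 bits.
        forbidden_zero_bit=b0 >> 7,
        nal_ref_idc=(b0 >> 5) & 0x03,
        nal_unit_type=b0 & 0x1F,
        # Byte 1 carries flag and priority bits that this sketch skips.
        # Byte 2 (assumed layout): 1 flag bit, 3 bits dependency_id,
        # 4 bits quality level.
        dependency_id=(b2 >> 4) & 0x07,
        quality_level=b2 & 0x0F,
        # Byte 3 (assumed layout): 3 bits temporal level, then flag bits.
        temporal_level=b3 >> 5,
    )
```

For example, `parse_svc_nal_header(bytes([0x74, 0x80, 0x21, 0x67]))` yields nal_unit_type 20, dependency_id 2, quality_level 1 and temporal_level 3 under the assumed layout.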
Each NAL unit belongs to certain scalability levels and is tagged with the syntax elements dependency_id, temporal_level and quality_level. In SVC, a layer is defined as a set of NAL units with the same value of dependency_id.
The syntax element dependency_id indicates the layer identifier of spatial and CGS layers.
The syntax element temporal_level indicates the hierarchical level of temporal prediction, which relates to the frame rate.
The syntax element quality_level indicates the quality level of FGS layers.
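The three syntax elements are what makes bit-stream extraction a simple filtering operation. The sketch below shows one possible extraction rule and is not the normative extraction process: it assumes that all quality levels of the lower layers are kept because the target layer may be predicted from them, it does not exploit the flexible reference-layer selection of Section 2.1.3, and it ignores non-VCL NAL units. The NalUnit type and the function name are hypothetical.

```python
from typing import NamedTuple


class NalUnit(NamedTuple):
    dependency_id: int
    temporal_level: int
    quality_level: int
    payload: bytes = b""


def extract_substream(nal_units, target_dependency_id,
                      max_temporal_level, target_quality_level):
    """Keep the NAL units needed for the requested operating point."""
    kept = []
    for nal in nal_units:
        if nal.temporal_level > max_temporal_level:
            continue  # higher frame rates are not requested
        if nal.dependency_id > target_dependency_id:
            continue  # higher spatial/CGS layers are not requested
        if (nal.dependency_id == target_dependency_id
                and nal.quality_level > target_quality_level):
            continue  # extra FGS refinement of the target layer
        kept.append(nal)
    return kept
```

Lowering `max_temporal_level` by one drops the highest temporal level, which halves the frame rate in a dyadic hierarchical-B structure such as the one in Figure 3.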