Chapter 1 Introduction
1.2. Scope of This Thesis
Various transcoding algorithms offer different trade-offs between computational complexity and reconstructed video quality [3]-[11]. The most straightforward way to realize transcoding is to cascade a decoder and an encoder, as shown in Fig. 2. The decoder decodes the original input video, and the encoder re-encodes the decoded data subject to any new constraints. Such a cascaded architecture, which fully decodes the bitstream and fully re-encodes the video, is among the most computationally intensive approaches. In this thesis, we aim to introduce more efficient techniques that balance perceptual quality against computational complexity.
Fig. 2. Cascaded decoder and encoder transcoder
Based on the concepts of UMA, we build a simplified UMA model on the Internet. In this framework, the source video material is encoded and archived as MPEG-4 Fine Granularity Scalable (FGS) bitstreams. To make FGS-coded bitstreams accessible both to FGS-enabled devices and to devices that support only single-layer coding standards, a novel transcoder is proposed and implemented that converts the bitstream from the FGS format to single-layer video formats including MPEG-4 Simple Profile (SP), MPEG-2, and MPEG-1. Depending on the terminal capability, the universal transcoder can deliver video content in different formats.
The remainder of this thesis is organized as follows. Chapter 2 classifies various transcoding architectures and discusses the fundamental problems. In Chapter 3, a framework for FGS to single-layer transcoding is introduced. Chapter 4 proposes the drift compensation architecture. Experimental results and analysis are presented in Chapter 5. Finally, Chapter 6 gives the concluding remarks.
Chapter 2
State of the Art
In this chapter, we explore the prior art on homogeneous (within the same standard) and heterogeneous (between different standards) video transcoding. An overview of transcoding architectures and techniques is provided along with an evaluation and discussion.
2.1. Transcoding Architectures
Efficient and effective transcoding architectures are needed to expand the applications of multimedia information exchange. As mentioned in the previous chapter, it is always possible to perform transcoding with a cascaded approach. However, its computational cost is too high for it to be feasible in practice. The following subsections introduce several transcoding architectures that aim to reduce the computational complexity of the straightforward decoder-encoder implementation. The rationale behind these architectures is to reuse existing coding parameters or statistics from the input video bitstream. Such information may be used not only to simplify the computation, but also to maintain or even improve the visual quality.
2.1.1. Spatial-Domain Video Transcoding
Fig. 3 illustrates the Cascaded Pixel-Domain Transcoder (CPDT) [3], which is very similar to the brute-force method. It is a concatenation of a standard decoder and a simplified encoder. The encoder does not perform full-scale motion estimation (ME). Instead, it reuses the motion vectors as well as other information extracted from the input video bitstream. It is shown in [12] that the macroblock mode decision and ME modules account for about 70% of the overall encoder processing power. Hence, avoiding these computationally intensive operations can speed up the transcoding process by roughly three times.
Fig. 3. Cascaded Pixel-Domain Transcoder (CPDT)
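As a rough check on the "three times" figure, an Amdahl-style estimate can be applied to the encoder alone. The sketch below assumes that the fraction f = 0.7 reported in [12] is removed entirely and that the cost of the remaining encoder stages is unchanged; the symbols S and f are introduced here for illustration only.

```latex
S \;=\; \frac{1}{1-f} \;=\; \frac{1}{1-0.7} \;\approx\; 3.3
```

The decoder's own cost is not included in this estimate, so the end-to-end gain of the whole transcoder would be somewhat lower.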
Prior work based on the CPDT architecture focuses on improving the new motion vector (MV) information used in the encoder-loop. Generally, a homogeneous transcoder designed for bit rate adaptation requires no MV mapping operation. The MV Mapping unit in CPDT is used specifically for transcoding that involves spatial resolution adjustment or a heterogeneous format change. Since reduced resolution transcoding is beyond the scope of this thesis, we focus our attention on the problem of changes in the directionality of MVs in heterogeneous transcoding. For instance, in the MPEG-4 standard [2], an inter-coded macroblock carries either one MV for the complete macroblock or four MVs, one for each non-transparent 8×8 pel block forming the 16×16 pel macroblock, whereas MPEG-1/2 support only 16×16 prediction in progressive frames. When transcoding bitstreams from the MPEG-4 format into the MPEG-1/2 format, problems therefore arise if MVs are passed directly from the decoder to the encoder. Fig. 4 shows how multiple MVs are merged into a single MV when the coding modes differ. Hence, a motion vector mapping operation is required. A variety of methods have been discussed for deriving a new MV from the four MVs available in the input bitstream. Although these methods were originally developed for reduced resolution transcoding, they are also applicable in our scenario. One strategy is to select one of the incoming MVs at random [14]. A weighted average that takes the prediction error into account is presented in [13]. Other methods, such as median, majority, and average, are presented and compared in [12]. To further improve the prediction accuracy, MV refinement is performed in a small search window around the composite MV [15]. Other issues related to heterogeneous transcoding, such as picture type conversion [12] or frame rate reduction [15]-[17], are outside the scope of this thesis and are not discussed further.
Fig. 4. Motion vector composition
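The composition strategies listed above can be sketched in a few lines. The following is a minimal illustration, not the method of any cited work: the function name `compose_mv`, the integer MV units, and the reading of "median" as a vector median (the input MV with the smallest total distance to the others) are assumptions made here.

```python
import math
import random

def compose_mv(mvs, method="median", rng=None):
    """Derive one 16x16 MV from the four 8x8 MVs of a macroblock.

    mvs is a list of (x, y) integer pairs. All strategies are illustrative
    assumptions, not the exact procedures of [12]-[14].
    """
    if method == "average":
        # Component-wise mean, rounded back to the integer MV grid.
        n = len(mvs)
        return (round(sum(x for x, _ in mvs) / n),
                round(sum(y for _, y in mvs) / n))
    if method == "median":
        # Vector median: the MV closest to all the others.
        return min(mvs, key=lambda v: sum(math.dist(v, u) for u in mvs))
    if method == "random":
        # Pick one of the incoming MVs at random.
        return (rng or random).choice(mvs)
    raise ValueError(f"unknown method: {method}")

# Three consistent MVs and one outlier.
mvs = [(2, 0), (2, 1), (3, 0), (10, -8)]
avg_mv = compose_mv(mvs, "average")
med_mv = compose_mv(mvs, "median")
rnd_mv = compose_mv(mvs, "random", random.Random(0))
print(avg_mv, med_mv, rnd_mv)
```

In this example the average is pulled toward the outlier, while the vector median stays on one of the three consistent MVs, which is one reason a refinement search around the composite MV is often added.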
CPDT is usually considered to be free of drift error and is used as the benchmark for performance evaluation. Although the reconstructed frames in the decoder-loop and the encoder-loop do not match because of different quantization tables and other parameters, CPDT theoretically introduces no drift error. The reason is that the encoder-loop recomputes the coding residues against its own reference frames, so an improper MV cannot cause an unexpected reconstruction mismatch. In addition, CPDT is flexible with respect to coding-parameter changes. Because the decoder-loop and the encoder-loop are separate from each other, the transcoded video can be produced at different bit rates, frame rates, picture resolutions, coding modes, and even in different standards.
2.1.2. Frequency-Domain Video Transcoding
Frequency-domain video transcoding performs video decoding and re-encoding in the transform domain. In contrast to spatial-domain transcoding, the frequency-domain (also called transform-domain) transcoding architecture avoids three unnecessary transformations (one inverse transform in the decoder-loop, and one forward and one inverse transform in the encoder-loop) to achieve equal coding efficiency with lower complexity. This subsection introduces the core techniques in frequency-domain transcoding and reviews two commonly used types of frequency-domain transcoding architecture.
2.1.2.1. Generic Frequency-Domain Transcoder
Fig. 5. Source block extraction problem
Motion compensation in the DCT domain (MC-DCT) is the core technique in the frequency-domain transcoding architecture. It performs motion compensation (MC) in the DCT domain to reconstruct the video without converting to the pixel domain. Related work on MC-DCT can be found in [9], [18]-[20]. The design idea of MC-DCT is to establish the relationship between the motion compensated 8×8 block (P) and the related reference blocks (Q). Since the MV (∆x, ∆y) is usually not aligned to the 8×8 block grid (see Fig. 5), the motion compensated block covers at most four neighboring 8×8 blocks around the position indicated by the MV. The relationship in the pixel domain is represented as eqn. (1).
$$P = \sum_{i=0}^{3} H_i\, Q_i\, G_i \tag{1}$$
From eqn. (1), the motion compensated block P consists of the lower-right part of sub-block Q0, the lower-left part of Q1, the upper-right part of Q2, and the upper-left part of Q3. Each component is computed by the pre-multiplication of Hi, which shifts the sub-block of interest vertically, and the post-multiplication of Gi, which shifts the sub-block horizontally.
Since the DCT is a unitary (orthonormal) transform, it is distributive over matrix multiplication. Hence, we can express the DCT representation of eqn. (1) as
$$\hat{P} = \sum_{i=0}^{3} \hat{H}_i\, \hat{Q}_i\, \hat{G}_i \tag{2}$$
where the 2D-DCT of an 8×8 block A is represented as Â = DCT(A). The horizontal and vertical displacement matrices can be pre-computed and stored in memory.
In eqn. (2), the extraction of a single 8×8 block requires up to 4 × 8 × 64 × 2 = 4096 floating-point multiplications (four terms, each with two 8×8 matrix products of 8 × 64 multiplications), which is a heavy computational load. To speed up the MC-DCT of [18], several faster implementations of eqn. (2) have been proposed. In [19], MC-DCT is simplified by factorizing the displacement matrices into relatively sparse matrices so that the number of computations is reduced. The work in [9] approximates the elements of Ĥi and Ĝi by binary numbers and replaces the multiplications with shifters and adders. Another efficient computation method on a macroblock basis is derived in [20]. It exploits information shared within a macroblock, such as the MV and common blocks, to yield a substantial speedup.
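The equivalence between block extraction in the pixel domain, eqn. (1), and in the DCT domain, eqn. (2), can be checked numerically. The sketch below is an illustration under stated assumptions, not the implementation of [18]: the displacement matrices are built here as plain 0/1 shift matrices, the block layout (Q0 top-left, Q1 top-right, Q2 bottom-left, Q3 bottom-right) follows the description of eqn. (1), and all function names are hypothetical.

```python
import math

N = 8  # block size

def matmul(a, b):
    """Plain N x N matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(N)) for j in range(N)]
            for i in range(N)]

def add(a, b):
    """Element-wise sum of two N x N matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

# Orthonormal DCT-II basis, so that T * T^t = I.
T = [[(math.sqrt(1 / N) if k == 0 else math.sqrt(2 / N))
      * math.cos((2 * n + 1) * k * math.pi / (2 * N)) for n in range(N)]
     for k in range(N)]
T_t = [list(col) for col in zip(*T)]

def dct2(a):
    """2D-DCT of an N x N block: T * a * T^t."""
    return matmul(matmul(T, a), T_t)

def shift(k):
    """0/1 matrix with ones where (column - row) == k."""
    return [[1.0 if c - r == k else 0.0 for c in range(N)] for r in range(N)]

def displacement(dy, dx):
    """(H_i, G_i) pairs for Q0..Q3, given the offset (dy, dx) of P inside Q0."""
    h_top, h_bottom = shift(dy), shift(dy - N)
    g_left, g_right = shift(-dx), shift(N - dx)
    return [(h_top, g_left), (h_top, g_right),
            (h_bottom, g_left), (h_bottom, g_right)]

def mc_pixel(q, dy, dx):
    """Eqn. (1): sum of H_i * Q_i * G_i in the pixel domain."""
    total = [[0.0] * N for _ in range(N)]
    for (h, g), qi in zip(displacement(dy, dx), q):
        total = add(total, matmul(matmul(h, qi), g))
    return total

def mc_dct(q_hat, dy, dx):
    """Eqn. (2): the same sum, using DCT-domain matrices only."""
    total = [[0.0] * N for _ in range(N)]
    for (h, g), qi in zip(displacement(dy, dx), q_hat):
        total = add(total, matmul(matmul(dct2(h), qi), dct2(g)))
    return total

# Synthetic 16 x 16 reference area split into the four 8 x 8 blocks.
frame = [[(7 * r + 13 * c) % 256 for c in range(2 * N)] for r in range(2 * N)]

def block(r0, c0):
    return [row[c0:c0 + N] for row in frame[r0:r0 + N]]

q = [block(0, 0), block(0, N), block(N, 0), block(N, N)]
dy, dx = 3, 5
direct = [row[dx:dx + N] for row in frame[dy:dy + N]]  # ground-truth block P

pixel_result = mc_pixel(q, dy, dx)
dct_result = mc_dct([dct2(b) for b in q], dy, dx)
reference = dct2(direct)
max_error = max(abs(a - b)
                for ra, rb in zip(dct_result, reference)
                for a, b in zip(ra, rb))
print("max DCT-domain deviation:", max_error)
```

The pixel-domain sum reproduces the extracted block exactly, while the DCT-domain result agrees only up to floating-point error. In a real transcoder the transformed displacement matrices would be pre-computed rather than recomputed per block as done here.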
The generic frequency-domain transcoder is built on the MC-DCT technique so that the entire transcoding operation can be carried out in the frequency domain. Since MC can be performed in the frequency domain, the DCT and IDCT operations in Fig. 3 can be omitted, which yields a structurally more efficient transcoder. Fig. 6 shows the design flow of the Cascaded DCT-Domain Transcoder (CDDT) proposed in [6], which performs transcoding entirely in the frequency domain without any DCT/IDCT transformations.
There are two design issues in the generic frequency-domain transcoding architecture. The first is that MC-DCT may introduce reconstruction mismatches that cause transcoding drift error. Ideally, CDDT is a functionally equivalent representation of CPDT, which is drift-free. However, the matrix multiplications in MC-DCT may introduce precision mismatches, and frame reconstruction in the frequency domain may introduce rounding errors. Although these mismatches may cause drift, this type of transcoding error leads only to slight quality degradation and is nearly negligible.
The second design issue is the operational complexity of the MC-DCT. The complexity of CDDT depends strongly on how MC-DCT is implemented, and the complexity of MC-DCT in turn depends on the complexity of matrix multiplication. Thus, an efficient matrix multiplication in MC-DCT can greatly improve the efficiency of frequency-domain transcoding.
Fig. 6. Cascaded DCT-Domain Transcoder (CDDT)
2.1.2.2. Simplified Frequency-Domain Transcoder
Further complexity reduction can be achieved by analyzing and removing the redundancy in CDDT. For real-time applications, the complexity of CDDT is still too high. Analyzing the CDDT shown in Fig. 6, we find that the frame reconstruction in the decoder-loop and in the encoder-loop behaves almost identically under the assumption of the same MC behavior and the same quantization step size. The MC behavior depends on the video coding standard and is not an issue if both the decoder-loop and the encoder-loop specify the same MC structure. The quantization step size is controlled by encoding mechanisms such as rate control. Therefore, to simplify the CDDT architecture, the two frame reconstruction operations can be merged by using a shared MC to compensate for the quantization mismatches between the decoder-loop and the encoder-loop. Such an idea can be easily realized in homogeneous transcoding, which uses the same MC structure. The simplified transcoders in [7]-[9] are designed on this idea to save one more MC operation.
To identify the design idea, a brief derivation is provided as follows. From Fig. 6, the residual in the encoder-loop is given by
$$X_n^{+} = X_n - \mathrm{MC}^{(2)}\!\left(X_{n-1}^{*},\, \mathit{mv}^{(2)}\right) \tag{3}$$
where MC(.) is the motion compensation process, the subscript on the variable indicates time, and the superscript of “1” and “2” represents the decoder-loop and the encoder-loop, respectively. Here, we denote a signal with quantization effect as
$$X^{*} = \mathrm{IQ}\left(\mathrm{Q}(X)\right) \tag{4}$$
where Q(.) and IQ(.) stand for quantization and inverse quantization, respectively. The reconstructed signal in the decoder-loop is given by
$$X_n = B_n + \mathrm{MC}^{(1)}\!\left(Y_{n-1},\, \mathit{mv}^{(1)}\right) \tag{5}$$
Substituting eqn. (5) into eqn. (3), we obtain
$$X_n^{+} = B_n + \mathrm{MC}^{(1)}\!\left(Y_{n-1},\, \mathit{mv}^{(1)}\right) - \mathrm{MC}^{(2)}\!\left(X_{n-1}^{*},\, \mathit{mv}^{(2)}\right) \tag{6}$$
Assuming mv(1) = mv(2) (i.e., MVs are not recalculated) and the sub-pixel MCs in the decoder-loop and encoder-loop perform the same interpolation filtering, it can be stated that
$$\mathrm{MC}^{(1)}\!\left(X_n,\, \mathit{mv}^{(1)}\right) = \mathrm{MC}^{(2)}\!\left(X_n,\, \mathit{mv}^{(2)}\right) \tag{7}$$
Based on the assumption that the MC is a linear operation, i.e., MC(X + Y, mv) = MC(X, mv) + MC(Y, mv), we may rewrite eqn. (6) as
$$X_n^{+} = B_n + \mathrm{MC}\!\left(Y_{n-1} - X_{n-1}^{*},\, \mathit{mv}^{(1)}\right) \tag{8}$$
From eqn. (8), the prediction residual in the encoder-loop of the transcoder can be obtained by adding the motion compensated frame differences to the incoming prediction residual.
Since Yn = Xn, we have
$$X_n^{+} = B_n + \mathrm{MC}\!\left(X_{n-1} - X_{n-1}^{*},\, \mathit{mv}^{(1)}\right) \tag{9}$$
Furthermore, we may get the corresponding equivalent equation for Xn-1 – X*n-1 by applying eqn. (3).
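$$X_{n-1} - X_{n-1}^{*} = \left[X_{n-1}^{+} + \mathrm{MC}^{(2)}\!\left(X_{n-2}^{*},\, \mathit{mv}^{(2)}\right)\right] - \left[X_{n-1}^{+*} + \mathrm{MC}^{(2)}\!\left(X_{n-2}^{*},\, \mathit{mv}^{(2)}\right)\right] = X_{n-1}^{+} - X_{n-1}^{+*} \tag{10}$$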
Finally, eqn. (9) is reduced to
$$X_n^{+} = B_n + \mathrm{MC}\!\left(X_{n-1}^{+} - X_{n-1}^{+*},\, \mathit{mv}^{(1)}\right) \tag{11}$$
Based on eqn. (11), the architecture in Fig. 6 is transformed into the architecture in Fig. 7.
This is referred to as the Simplified DCT-Domain Transcoder (SDDT).
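The merging of the two loops can be illustrated with a toy one-dimensional model. The sketch below is a simplified check under stated assumptions, not the SDDT implementation itself: frames are short integer lists, MC is an integer circular shift (and therefore exactly linear), requantization is a uniform quantizer, and every function name is hypothetical. It compares the requantized residuals of a two-loop cascade with those of a single loop that stores only the requantization error of the previous frame.

```python
def mc(frame, mv):
    """Toy motion compensation: circular shift by an integer displacement."""
    n = len(frame)
    return [frame[(i - mv) % n] for i in range(n)]

def requant(block, step):
    """IQ(Q(x)) for a uniform quantizer with the given step size."""
    return [step * ((x + step // 2) // step) for x in block]

def cascaded(residuals, mvs, step):
    """Two-loop reference: full decoder loop followed by full encoder loop."""
    n = len(residuals[0])
    dec_ref = [0] * n  # decoder-loop reference frame
    enc_ref = [0] * n  # encoder-loop reference frame
    out = []
    for b, mv in zip(residuals, mvs):
        x = [bi + pi for bi, pi in zip(b, mc(dec_ref, mv))]  # decoder reconstruction
        pred = mc(enc_ref, mv)
        e = [xi - pi for xi, pi in zip(x, pred)]             # encoder-loop residual
        e_q = requant(e, step)
        dec_ref = x
        enc_ref = [qi + pi for qi, pi in zip(e_q, pred)]     # encoder reconstruction
        out.append(e_q)
    return out

def simplified(residuals, mvs, step):
    """Single-loop version: the buffer holds only the requantization error."""
    n = len(residuals[0])
    diff = [0] * n  # previous residual minus its requantized version
    out = []
    for b, mv in zip(residuals, mvs):
        e = [bi + di for bi, di in zip(b, mc(diff, mv))]     # incoming residual + MC(error)
        e_q = requant(e, step)
        diff = [ei - qi for ei, qi in zip(e, e_q)]
        out.append(e_q)
    return out

# Incoming residuals B_n for three frames, with reused MVs and a coarser step.
residuals = [[10, -3, 7, 0, 5, -8, 2, 1],
             [1, 2, -1, 0, 3, -2, 0, 4],
             [0, -1, 2, 5, -3, 1, 1, 0]]
mvs = [0, 1, -2]
two_loop = cascaded(residuals, mvs, step=4)
one_loop = simplified(residuals, mvs, step=4)
print(two_loop == one_loop)
```

Because the toy MC is exactly linear and the MVs are reused, the two outputs coincide; with sub-pixel interpolation and rounding the match would only be approximate.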
Significant complexity reduction is attained in SDDT. Compared to CPDT in Fig. 3, SDDT not only eliminates the DCT/IDCT, but also halves the frame buffer requirement. Only one MC loop is required, and its frame buffer stores the difference between the reconstructed pictures in the decoder-loop and the encoder-loop. This complexity reduction comes at the expense of the flexibility of cascaded architectures. In the above derivation, SDDT assumes that the MVs after transcoding are the same as those before transcoding in order to merge the two MCs. Because the architecture also assumes the same MC structure on both sides, SDDT is limited to applications such as bit rate transcoding.
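The derivation also relied on MC being a linear operation. A small numerical example suggests why this holds only approximately once sub-pixel interpolation with integer rounding is involved. The half-pel rule used below (average of two neighbors, rounded up) is an assumption chosen for illustration and is not taken from any particular standard.

```python
def half_pel(a, b):
    """Half-pel sample between two integer pixels, with integer rounding."""
    return (a + b + 1) // 2

x = [1, 2]  # two neighboring pixels of one signal
y = [0, 1]  # two neighboring pixels of another signal

# MC applied to the sum of the signals.
mc_of_sum = half_pel(x[0] + y[0], x[1] + y[1])
# Sum of MC applied to each signal separately.
sum_of_mc = half_pel(x[0], x[1]) + half_pel(y[0], y[1])

print(mc_of_sum, sum_of_mc)  # the two values differ by one
```

The one-level difference arises purely from rounding, which is the kind of small mismatch that can accumulate as drift when the two MC loops are merged.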
Fig. 7. Simplified DCT-Domain Transcoder (SDDT)
Fig. 8. Open-loop transcoder
2.2. Video Transcoding Techniques
Video transcoding techniques are built upon the transcoding architectures presented in Section 2.1 and improve the transcoding performance by adjusting the encoding parameters. Two common techniques, intra refresh and rate control, are reviewed here.
2.2.1. Intra-Refresh Technique
To stop the drift propagation of errors introduced in reduced resolution transcoding, an intra-refresh transcoding technique is proposed in [10]. The intra-refresh technique adaptively forces the inter-coded blocks to be intra-coded based on drift estimation in the compressed bitstream. Since intra-coded blocks will not use the other frames for image reconstruction, this type of conversion stops the drift propagation.
Fig. 9. Intra refresh in open-loop architecture
Fig. 9 shows an open-loop transcoder in which the intra-refresh technique is applied. Depending on the intra-refresh rate, which is the percentage of intra-coded macroblocks in one frame, the Inter-to-Intra Conversion module in Fig. 9 either passes the inverse quantized DCT coefficients through unchanged or replaces them with the reconstructed coefficients from the frame memory. The intra-refresh rate is adaptively adjusted according to the estimated amount of drift. It should be noted that more bits are usually required for coding intra-coded blocks. Therefore, the intra-refresh operation and the rate control must be considered jointly.
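The selection step can be sketched as follows. This is a hypothetical policy for illustration only, not the algorithm of [10]: it assumes that a per-macroblock drift estimate is already available and simply converts the macroblocks with the largest estimates until the target intra-refresh rate is met. The function name and the input values are invented for the example.

```python
def select_intra_refresh(drift_estimates, refresh_rate):
    """Return the indices of macroblocks to convert from inter to intra.

    drift_estimates: one non-negative drift estimate per macroblock (assumed given).
    refresh_rate: target fraction of intra-coded macroblocks in the frame.
    """
    count = round(refresh_rate * len(drift_estimates))
    # Rank macroblocks by estimated drift, largest first.
    ranked = sorted(range(len(drift_estimates)),
                    key=lambda i: drift_estimates[i], reverse=True)
    return sorted(ranked[:count])

drift = [0.1, 2.5, 0.0, 1.2, 0.7, 3.1, 0.2, 0.4]
refreshed = select_intra_refresh(drift, refresh_rate=0.25)
print(refreshed)
```

Since each converted macroblock costs additional bits, a practical scheme would pick the refresh rate together with the rate control rather than fixing it in advance as done here.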
Although the intra-refresh technique is able to correct drift errors, it does so at the cost of an additional MC and frame memory, which are needed to reconstruct the reference frame for the inter-to-intra conversion of the DCT coefficients. The architecture in Fig. 9 may still seem to require less memory and computation than CPDT and CDDT, but only because the open-loop transcoder on which the intra-refresh technique is implemented needs no MC prediction loop at all. If the intra-block refresh method is realized in closed-loop architectures [11], it remains debatable whether any complexity reduction is possible.
2.2.2. Rate Control Issues
The purpose of rate control is to provide better and consistent video quality under the bandwidth constraint. It involves two basic steps, picture-layer bit allocation and macroblock-layer rate control. The picture-layer bit allocation determines the target bit budget for each frame. The macroblock-layer rate control adjusts the quantization parameters for coding the macroblocks. Generally speaking, all rate-control algorithms designed for video coding are applicable to transcoding.
Rate control in transcoding aims either at accurate bit rate adaptation or at improving the coding efficiency by exploiting the coding statistics collected from the input compressed bitstream. The design issue for bit rate transcoding is essentially the same as that for conventional video coding: to allocate bits to each picture in proportion to its complexity such that the output rate complies with the bit rate constraint. The only difference lies in the availability of content characteristics for transcoding. A straightforward implementation of bit rate transcoding might scale the input bits of each frame, which can be easily obtained from the pre-encoded video stream, according to the rate conversion ratio. Better bit allocation is possible by further exploiting the correlations between the input and the output picture complexities [21]. In [22], a ρ-domain rate-distortion model is adopted to obtain the optimal number of bits for each frame. This frame-level rate-distortion information is pre-generated in the front-encoder and transmitted to the transcoder as side information. The work in [9] derives the optimal set of quantizer scales based on Lagrangian optimization.
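The straightforward scaling approach mentioned above can be written down directly. The sketch below is a minimal illustration: the function name and the example bit counts are assumptions, and the more refined allocation schemes of [21], [22], and [9] are not modeled.

```python
def scale_frame_bits(input_bits, input_rate, output_rate):
    """Scale each frame's input bit count by the rate conversion ratio."""
    ratio = output_rate / input_rate
    return [bits * ratio for bits in input_bits]

# Bits spent on three frames of the input stream (hypothetical values).
input_bits = [120000, 40000, 40000]
targets = scale_frame_bits(input_bits, input_rate=1_000_000, output_rate=500_000)
print(targets)
```

This keeps the relative bit distribution of the input stream, which implicitly assumes that picture complexity at the output rate is proportional to that at the input rate.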
2.3. Evaluation of Transcoding Architectures
In the previous sections, we have discussed the transcoding architectures and transcoding techniques. Each architecture makes a different trade-off between computational complexity and visual quality. This section analyzes these transcoding architectures in terms of complexity and drift error.
2.3.1. Complexity Analysis
Table 2 shows the complexity analysis for four types of transcoding architecture. The first type, referred to as DEC-ENC, is the straightforward method that fully decodes the input bitstream and fully re-encodes the reconstructed video from the decoder side. This method saves no computation and is the most computationally intensive. It needs 1 ME, 3 DCT/IDCT, and 2 MC operations. Type II implements the CPDT architecture, which reconstructs the video in the pixel domain. It saves 1 ME compared to Type I. Type III implements the CDDT architecture, which is a generic DCT-domain transcoder. It saves 3 more DCT/IDCT operations compared to Type II. Type IV has the lowest computational complexity of the four types. It implements the simplified CDDT (also referred to as SDDT), which saves 1 more MC operation and 1 more frame buffer compared to Type III. From the viewpoint of computational complexity, Type I requires the most computation and Type IV is the most efficient architecture, with a computation reduction of more than 50%.
Table 2. Complexity analysis of four transcoding architectures
Type  Architecture  ME  Frame Buffer  DCT/IDCT  MC (Spatial)  MC (Transform)
I     DEC-ENC       1   2             3         2             0
II    CPDT          0   2             3         2             0
III   CDDT          0   2             0         0             2
IV    SDDT          0   1             0         0             1
2.3.2. Drift Error Analysis
Drift errors come from imperfect frame reconstruction during the transcoding procedure, which causes mismatches to propagate between frames. Analyzing the four architectures in Table 2, the mismatches come from two major sources. The first type of mismatch comes from arithmetic operations, including rounding errors and precision conversion errors, and is referred to as arithmetic error in this thesis. We have identified three possible sources of arithmetic error. The first source relates to floating-point operations in transcoding. For example, unlike pixel-domain MC, DCT-domain MC reconstructs the video in the DCT domain through floating-point matrix multiplication. Since no processor can provide infinite precision for such numeric data, mismatch is introduced. The second and third sources of error are due to the failure of the linearity assumption on which the derivation of SDDT is based, and hence are unique to SDDT. In the derivation of SDDT, in order to merge the MC operations in the decoder-loop and encoder-loop, we assumed that MC is a linear operation, which is not strictly true in practice. The second source of error comes from the