CHAPTER 2 COMMON ARTIFACTS BY ZERO-QUANTIZATION AND NUMERICAL
2.3. Concluding Remarks
In this chapter, we have considered the two common zero-quantization artifacts,
“band-limited” and “birdie” artifacts. An audio patch method comprising two schemes, ZBD and HFR, has been proposed to reduce the two artifacts. The patch method can be incorporated into transform- or subband-based audio decoders, such as MP3, AAC and HE-AAC. In addition, for the computation of the cosine-modulated filterbank, we have proposed a fast radix-q DCT-IV algorithm to resolve the conflict between parallelism and the numerical distortion artifact in the existing algorithms. The radix-q algorithm can be extended into a mixed-radix algorithm for the DCT-IV computation of composite lengths, with merits in parallelism, numerical stability and computational complexity.
CHAPTER 3
ARTIFACTS IN TEMPORAL NOISE SHAPING
The TNS method [27]-[30] has been utilized in MPEG-2/4 AAC for attenuating the quantization noise preceding the attack signal known as the pre-echo artifact [18], [19]. As illustrated in Figure 3.1, the quantization noise spreads throughout the entire signal block in the time domain. The TNS module can shape and control the spread of quantization noise to improve audio quality.
Since the TNS in AAC is applied to the MDCT coefficients, which are closely related to the even DCT-IV, we use the theory of spectral AR modeling in the DTT domain to establish the compact form of the TNS in the DTT domain and to explain the “time-domain aliasing noise” [30], an unusual noise that appears around the attack segment. We also examine how this artifact worsens as the TNS filter order increases. Finally, we compare TNS based on the Hilbert-envelope and power-envelope methods.
Figure 3.1. Pre-echo artifact (dashed line: original waveform; solid line: quantization noise).
3.1. TNS Formulation in DTT Domain
TNS aims to shape the temporal envelope of the quantization noise by incorporating an open-loop predictive coding [31] across frequency lines in audio encoders/decoders. In terms of z-transform, the concept of TNS can be explained as follows. As depicted in Figure 3.2, x(k) and d(k) denote the input and the predictive residual signals in the frequency domain in the analysis part, whereas xr(k) and dr(k) denote the reconstructed signals related to x(k) and d(k) in the synthesis part. The relation between the reconstruction error r(k), i.e., x(k) − xr(k), and the quantization noise q(k), i.e., d(k) − dr(k), is expressed in z-transform as
$$R(z) = \frac{Q(z)}{1 - H(z)}, \tag{27}$$
where R(z) and Q(z) are the z-transforms of r(k) and q(k). If the magnitude response of the inverse filter 1/(1 − H(z)), i.e., the inverse of the whitening filter 1 − H(z), approximates the temporal envelope of the frequency-domain input signal x(k), the quantization noise Q(e^{−jω}) (in the time domain) is amplified or attenuated according to that temporal shape. Figure 3.3 illustrates the shaping effect of the TNS applied in the MDCT domain.
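The relation in (27) can be checked numerically. The following is a minimal sketch, assuming zero initial filter states, a hypothetical second-order predictor `a`, and a simple uniform quantizer; none of these values come from the AAC standard. It shows that the reconstruction error r(k) equals the quantization noise q(k) passed through the inverse filter 1/(1 − H(z)).

```python
import numpy as np

def analysis_filter(x, a):
    """Prediction residual d(k) = x(k) - sum_i a[i-1] * x(k-i), i.e. 1 - H(z)."""
    x = np.asarray(x, dtype=float)
    d = x.copy()
    for i, ai in enumerate(a, start=1):
        d[i:] -= ai * x[:-i]
    return d

def synthesis_filter(d, a):
    """Inverse filter 1 / (1 - H(z)): x(k) = d(k) + sum_i a[i-1] * x(k-i)."""
    x = np.zeros(len(d))
    for k in range(len(d)):
        x[k] = d[k] + sum(ai * x[k - i] for i, ai in enumerate(a, start=1) if k >= i)
    return x

rng = np.random.default_rng(0)
a = [0.9, -0.5]                  # hypothetical stable predictor coefficients
x = rng.standard_normal(32)      # stand-in for the frequency-domain input x(k)
d = analysis_filter(x, a)        # residual d(k)
d_r = np.round(d * 4) / 4        # assumed uniform quantizer with step 0.25
q = d - d_r                      # quantization noise q(k) = d(k) - d_r(k)
x_r = synthesis_filter(d_r, a)   # reconstructed signal x_r(k)
r = x - x_r                      # reconstruction error r(k)

# r(k) is q(k) filtered by 1 / (1 - H(z)), as stated by (27)
print(np.allclose(r, synthesis_filter(q, a)))
```

Because both filters are linear, the check holds for any quantizer; only the shape of the noise, not its existence, is controlled by the predictor.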
In [27]-[30], Herre and Johnston have proposed the TNS predictive filter by exploiting the duality between the squared temporal Hilbert envelope and the power spectrum for continuous-time signals. Since, in the literature, there is no derivation for the finite discrete sequences in the DTT domain, this section derives the compact form for the TNS in the DTT domain through the theory of the AR modeling in the DTT domain.
[Figure 3.2 block diagram; blocks: Input Signal, Forward T-F Mapping, Linear Prediction, Quantization & Dequantization, Backward T-F Mapping, Reconstructed Signal]
Figure 3.2. Open-loop predictive coding scheme in TNS
Figure 3.3. TNS effect: (a) original signal in the time domain; (b) decoded signal without TNS; (c) decoded signal with TNS.
3.1.1. Autoregressive Modeling in DTT Domain
AR modeling [53], [64], also known as linear prediction (LP), has been increasingly applied in audio coding. The theoretical foundation for AR modeling of temporal/spectral envelopes with various DTTs has been established in Appendix B. Here, we summarize the critical results related to the TNS formulation.
Throughout this chapter, we consider all transforms as matrices that left-multiply the input sequence represented as a column vector.
3.1.1.1 Generalized Discrete Fourier Transform
The N × N generalized DFT (GDFT) [77] matrix is defined by
$$[G_{a,b}]_{k,n} = \exp\left(-j\,\frac{2\pi (k+a)(n+b)}{N}\right), \quad \text{for } k, n = 0, 1, \ldots, N-1. \tag{28}$$
Four special forms of the GDFT arise when a and b take on the values 0 or 1/2. They are classified and named as follows [76]:
(i) DFT (Discrete Fourier transform): a = 0 and b = 0.
(ii) OTDFT (Odd-Time DFT): a = 0 and b = 1/2.
(iii) OFDFT (Odd-Frequency DFT): a = 1/2 and b = 0.
(iv) O2DFT (Odd-Time Odd-Frequency DFT): a = 1/2 and b = 1/2.
The last three transforms can be regarded as the modified versions of the DFT with a 1/2-sample delay in the time domain and/or a 1/2-sample advance in the frequency domain.
The inverse GDFT (IGDFT) matrix is the scaled Hermitian transpose of the forward GDFT matrix:
$$G_{a,b}^{-1} = \frac{1}{N}\, G_{a,b}^{H} = \frac{1}{N}\, G_{b,a}^{*}, \tag{29}$$
where superscripts (H) and (*) denote the Hermitian transpose and conjugate operations, respectively.
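Definitions (28) and (29) are easy to verify numerically. The sketch below builds the GDFT matrix for the four special cases and checks both equalities in (29); the helper name `gdft_matrix` and the size N = 8 are illustrative choices, not part of the original text.

```python
import numpy as np

def gdft_matrix(N, a, b):
    """[G_{a,b}]_{k,n} = exp(-j*2*pi*(k+a)*(n+b)/N) for k, n = 0..N-1, as in (28)."""
    k = np.arange(N).reshape(-1, 1)
    n = np.arange(N).reshape(1, -1)
    return np.exp(-2j * np.pi * (k + a) * (n + b) / N)

N = 8
cases = {"DFT": (0, 0), "OTDFT": (0, 0.5), "OFDFT": (0.5, 0), "O2DFT": (0.5, 0.5)}
checks = {}
for name, (a, b) in cases.items():
    G = gdft_matrix(N, a, b)
    G_inv = G.conj().T / N                         # (1/N) * Hermitian transpose
    is_inverse = np.allclose(G_inv @ G, np.eye(N))
    # Hermitian transpose of G_{a,b} equals the conjugate of G_{b,a}
    swaps_ab = np.allclose(G.conj().T, gdft_matrix(N, b, a).conj())
    checks[name] = is_inverse and swaps_ab

print(checks)
```

For a = b = 0 the matrix coincides with the ordinary DFT matrix, which gives a quick sanity check against `np.fft.fft`.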
3.1.1.2 Convolution-Multiplication Property of GDFT
The DFT has the convolution-multiplication property that the inverse transformation after entry-wise multiplication gives the same result as the circular convolution of the original sequences. Vernet [78] and Martucci [76] derived such properties for other GDFTs. We summarize the results in matrix form as follows.
Let u = x ⓒ y and w = x ⓢ y denote the circular and skew-circular convolutions of x and y, respectively; then the following hold:
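As an illustration of this kind of property, the sketch below checks two well-known cases numerically: the DFT turns circular convolution into entry-wise multiplication, and the OFDFT (a = 1/2, b = 0) does the same for skew-circular convolution. It is not a reproduction of the full set of results in [76], [78]; the helper names and the direct O(N²) convolution are assumptions made for clarity.

```python
import numpy as np

def gdft_matrix(N, a, b):
    """GDFT matrix as defined in (28)."""
    k = np.arange(N).reshape(-1, 1)
    n = np.arange(N).reshape(1, -1)
    return np.exp(-2j * np.pi * (k + a) * (n + b) / N)

def periodic_convolution(x, y, wrap_sign):
    """wrap_sign = +1: circular convolution; wrap_sign = -1: skew-circular."""
    N = len(x)
    out = np.zeros(N, dtype=complex)
    for n in range(N):
        for m in range(N):
            sign = 1 if m <= n else wrap_sign      # wrapped terms change sign
            out[n] += sign * x[m] * y[(n - m) % N]
    return out

rng = np.random.default_rng(1)
N = 8
x, y = rng.standard_normal(N), rng.standard_normal(N)
u = periodic_convolution(x, y, +1)    # circular
w = periodic_convolution(x, y, -1)    # skew-circular

F = gdft_matrix(N, 0, 0)              # DFT
Fo = gdft_matrix(N, 0.5, 0)           # OFDFT
print(np.allclose(F @ u, (F @ x) * (F @ y)))
print(np.allclose(Fo @ w, (Fo @ x) * (Fo @ y)))
```

The sign flip works because the OFDFT kernel picks up a factor of −1 when the time index advances by N.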
3.1.1.3 Discrete Trigonometric Transform
The family of DTTs comprises eight versions of the discrete cosine transform (DCT) and eight versions of the discrete sine transform (DST). Martucci formulated the DTTs through the convolution forms as defined in [76, Appendix]. The orthogonal-like relations between the inverse and forward DTTs are
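$$T_{I}^{-1} = \frac{1}{M} T_{I}, \quad T_{II}^{-1} = \frac{1}{M} T_{III}, \quad T_{III}^{-1} = \frac{1}{M} T_{II}, \quad \text{and} \quad T_{IV}^{-1} = \frac{1}{M} T_{IV}, \tag{36}$$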
where the DTTs on both sides of each equality must belong to the same category (cosine or sine, even or odd), and M is 2N for the even case and 2N − 1 for the odd case.
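The last relation in (36) can be checked for the even DCT-IV. The sketch below assumes the convolution-form definition with entries 2cos(π(2k+1)(2n+1)/(4N)); this scaling is an assumption about the form used in [76], chosen because it makes the inverse factor come out as 1/M with M = 2N.

```python
import numpy as np

def even_dct4_matrix(N):
    """Assumed convolution-form even DCT-IV: 2*cos(pi*(2k+1)*(2n+1)/(4N))."""
    k = np.arange(N).reshape(-1, 1)
    n = np.arange(N).reshape(1, -1)
    return 2.0 * np.cos(np.pi * (2 * k + 1) * (2 * n + 1) / (4 * N))

N = 8
M = 2 * N                      # even case in (36)
C = even_dct4_matrix(N)
C_inv = C / M                  # T_IV^{-1} = (1/M) T_IV

print(np.allclose(C_inv @ C, np.eye(N)))
```

With an orthonormal scaling, sqrt(2/N) in place of 2, the same matrix would be its own inverse without any factor.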
3.1.1.4 Analytic Transform based on GDFT and IGDFT
Marple proposed a DFT-based method for computing the analytic signal corresponding to a real-valued finite sequence of an even length [79]. We extend the result to the GDFTs as described in the following.
Via each GDFT, we can define the generic form for the analytic transform matrix:
… frequencies. In particular, let x denote a real-valued column vector of length M and let $a = A_q^{+} x$; then the analytic vector a has two important properties. First, the real part of a exactly equals the original vector:
$$\operatorname{Re}(a_n) = x_n, \quad \text{for } n = 0, 1, \ldots, M-1. \tag{39}$$
Second, the real and imaginary parts of a are orthogonal:
$$\operatorname{Re}(a)^{T} \operatorname{Im}(a) = 0.$$
For example, let x = [1, −2, −3, 7, 11]T, then
[Figure 3.4 block diagram; blocks: Real vector, GDFT, Entry Selection, Entry Scaling, Zero Padding, IGDFT, Analytic vector (Analytic Transform)]
Figure 3.4. Reconstruction of analytic transform based on GDFT.
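The pipeline of Figure 3.4 can be sketched for the plain DFT case (a = b = 0) using Marple-style one-sided scaling. This is a minimal illustration under that assumption; the generic GDFT-based analytic transform matrix of this section is not reproduced here, and the function name `analytic_via_dft` is hypothetical. The input is the example vector x = [1, −2, −3, 7, 11]ᵀ, and the two properties stated above are checked.

```python
import numpy as np

def analytic_via_dft(x):
    """Analytic vector of a real sequence: DFT, one-sided scaling, inverse DFT."""
    x = np.asarray(x, dtype=float)
    M = len(x)
    X = np.fft.fft(x)
    h = np.zeros(M)
    h[0] = 1.0                       # keep DC unscaled
    if M % 2 == 0:
        h[M // 2] = 1.0              # keep the Nyquist bin unscaled
        h[1:M // 2] = 2.0            # double the positive frequencies
    else:
        h[1:(M + 1) // 2] = 2.0      # double the positive frequencies
    return np.fft.ifft(h * X)        # negative frequencies stay zero

x = np.array([1.0, -2.0, -3.0, 7.0, 11.0])
a = analytic_via_dft(x)

print(np.allclose(a.real, x))                 # real part equals the input
print(abs(np.dot(a.real, a.imag)) < 1e-9)     # real and imaginary parts orthogonal
```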
Table 3.1. Definitions of Related Matrices for Analytic Transforms
[Table body not recoverable; surviving entries: diagonal matrices ("diag") of order N]
3.1.1.5 DTT and Analytic Transform
The DTT spectra can be interpreted as the GDFT spectra of analytic vectors in the following way. Given a temporal column vector x and its DTT vector y = Tq x, the IGDFT of the zero-padded, scaled DTT vector equals the analytic transform of the symmetrized temporal vector, that is
… Appendix B. The relation in (41) is depicted pictorially in Figure 3.5. Take the even DCT-IV as an instance: for a real-valued column vector x of length N, the specific expression of (41) is given by
⋅
[Figure 3.5 diagram; labels: Temporal vector, DTT, DTT vector, Scaling, Zero Padding, Scaled and zero-padded DTT vector, IGDFT, Analytic vector]
Figure 3.5. A pictorial representation of (41).
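A closely related identity, which is not (41) itself but shows why DTT spectra can be read as GDFT spectra, is that the even DCT-IV of x equals the first N bins of the length-2N O2DFT of the antisymmetrically extended sequence. The sketch below checks this numerically, assuming the 2cos(·) convolution-form DCT-IV scaling; the helper names are illustrative.

```python
import numpy as np

def gdft_matrix(N, a, b):
    """GDFT matrix as defined in (28)."""
    k = np.arange(N).reshape(-1, 1)
    n = np.arange(N).reshape(1, -1)
    return np.exp(-2j * np.pi * (k + a) * (n + b) / N)

def even_dct4_matrix(N):
    """Assumed convolution-form even DCT-IV: 2*cos(pi*(2k+1)*(2n+1)/(4N))."""
    k = np.arange(N).reshape(-1, 1)
    n = np.arange(N).reshape(1, -1)
    return 2.0 * np.cos(np.pi * (2 * k + 1) * (2 * n + 1) / (4 * N))

rng = np.random.default_rng(2)
N = 8
x = rng.standard_normal(N)
x_ext = np.concatenate([x, -x[::-1]])               # antisymmetric extension, length 2N
spectrum = gdft_matrix(2 * N, 0.5, 0.5) @ x_ext     # O2DFT of length 2N
dct4 = even_dct4_matrix(N) @ x                      # even DCT-IV of length N

print(np.allclose(spectrum[:N], dct4))              # first N bins are real and match
```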
3.1.1.6 Autocorrelation and Temporal Envelope
The circular and skew-circular autocorrelations of a vector x of length N are defined as
⋅
Just like the time-frequency duality between circular autocorrelations and DFT power spectra, we can have dualities between the GDFT-domain circular or skew-circular autocorrelations and the temporal (IGDFT-domain) envelopes as follows.
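The two dualities can be illustrated numerically. The sketch below assumes one common convention for the autocorrelations (lag applied to the first factor, conjugate on the second, sign flip on wrapped terms for the skew-circular case) and an explicit factor N; the definitions in this section may differ in normalization or conjugation. It checks the circular case against the IDFT and the skew-circular case against the inverse OTDFT.

```python
import numpy as np

def gdft_matrix(N, a, b):
    """GDFT matrix as defined in (28)."""
    k = np.arange(N).reshape(-1, 1)
    n = np.arange(N).reshape(1, -1)
    return np.exp(-2j * np.pi * (k + a) * (n + b) / N)

def autocorrelation(y, wrap_sign):
    """r[m] = sum_k s * y[(k+m) mod N] * conj(y[k]); s = wrap_sign when k+m wraps."""
    N = len(y)
    r = np.zeros(N, dtype=complex)
    for m in range(N):
        for k in range(N):
            s = 1 if k + m < N else wrap_sign
            r[m] += s * y[(k + m) % N] * np.conj(y[k])
    return r

rng = np.random.default_rng(5)
N = 8
y = rng.standard_normal(N)                       # stand-in spectral vector
idft = gdft_matrix(N, 0, 0).conj().T / N         # IDFT, from (29)
iotdft = gdft_matrix(N, 0, 0.5).conj().T / N     # inverse OTDFT, from (29)

env_circ = N * np.abs(idft @ y) ** 2             # squared temporal magnitude via IDFT
env_skew = N * np.abs(iotdft @ y) ** 2           # squared temporal magnitude via IOTDFT

print(np.allclose(idft @ autocorrelation(y, +1), env_circ))
print(np.allclose(iotdft @ autocorrelation(y, -1), env_skew))
```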
Consider a column vector y of length N.
(i) The relation between its skew-circular autocorrelation and IOTDFT/IO2DFT power spectra is given by
]
(ii) The relation between its circular autocorrelation and IDFT/IOFDFT power spectra is given by
By substituting (41) into (45) and (46), we immediately obtain two dualities between the DTT-domain circular or skew-circular autocorrelation and the temporal envelopes. In the following, the two dualities are expressed in generic form, and the specific types of transforms and autocorrelations are defined in Table B.3 in Appendix B.1.6.4.
Given a temporal vector x and its DTT vector y = Tq x.
For example, given a real-valued column vector x of length N, the two dualities for even DCT-IV are expressed below.
(i) Let I C x …
… Appendix B, we also confirm that the traditional Yule-Walker equations can be solved to yield the AR parameters in the GDFT AR modeling problem.
3.1.2. Evaluation and Representation of Whitening Filter
Let x denote the data vector and y = Tq x. According to (47) and (48), we can use the squared Hilbert envelope or the power envelope for shaping the reconstruction noise. As defined in the two dualities, $r_{\hat{y}}$ and $r_{\tilde{y}}$ consist of the circular or skew-circular autocorrelations of $\hat{y}$ and $\tilde{y}$, respectively, where
$$\hat{y} = [\hat{y}(0), \hat{y}(1), \ldots, \hat{y}(M-1)]^{T} = Z_q W_{q'}^{+}\, y \quad \text{and} \quad \tilde{y} = [\tilde{y}(0), \tilde{y}(1), \ldots, \tilde{y}(M-1)]^{T} = E_{q'}\, y.$$
Subsequently, the parameters of the whitening filter are obtained by solving the Yule-Walker equations. Since the relations in the two dualities are based on length M instead of N, we assume that the whitening filter is applied to yˆ covering y in our derivation.
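For reference, solving the Yule-Walker equations from a given autocorrelation sequence can be sketched as follows. This is a minimal version that assumes a real-valued autocorrelation and uses a direct linear solve rather than the Levinson-Durbin recursion; the function name and the first-order test sequence are illustrative.

```python
import numpy as np

def yule_walker(r, order):
    """Solve R a = [r[1], ..., r[order]] with R[i, j] = r[|i - j|] (real r assumed)."""
    r = np.asarray(r, dtype=float)
    idx = np.abs(np.subtract.outer(np.arange(order), np.arange(order)))
    R = r[idx]                                   # symmetric Toeplitz matrix
    a = np.linalg.solve(R, r[1:order + 1])       # predictor coefficients
    err = r[0] - np.dot(a, r[1:order + 1])       # prediction error power
    return a, err

rho = 0.8
r = rho ** np.arange(4)      # autocorrelation of a first-order AR model
a, err = yule_walker(r, 3)
print(a, err)                # coefficients close to [0.8, 0, 0], error power 0.36
```

The whitening filter then has taps [1, −a₁, …, −a_P], which is the filter represented by H in the following discussion.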
The whitening filter can be represented as a circulant or skew-circulant matrix [32] in the case of the circular or skew-circular convolution. By taking conjugate of both sides of
Hence, the matrix representation H for the whitening filter can be diagonalized by GDFTs as
−1 convolution type.
In the MPEG standard [2], [3], the TNS predictive error filter is applied through linear convolution (filtering) in the transform domain. In matrix form, the linear convolution L, which is lower triangular, is the same as the periodic convolution H except for the upper triangular entries. Thus, by padding the input data with suitable zeros, the periodic convolution equals the linear convolution. However, to reconstruct y, all M residuals must be transmitted to the decoder to perform the periodic deconvolution H^{-1}. In contrast, only the residuals corresponding to y are required for the linear deconvolution L^{-1}, because it is still lower triangular. Interestingly, if L^{-1}u = v and v_n = 0 for M − P ≤ n ≤ M − 1, then H^{-1}u = H^{-1}(Lv) = H^{-1}(Hv) = v. Hence, H and H^{-1} are equivalent to L and L^{-1} on ŷ and the related residuals, respectively, and thus we can develop the TNS formulation on ŷ in the periodic convolution/deconvolution manner.
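The equivalence argued above can be checked with small matrices. The sketch below assumes a hypothetical order-3 whitening filter and the skew-circular case; the circular case works the same way with the sign flip removed. It builds L and H, zeroes the last P entries of v, and confirms that both convolutions and both deconvolutions agree.

```python
import numpy as np

def convolution_matrices(h, M):
    """Lower-triangular (linear) and skew-circulant matrices for filter taps h."""
    L = np.zeros((M, M))
    H = np.zeros((M, M))
    for n in range(M):
        for i, hi in enumerate(h):
            m = n - i
            if m >= 0:
                L[n, m] = hi
                H[n, m] = hi
            else:
                H[n, m + M] = -hi    # wrapped tap; sign flips in the skew-circular case
    return L, H

P, M = 3, 16
h = np.array([1.0, -0.9, 0.5, -0.2])   # hypothetical whitening filter taps, order P
L, H = convolution_matrices(h, M)

rng = np.random.default_rng(3)
v = rng.standard_normal(M)
v[M - P:] = 0.0                         # last P samples are zero padding
u = L @ v

print(np.allclose(H @ v, u))                        # Hv = Lv
print(np.allclose(np.linalg.solve(H, u), v))        # H^{-1}u = v
print(np.allclose(np.linalg.solve(L, u), v))        # L^{-1}u = v
```

H and L differ only in the top-right corner, and those entries multiply exactly the last P samples of v, which is why the zero padding makes them interchangeable.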
3.1.3. Formulation of TNS
We now establish the formulation of the shaping effect of TNS. First, the dequantized residual $d_r$ is given by
$$d_r = d + \epsilon = H\hat{y} + \epsilon, \tag{51}$$
where d is the original residual and $\epsilon$ is the additive quantization noise. After deconvolution, the reconstructed spectral sequence $\hat{y}_r$ is given by
$$\hat{y}_r = H^{-1} d_r = H^{-1}(H\hat{y} + \epsilon) = \hat{y} + H^{-1}\epsilon. \tag{52}$$
In other words, the quantization noise is shaped by the periodic deconvolution $H^{-1}$ in the transform domain. Notice that only the part of d corresponding to y is quantized and transmitted from the encoder to the decoder. If the zero-padded part is perfectly reconstructed, the reconstruction noise exists only for the non-zero-padded samples of $\hat{y}$. Thus we can confirm the equivalence of $H^{-1}$ and $L^{-1}$ on $\epsilon$ to have
$$H^{-1}\epsilon = L^{-1}\epsilon = Z_q n,$$
where $Z_q$ is the zero-padding matrix corresponding to $T_q$, and n denotes the reconstruction noise related to $W_{q'}^{+} y$. This implies that some quantization noise should be “virtually” imposed on the P samples after y to correct the noise propagation in the open-loop prediction.
To check the temporal shaping effect, $T_q^{-1}$ is applied to the part of $\hat{y}_r$ related to y, i.e., $(W_{q'}^{+})^{-1} Z_q^{T} \hat{y}_r$, to yield the reconstructed temporal sequence $x_r$, where $(W_{q'}^{+})^{-1}$ is multiplied to remove the scaling of $W_{q'}^{+}$ on $\hat{y}$. Before formulating $x_r$, we consider another relation between the IDTT and IGDFT as follows. For an arbitrary vector z, (41) can be
Thus, by the property that the real part of the analytic transform exactly equals the original sequence, for an arbitrary data vector z, we have
} sequence is given by
ˆ }
Substituting (52) into (55) leads to
}. 0,1,…, M − 1. Hence, results in the temporal shaping effect. Furthermore, due to , the imaginary part of Fq−1 is also involved in the reconstruction noise.
Figure 3.6 illustrates a TNS analysis result based on order-12 AR modeling on the even DCT-IV coefficients of 64 audio samples at 8 kHz sampling rate. As shown in Figure 3.6 (c),
although only 64 quantization noise samples are applied to the 64 residual samples transmitted, the 12 “virtual” quantization noises indexed from 64 to 75 occur when analyzed with the skew-circular convolution. In Figure 3.6 (d), the original time-domain samples and the reconstructed noise are depicted to show the shaping effect. Also notice that the TNS processing is applied to a data segment of length 64 but is analyzed in the O2DFT domain of length 128. Because of symmetry, only one side is shown in this illustration.
Figure 3.6. TNS analysis. (a) The even DCT-IV coefficients of an audio segment of 64 samples at 8 kHz. (b) The predictive residuals by the order-12 whitening filter corresponding to temporal Hilbert envelope. (c) The quantization noise on the residuals indexed 0~63, and the virtual quantization noise indexed 64~75. (d) The original time-domain samples and the reconstruction time-domain noise.
3.2. Artifacts in TNS
It is known that the lapping operation of the MDCT creates time-domain aliasing and results in undesired TNS shaping at silent or weak-energy segments [27]. In this section, we explain the phenomenon through the relation between the MDCT and DCT-IV together with the fundamentals of AR modeling in DTTs.
3.2.1. Time-Domain Aliasing Noise
The N × 2N MDCT matrix M can be factorized into the even DCT-IV matrix and a time-domain aliasing matrix [26]:
$$M = C_{IV}^{e} \cdot A, \tag{59}$$
where M is the N × 2N MDCT matrix, $C_{IV}^{e}$ is the N × N even DCT-IV matrix, and A is the N × 2N time-domain aliasing matrix defined as
−
where IN/2 is the identity matrix and JN/2 is the reversal matrix. The factorization of the MDCT matrix is depicted pictorially in Figure 3.7. Consequently, the MDCT of a finite sequence of length 2N is equal to the even DCT-IV of the aliased sequence of length N. According to the time-domain aliasing cancellation (TDAC) principle [74], the aliasing effect can be perfectly removed by the overlap-and-add operation, which makes the MDCT especially attractive in audio coding for the blocking effect reduction. However, the time-domain aliasing operation of MDCT brings the “time-domain aliasing noise” artifact in TNS.
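The factorization (59) can be verified numerically. The sketch below assumes the common textbook MDCT kernel cos(π/N (n + 1/2 + N/2)(k + 1/2)), an unscaled DCT-IV, and the folding matrix A = [[0, 0, −J, −I], [I, −J, 0, 0]] that follows from that kernel. These are stated assumptions; the exact scaling and sign conventions of this chapter may differ, but a common scale factor does not affect the factorization.

```python
import numpy as np

def mdct_matrix(N):
    """Assumed textbook MDCT: cos(pi/N * (n + 1/2 + N/2) * (k + 1/2)), size N x 2N."""
    k = np.arange(N).reshape(-1, 1)
    n = np.arange(2 * N).reshape(1, -1)
    return np.cos(np.pi / N * (n + 0.5 + N / 2) * (k + 0.5))

def dct4_matrix(N):
    """Unscaled DCT-IV: cos(pi/N * (n + 1/2) * (k + 1/2)), size N x N."""
    k = np.arange(N).reshape(-1, 1)
    n = np.arange(N).reshape(1, -1)
    return np.cos(np.pi / N * (n + 0.5) * (k + 0.5))

def aliasing_matrix(N):
    """Assumed folding matrix A = [[0, 0, -J, -I], [I, -J, 0, 0]], N/2 x N/2 blocks."""
    I = np.eye(N // 2)
    J = I[::-1]                                  # reversal matrix
    Z = np.zeros((N // 2, N // 2))
    return np.block([[Z, Z, -J, -I], [I, -J, Z, Z]])

N = 8
M, C, A = mdct_matrix(N), dct4_matrix(N), aliasing_matrix(N)
print(np.allclose(M, C @ A))     # MDCT equals DCT-IV of the aliased (folded) block
```

In words, A folds the 2N input samples into N samples by adding reversed and sign-flipped quarters, and the DCT-IV is then applied to the folded block.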
According to (47), when the linear predictive parameters are estimated in the DCT domain, the spectral magnitude response of the corresponding inverse filter should fit the temporal Hilbert envelope of the time-domain original signal. Equation (59) implies that the predictor evaluated from the MDCT of an audio segment is equal to that evaluated from the DCT-IV of the aliased one. More specifically, the duality formula (47) in this situation is given as follows.
Let ˆ 2I C (Axˆ) 0
y I N IVe
N N
N ⋅ ⋅
=
×
, then
−
= + − +
* ,
ˆ 0 ( ˆ) ( ˆ)
2
1 Ax
J A I x
J A A I G ryS
N e N IV N
e N
IV ,
where $\hat{x}$ denotes a windowed input signal (e.g., when the sine window is applied, $\hat{x} = \operatorname{diag}\{\sin[\pi(n + 1/2)/(2N)] \mid n = 0, 1, 2, \ldots, 2N-1\} \cdot x$). Thus, rather than the original temporal Hilbert envelope, the inverse filter evaluated in the MDCT domain shapes the time-domain quantization noise according to the temporal Hilbert envelope of the aliased time-domain signal. Consequently, as illustrated in Figure 3.8, artificial pre/post-aliasing artifacts are introduced by the time-domain aliasing operation of the MDCT. The aliasing noise may occur at perceptually sensitive positions (e.g., silent segments) and degrade the audio quality. Figure 3.9 and Figure 3.10 illustrate the pre-aliasing and post-aliasing artifacts, respectively.
[Figure 3.7 diagram: M = C_IV · A]
Figure 3.7. MDCT factorization. Identity and reversal matrices are represented by diagonal and anti-diagonal lines and row vectors are represented by horizontal lines.
[Figure 3.8 panels (a)-(d): amplitude from −1 to 1 versus time index]
Figure 3.8. The time-domain aliasing of MDCT: (a) the input signal and the analysis sine window; (b) the post-aliasing signal corresponding to (a); (c) the input signal and the analysis sine window; (d) the pre-aliasing signal corresponding to (c).
Figure 3.9. TNS pre-aliasing artifact: (a) original signal in time domain; (b) decoded signal without TNS; (c) decoded signal with TNS.
Figure 3.10. TNS post-aliasing artifact: (a) original signal in time domain; (b) decoded signal without TNS; (c) decoded signal with TNS.
3.2.2. Aliasing Noise by High-Order TNS
The accuracy of AR modeling generally rises with increasing predictive orders. This implies that the spectral magnitude response of the evaluated inverse filter should fit more and more accurately the temporal envelope of the original time-domain signal when the predictive order increases. For attack signals, the predictor shapes the abrupt changes of temporal attacks.
Owing to the time-domain aliasing nature of the MDCT described above, the pre-aliasing or post-aliasing artifacts worsen as the TNS order increases, because the shaping becomes more abrupt. For instance, comparing Figure 3.11 (c) with (b) shows that the TNS of order 12 concentrates the quantization noise within the attack position but worsens the pre-aliasing artifact. Hence, the predictive order cannot be decided purely on complexity or coding gain.
Figure 3.11. Deterioration of TNS aliasing artifact with high TNS orders: (a) original time-domain signal; (b) reconstruction noise with order-3 TNS; (c) reconstruction noise with order-12 TNS.
3.2.3. Artifacts Reducing Method
By applying the property that the IMDCT (inverse MDCT) matrix is the scaled transpose of the MDCT matrix to (59), we have
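$$\tilde{M} = \frac{1}{N} M^{T} = \frac{1}{N} A^{T} \cdot C_{IV}^{e} = \frac{1}{N} \begin{bmatrix} 0 & I_{N/2} \\ 0 & -J_{N/2} \\ -J_{N/2} & 0 \\ -I_{N/2} & 0 \end{bmatrix} \cdot C_{IV}^{e}, \tag{61}$$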
where M~ is the 2N × N IMDCT matrix. The factorization of the IMDCT matrix is depicted pictorially in Figure 3.12. Equation (61) specifies the symmetric structure of the IMDCT output and implies that the shaped quantization noise by TNS must have the same symmetric structure after the IMDCT conversion. Unlike the aliased original signal, the shaped quantization noise cannot be perfectly cancelled by the overlap-and-add operation.
Accordingly, time-domain aliasing noise always appears symmetrically alongside the shaped noise concentrated at an attack. This means that the aliasing artifact cannot be avoided through the TNS filter design.
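The symmetric structure imposed by the IMDCT can be seen numerically. The sketch below assumes the same textbook conventions as the usual MDCT folding (unscaled DCT-IV and A = [[0, 0, −J, −I], [I, −J, 0, 0]]) and feeds an arbitrary spectrum, standing in for shaped quantization noise, through (1/N) Aᵀ C. Under these assumptions the first half of the output is odd-symmetric and the second half is even-symmetric, regardless of the spectrum.

```python
import numpy as np

def dct4_matrix(N):
    """Unscaled DCT-IV: cos(pi/N * (n + 1/2) * (k + 1/2)), size N x N."""
    k = np.arange(N).reshape(-1, 1)
    n = np.arange(N).reshape(1, -1)
    return np.cos(np.pi / N * (n + 0.5) * (k + 0.5))

def aliasing_matrix(N):
    """Assumed folding matrix A = [[0, 0, -J, -I], [I, -J, 0, 0]], N/2 x N/2 blocks."""
    I = np.eye(N // 2)
    J = I[::-1]
    Z = np.zeros((N // 2, N // 2))
    return np.block([[Z, Z, -J, -I], [I, -J, Z, Z]])

N = 8
rng = np.random.default_rng(4)
noise_spectrum = rng.standard_normal(N)      # stand-in for shaped quantization noise
t = aliasing_matrix(N).T @ dct4_matrix(N) @ noise_spectrum / N   # IMDCT output, length 2N

first, second = t[:N], t[N:]
print(np.allclose(first, -first[::-1]))      # first half: odd symmetry
print(np.allclose(second, second[::-1]))     # second half: even symmetry
```

Any noise energy placed at an attack is therefore mirrored within the same half block, which is the aliasing noise described above.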
In audio coding, window switching [1]-[3] is another mechanism for handling attack signals, where the start and stop windows are used in the transition between a long window and a short window. In [33], a method is proposed to detect attacks and to apply the start and stop windows in AAC to attenuate the aliasing noise (see Figure 3.13). Figure 3.14 illustrates the effect of the stop window. As shown in Figure 3.14 (d), the aliasing term of the original signal can be removed through the windowing operation instead of the overlap-and-add operation. In the same way, the aliasing noise can be eliminated. A similar concept is adopted in MPEG-4 Low Delay AAC, where a window that exhibits only a small overlap between subsequent frames is provided to minimize the time-domain aliasing noise [34].
Figure 3.15 provides an example that compares the waveforms and spectrograms of several signals: the original signal, the decoded signal without TNS, the decoded signals with TNS of order 3 and order 12, and the decoded signal with TNS of order 12 and the artifacts reducing method. A comparison of Figure 3.15 (h) and (i) shows that stronger noise concentrated in the aliasing segment arises in the case of TNS order 12. On the other hand, in Figure 3.15 (j), the time-domain aliasing noise of the decoded signal with TNS order 12 is eliminated by the artifacts reducing method.
[Figure 3.12 diagram: N · M̃ = Aᵀ · C_IV]
Figure 3.12. IMDCT factorization. Identity and reversal matrices are represented by diagonal and anti-diagonal lines and row vectors are represented by horizontal lines.
[Figure 3.13 plot: long window, start window and stop window over sample indices 0 to 4095; amplitude axis from −1 to 1.5]
Figure 3.13. Artifact reducing method for TNS time-domain aliasing by the start and stop windows.
Figure 3.14. The effect of the stop window: (a) the input signal and the analysis stop window;
(b) the windowed output; (c) output of IMDCT; (d) final output behind the synthesis stop window.
Figure 3.15. Effect of different TNS orders on the TNS artifact: (a) the original waveform; (b) the waveform without TNS; (c) the waveform with TNS order 3; (d) the waveform with TNS order 12; (e) the waveform from the artifacts reducing method for the TNS with order 12; (f)-(j) the spectrograms corresponding to (a)-(e), respectively.
3.2.4. TNS by Hilbert Envelope and Power Envelope
Figure 3.16 illustrates the noise shaping effect of the Hilbert-envelope method and the power-envelope method, where the two order-12 AR modeling methods are applied to a transient audio segment of 2048 samples at 44.1 kHz. The inverted magnitude responses of