
CHAPTER 3 ARTIFACTS IN TEMPORAL NOISE SHAPING

3.2. Artifacts in TNS

3.2.3. Artifacts Reducing Method


Figure 3.11. Deterioration of TNS aliasing artifact with high TNS orders: (a) original time-domain signal; (b) reconstruction noise with order-3 TNS; (c) reconstruction noise with order-12 TNS.


By applying the property that the IMDCT (inverse MDCT) matrix is the scaled transpose of the MDCT matrix to (59), we have

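$$
\tilde{\mathbf{M}} = \frac{1}{N}\mathbf{M}^{T} = \frac{1}{N}\mathbf{A}^{T}\mathbf{C}_N^{IV} = \frac{1}{N}\begin{bmatrix} \mathbf{0} & \mathbf{I}_{N/2} \\ \mathbf{0} & -\mathbf{J}_{N/2} \\ -\mathbf{J}_{N/2} & \mathbf{0} \\ -\mathbf{I}_{N/2} & \mathbf{0} \end{bmatrix}\mathbf{C}_N^{IV}, \tag{61}
$$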

where $\tilde{\mathbf{M}}$ is the 2N × N IMDCT matrix. The factorization of the IMDCT matrix is depicted pictorially in Figure 3.12. Equation (61) specifies the symmetric structure of the IMDCT output and implies that the quantization noise shaped by TNS must have the same symmetric structure after the IMDCT. Unlike the aliasing of the original signal, the aliasing of the shaped quantization noise cannot be perfectly cancelled by the overlap-and-add operation.

Accordingly, the time-domain aliasing noise always appears as a mirror image of the shaped noise concentrated around an attack. This means that the aliasing artifact cannot be avoided through the TNS filter design.
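The symmetric structure of the IMDCT output can be checked numerically. The following is a minimal sketch, assuming the common textbook MDCT definition (the exact form of (59) is not restated here, so the matrix below is an assumption); the names `mdct_matrix`, `noise`, and `y` are illustrative only. It feeds an arbitrary coefficient vector, standing in for shaped quantization noise, through the scaled transpose of the MDCT matrix.

```python
import numpy as np

def mdct_matrix(N):
    """N x 2N MDCT matrix (common textbook definition; assumed here)."""
    k = np.arange(N)[:, None]
    n = np.arange(2 * N)[None, :]
    return np.cos(np.pi / N * (n + 0.5 + N / 2) * (k + 0.5))

N = 8
M = mdct_matrix(N)
M_inv = M.T / N  # 2N x N IMDCT matrix: scaled transpose of M

rng = np.random.default_rng(0)
noise = rng.standard_normal(N)  # stand-in for shaped quantization noise
y = M_inv @ noise               # time-domain noise before overlap-add

# With this definition the first half is odd-symmetric and the
# second half is even-symmetric, whatever the coefficients are.
first, second = y[:N], y[N:]
odd_symmetric = np.allclose(first, -first[::-1])
even_symmetric = np.allclose(second, second[::-1])
```

Because the symmetry holds for any coefficient vector, noise that is shaped toward an attack on one side of a half-block reappears mirrored on the other side, which is the aliasing noise discussed above.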

In audio coding, window switching [1]-[3] is another mechanism for handling attack signals, where the start and stop windows are used in the transition between a long window and a short window. In [33], a method is proposed to detect attacks and to apply the start and stop windows in AAC to attenuate the aliasing noise (see Figure 3.13). Figure 3.14 illustrates the effect of the stop window. As shown in Figure 3.14 (d), the aliasing term of the original signal can be removed through the windowing operation instead of the overlap-and-add operation. The aliasing noise can be eliminated in the same way. A similar concept is adopted in MPEG-4 Low Delay AAC, which provides a window with only a small overlap between subsequent frames to minimize the time-domain aliasing noise [34]. Figure 3.15 compares the waveforms and spectrograms of several signals: the original signal, the decoded signal without TNS, the decoded signals with TNS of order 3 and order 12, and the decoded signal with TNS of order 12 and the artifacts reducing method. A comparison of Figure 3.15 (h) and (i) shows that stronger noise concentrated in the aliasing segment arises with TNS order 12. In Figure 3.15 (j), by contrast, the time-domain aliasing noise of the decoded signal with TNS order 12 is eliminated by the artifacts reducing method.


Figure 3.12. IMDCT factorization. Identity and reversal matrices are represented by diagonal and anti-diagonal lines and row vectors are represented by horizontal lines.

Figure 3.13. Artifact reducing method for TNS time-domain aliasing by the start and stop windows. The plotted window sequence is a long window, a start window, and a stop window over samples 0 to 4095.

Figure 3.14. The effect of the stop window: (a) the input signal and the analysis stop window; (b) the windowed output; (c) output of IMDCT; (d) final output after the synthesis stop window.

Figure 3.15. TNS artifacts for different TNS orders: (a) the original waveform; (b) the waveform without TNS; (c) the waveform with TNS order 3; (d) the waveform with TNS order 12; (e) the waveform from the artifacts reducing method for TNS with order 12; and (f)-(j) the spectrograms corresponding to (a)-(e), respectively.

3.2.4. TNS by Hilbert Envelope and Power Envelope

Figure 3.16 illustrates the noise shaping effect of the Hilbert-envelope method and the power-envelope method, where the two order-12 AR modeling methods are applied to a transient audio segment of 2048 samples at 44.1 kHz. The inverted magnitude responses of the two skew-circular predictors corresponding to the Hilbert and power envelopes are aligned in energy and depicted in Figure 3.16 (c). The quantization noise on the residuals is simulated by a white random sequence shown in Figure 3.16 (d). The temporal noise reconstructed by each of the two predictors is shown in Figure 3.16 (e) and (f). As shown in Figure 3.16 (c), the magnitude response of the predictor corresponding to the power envelope is sharper than that corresponding to the Hilbert envelope in the silent segment. Therefore, the pre-echo artifact in Figure 3.16 (f) is attenuated more strongly than that in Figure 3.16 (e).

The major difference between the two methods comes from the envelope estimation of the low-frequency tones. The Hilbert-envelope method can avoid the smoothing effect by excluding the low-frequency lines from the calculation of the filter coefficients while still applying the noise shaping to all the frequency lines, which achieves an effect similar to that of the power-envelope method.

Figure 3.16. Comparison of TNS effect by the order-12 predictors corresponding to the Hilbert and power envelopes. (a) A transient audio segment of 2048 samples at 44.1 kHz. (b) The even DCT-IV coefficients. (c) The energy-aligned inverted magnitude responses of the two skew-circular predictors corresponding to the Hilbert and power envelopes. (d) The simulated quantization noise. (e) The reconstruction temporal noises by the predictor corresponding to the Hilbert envelope. (f) The reconstruction temporal noises by the predictor corresponding to the power envelope.

3.3. Concluding Remarks

In this chapter, the compact form of TNS has been established for 16 DTTs through the spectral AR modeling theory of finite discrete signals. Using the compact form, the well-known “time-domain aliasing noise” artifact associated with TNS in the MDCT domain has been explained analytically. The time-domain aliasing noise worsens as the TNS prediction order increases. A method combining TNS and window switching has been proposed to reduce this artifact. We also compared the TNS effects of the Hilbert envelope and the power envelope.

CHAPTER 4 ARTIFACTS IN SPECTRAL BAND REPLICATION

In contrast to traditional transform or subband coding methods such as AAC and MP3, SBR exploits the similarity between low frequency (LF) and high frequency (HF) spectra to reconstruct high bands by replicating low bands. This efficient HF coding method introduces several new types of artifacts.

4.1. SBR Overview

SBR is a bandwidth extension, or high frequency reconstruction, technique that can be combined with any core audio coder such as AAC or MP3. SBR reconstructs high bands by transposing and adjusting the replicated low bands, relying on the strong correlation of spectral harmonic characteristics. Only a small amount of side information, including spectral envelope data and control parameters for additional means such as inverse filtering and noise/sinusoidal addition, is transmitted from the encoder to the decoder to guide the HF reconstruction. Since SBR requires a significantly lower bit rate for high bands and reduces the bandwidth of the underlying core coder, the core encoder can spend most of the available bits on the LF part to achieve high coding efficiency.

As depicted in Figure 4.1, in addition to the analysis/synthesis filterbank, SBR decoding has three major procedures. In the HF generator, the low bands split from the decoded LF signal are first transposed to HF. Subsequently, to control tonality, inverse filtering is applied to the regenerated high bands to suppress the undesired sinusoidal components carried over from the low bands. The inverse filtering is performed by in-band filtering using an adaptive spectral whitening filter. The second-order covariance method is employed to estimate the whitening filters on the low bands. Furthermore, a chirp factor given in the bitstream controls the amount of inverse filtering by moving the two zeros of the LP filter toward the origin. The regenerated high band $x_k(n)$ for QMF subband k and time slot n is defined as:

$$
x_k(n) = x_l(n) - a_l(1)\cdot c_k\cdot x_l(n-1) - a_l(2)\cdot c_k^{2}\cdot x_l(n-2), \tag{62}
$$

where $a_l(1)$ and $a_l(2)$ are the predictive coefficients estimated on the low band $x_l(n)$, and $c_k$ is the chirp factor whose range is between 0 and 0.98. In the envelope adjuster, the envelope of the regenerated high bands is scaled according to the transmitted envelope information, which is represented by the average energies in time-frequency (T-F) grids (explained below). Subsequently, additional tones and random noise are added as compensation to adjust the tonality of the reconstructed high bands. Finally, all low and high bands are synthesized to generate a full-bandwidth decoded signal.
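The chirp-controlled inverse filtering of (62) can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `inverse_filter`, the zero initial state, and the numeric values are hypothetical and are not taken from the SBR standard.

```python
import numpy as np

def inverse_filter(x_l, a1, a2, chirp):
    """Apply the second-order whitening filter of (62) to one subband.

    x_l    : complex QMF samples of the source low band
    a1, a2 : predictive coefficients estimated on x_l
    chirp  : chirp factor c_k in [0, 0.98]; 0 disables the filtering
    Samples before the start of x_l are assumed to be zero.
    """
    x_l = np.asarray(x_l, dtype=complex)
    x1 = np.concatenate(([0], x_l[:-1]))      # x_l(n-1)
    x2 = np.concatenate(([0, 0], x_l[:-2]))   # x_l(n-2)
    return x_l - a1 * chirp * x1 - a2 * chirp**2 * x2

x_l = np.array([1.0, 2.0, 3.0, 4.0])          # hypothetical subband samples
y_half = inverse_filter(x_l, a1=0.5, a2=0.25, chirp=0.5)
y_off = inverse_filter(x_l, a1=0.5, a2=0.25, chirp=0.0)
```

With a chirp factor of 0 the subband passes through unchanged, and larger chirp factors apply more of the whitening filter, which matches the role of the chirp factor described above.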

Figure 4.2 illustrates the reconstruction procedures of SBR in the HE-AAC decoder. In the HF generator, the low QMF bands analyzed from the decoded LF AAC signal are replicated to HF and then inverse filtered (see Figure 4.2 (c)). In Figure 4.2 (d), the envelope of the replicated bands is adjusted, and tone and noise compensation is applied to adjust the tonality of the reconstructed signal.

Figure 4.1. The block diagram of the SBR decoder.


Figure 4.2. HF reconstruction process of SBR: (a) original spectrum; (b) decoded AAC LF spectrum; (c) HF generation by SBR; (d) HF adjustment by SBR.

The T-F grid for recording energy data (see Figure 4.3) is formed by the “time borders” and the frequency band borders indicated in the “high/low resolution frequency band tables” [5]. The T-F grid determines the resolution of the data record units in the time and frequency dimensions. In the same way, the “noise-floor frequency table” and the “limiter frequency table” define the frequency resolution for noise compensation and scaling-gain limitation, respectively. All the tables are constructed from the “master frequency band table”, which can vary with the spectral content. The choice of the T-F grid is one of the most critical design issues of the SBR encoder [61]. More details about the SBR algorithm can be found in [5]-[9].

Figure 4.3. An instance of the T-F grid in SBR [7].

4.2. Tone Trembling Artifact

SBR aims to reconstruct high bands by replicating low bands. The “patching algorithm” [5] defined in the SBR syntax determines the correspondence between the replicated low bands and the original high bands. The patching algorithm depends on three factors: the master frequency band table and the start and stop boundaries of the SBR range.

SBR permits the frequency band tables to vary so that the frequency resolution of the encoding adapts to the spectral envelope. Furthermore, the SBR range can vary with the encoding difficulty of the LF part. However, such a flexible design, which switches tables or adjusts the SBR range to control the overall quality, generates time-varying LF replication sources and thus leads to spectral discontinuities in the regenerated subbands. As illustrated in Figure 4.4, in the present frame the 8th low band is replicated to the high band according to the patching algorithm, while the replication source can change to the 10th low band in the next frame.

For noise-like signals, the resulting discontinuity in the reconstructed spectra is generally small, and human hearing is insensitive to the artifact. For tonal signals, however, human hearing is very sensitive to it. To highlight this problem, Figure 4.5 provides an artificial example with frequently varying tables. The “billow-like” spectrogram originates from the replicated LF tones. This artifact sounds “trembling” and hence is named the “tone trembling” artifact. To model the artifact analytically, each replicated LF tone can be represented as

$$
s(n) = A(n)\exp\bigl(i(\omega(n)\cdot n + \theta)\bigr), \tag{63}
$$

where $A(n)$ denotes the amplitude, which will be scaled by energy adjustment; $\omega(n)$ denotes the frequency; and $\theta$ denotes the phase. When the patching relation changes, $\omega(n)$ also changes with the frequency location. Hence, the replicated tone can be regarded as a frequency modulated signal, which makes the trembling artifact easy to visualize.
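A small simulation can make the frequency modulation view of (63) concrete. The sketch below assumes a constant amplitude and a tone frequency that switches once per frame; the frame length and the frequency values are hypothetical and chosen only for illustration.

```python
import numpy as np

frame_len = 64                       # hypothetical frame length in samples
frame_omegas = [0.3, 0.5, 0.3]       # hypothetical tone frequency per frame (rad/sample)
A, theta = 1.0, 0.0

omega_n = np.repeat(frame_omegas, frame_len)   # piecewise-constant omega(n)
n = np.arange(omega_n.size)
s = A * np.exp(1j * (omega_n * n + theta))     # replicated tone, equation (63)

# Phase advance between neighbouring samples (instantaneous frequency estimate).
inst_freq = np.angle(s[1:] * np.conj(s[:-1]))
```

Inside each frame the estimate equals that frame's frequency, while at every frame border the phase jumps. A spectrogram of `s` would show the tone hopping between frequency locations, which is the pattern associated with the trembling sound.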

Figure 4.4. Patching source change for low band replication.

Figure 4.5. Tone trembling effect in the spectrogram, where the vertical axis is the frequency range from 0 to 22 kHz and the horizontal axis is the time in frames: (a) normal spectrogram; (b) abnormal spectrogram.

4.3. Tone Shift Artifact

A tone-rich signal, e.g. a flute sound, has a dense harmonic structure with regularly distributed tone series (see Figure 4.6 (a)). Tone-rich signals produce a noticeable phenomenon in SBR called the “tone shift” artifact. As illustrated in Figure 4.6 (b), the direct replication of low bands leads to obvious offsets between the recreated tones and the original ones. Exact matching of tones is almost impossible under direct replication.

SBR provides two mechanisms to correct the spectral structure of replicated low bands. The first is inverse filtering, which eliminates undesired tones in the replicated low bands. The second allows sinusoids to be added at the centers of the “high resolution frequency bands”. Even with both mechanisms, the tone shift artifact cannot be avoided because tones can be added only at a limited set of locations. Fortunately, the slight offsets are not easy to perceive, which may be due to the lower perceptual resolution of the critical bands in the HF range.


Figure 4.6. Tone shift effect: (a) original signal spectrum; (b) comparison of the original (with complete noise floor) and decoded spectra.

4.4. Noise Overflow and Tonal Spike

SBR can be regarded as a method for synthesizing HF bands from LF bands. The synthesis introduces some distortion between the original and the simulated HF bands. The “noise overflow” artifact is a common one in SBR and is caused by inaccuracy in the number and energy of tones in a T-F grid. The noise overflow artifact (see Figure 4.7) produces a rasping sound and significantly degrades the perceived quality. Tonal signals, such as the glockenspiel signal in Figure 4.7 (b), are very susceptible to this artifact. The accuracy of the tonality measure is crucial because underestimating tonal energy and/or overestimating noise energy directly leads to noise overflow. However, since the SBR syntax restricts the frequency location and number of compensated tones, the noise overflow artifact remains unavoidable even with an accurate tonality measure.

Another cause of the noise overflow artifact is the choice between the two envelope adjustment modes, “interpolation” and “non-interpolation” [5]. Figure 4.8 illustrates the two adjustment modes, where the energies of the original HF bands and those of the corresponding replicated LF bands in a T-F grid are shown in Figure 4.8 (a) and (b), respectively, and the dashed line marks the average energy of the original HF bands in the grid. In the interpolation mode, the energy of each subband in a T-F grid is adjusted individually to fit the average energy of the original high bands, as depicted in Figure 4.8 (c). In contrast, in the non-interpolation mode, the replicated bands in a T-F grid are not adjusted individually; they are all scaled up or down together to fit the average energy, as depicted in Figure 4.8 (d). Comparing the resulting envelopes in the two modes (see Figure 4.8 (c) and (d)), we observe that the interpolation mode generates a flat envelope in a grid, whereas the non-interpolation mode maintains the original envelope shape of the replicated low bands.
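The difference between the two adjustment modes can be illustrated with a toy calculation. The sketch below is a simplification of the description above, not the procedure of the SBR standard; the function name `adjust_envelope` and the energy values are hypothetical.

```python
import numpy as np

def adjust_envelope(e_lf, e_avg_hf, interpolation):
    """Scale replicated-band energies within one T-F grid.

    e_lf          : energies of the replicated LF subbands in the grid
    e_avg_hf      : average energy of the original HF bands in the grid
    interpolation : True  -> one gain per subband (flat result)
                    False -> one common gain (shape preserved)
    Returns the adjusted subband energies.
    """
    e_lf = np.asarray(e_lf, dtype=float)
    if interpolation:
        gains = e_avg_hf / e_lf              # energy gain per subband
    else:
        gains = e_avg_hf / e_lf.mean()       # single energy gain for the grid
    return e_lf * gains

e_lf = [4.0, 1.0, 1.0, 2.0]   # hypothetical energies; first subband holds a tone
flat = adjust_envelope(e_lf, e_avg_hf=4.0, interpolation=True)
shaped = adjust_envelope(e_lf, e_avg_hf=4.0, interpolation=False)
```

In this example the interpolation mode lifts the weak subbands to the level of the tonal subband, so the tone no longer stands out from its neighbours, whereas the non-interpolation mode keeps the energy ratios of the replicated bands while still matching the average energy.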

Figure 4.7. Noise overflow due to tone loss: (a) illustration of noise overflow caused by tone loss; (b) the spectrogram of glockenspiel with noise overflow (top: the original; bottom: the compressed).


Figure 4.8. Envelope adjustment at interpolation and non-interpolation modes: (a) energies of the original HF bands in a grid; (b) energies of the replicated LF bands in a grid; (c) adjusted energies of the replicated LF bands at interpolation mode; (d) adjusted energies of the replicated LF bands at non-interpolation mode.

In the interpolation mode, the inherently flat envelope cannot fit the sharp envelopes of tonal bands well. Hence, the interpolation mode must be used carefully for tonal signals because of the noise overflow effect. In Figure 4.9, the original signal contains one tone in the indicated passband. Although a tone is replicated from LF, it is overwhelmed by the amplified noise in the interpolation mode. Compensating the last two tones avoids the artifact because the tonality is maintained by the tone addition mechanism. Figure 4.10 provides a counterpart without tone compensation, which confirms that the tone addition mechanism is what protects against the noise overflow artifact in the interpolation mode. Figure 4.11 compares the adjusted spectra in the two modes: a serious noise overflow artifact occurs in the interpolation mode, whereas the envelope structure of the replicated low bands is maintained in the non-interpolation mode.

Conversely, compensating excessive tones or insufficient noise lowers the noise floor too far and leads to the “tonal spike” artifact (see Figure 4.12), which produces a “metallic” sound.

Figure 4.9. Noise overflow with tone compensation in interpolation mode.

Figure 4.10. Noise overflow without tone compensation in interpolation mode.

Figure 4.11. Noise overflow in interpolation and non-interpolation modes.

Figure 4.12. Tonal spike artifact.

4.5. Sawtooth Artifact

The SBR decoder provides the “limited gain” mechanism [5] to avoid excessive noise substitution, which leads to serious noise overflow artifacts. The “limited gain” value g is evaluated as (64) for a limiter grid defined by the limiter frequency band table and the time borders,

$$
g = \Phi\cdot\sqrt{\frac{E_g^H}{E_g^L}}, \tag{64}
$$

where $E_g^H$ and $E_g^L$ are the energies of the original HF bands and the replicated LF bands covered within the g-th limiter grid, and $\Phi$ can be chosen as 0.70795, 1, 1.41254 or $10^{10}$ (with $\Phi = 10^{10}$, the limited gain mechanism is effectively turned off). The limited gain adaptively restricts the upper bound of the gain for envelope adjustment so as to limit the degree of revision on the replicated low bands. The noise overflow artifact generally arises from a scaling gain that is much larger than the other gains in a limiter grid. Therefore, restricting the upper bound can restrain the noise overflow artifact.

However, this protection mechanism brings about another artifact, named the “sawtooth” artifact (see Figure 4.13 (b)). In Figure 4.13 (a), the original spectrum has a steep slope in the LF part and a flat slope in the HF part. To flatten the steep slope for the HF part, some scaling gains must be much larger than others. The limited gain restrains the larger scaling gains and hence destroys the slope adjustment in the reconstructed spectrum.
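The interaction between the gain limit and a steep replicated envelope can be sketched numerically. The example assumes that the limit takes the form $\Phi\cdot\sqrt{E^H/E^L}$ with energies summed over one limiter grid, and that the gains are amplitude gains; both are assumptions for illustration, and the function name `limited_gains` and the energy values are hypothetical.

```python
import numpy as np

def limited_gains(e_hf, e_lf, phi):
    """Clamp per-subband amplitude gains inside one limiter grid.

    e_hf : energies of the original HF subbands in the grid
    e_lf : energies of the replicated LF subbands in the grid
    phi  : limiter factor (0.70795, 1, 1.41254, or 1e10 to disable)
    Assumes the limit is phi * sqrt(sum(e_hf) / sum(e_lf)).
    """
    e_hf = np.asarray(e_hf, dtype=float)
    e_lf = np.asarray(e_lf, dtype=float)
    wanted = np.sqrt(e_hf / e_lf)                    # gains that would match e_hf
    limit = phi * np.sqrt(e_hf.sum() / e_lf.sum())   # upper bound for the grid
    return np.minimum(wanted, limit), limit

e_hf = [1.0, 1.0, 1.0, 1.0]      # hypothetical flat original HF envelope
e_lf = [16.0, 4.0, 1.0, 0.25]    # hypothetical steep replicated LF envelope

gains, limit = limited_gains(e_hf, e_lf, phi=1.41254)
e_out = np.asarray(e_lf) * gains**2                  # energies after adjustment

gains_off, _ = limited_gains(e_hf, e_lf, phi=1e10)
e_out_off = np.asarray(e_lf) * gains_off**2
```

With the limiter active, the two weakest subbands cannot be raised far enough, so part of the steep slope of the replicated bands survives in the output. With the limiter turned off, the flat target envelope is reached.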


Figure 4.13. Illustration of sawtooth artifact: (a) original audio signal spectrum; (b) decoded spectrum with sawtooth effect due to the limited gain mechanism; (c) decoded spectrum without sawtooth effect by turning off the limited gain mechanism.

4.6. Beat Artifact

When two tones are close to each other in frequency, their mutual interference generates amplitude fluctuation at a regular rate. The fluctuation in amplitude is known in the audio industry as the “beat” phenomenon [51]. For instance, when two equal-amplitude sine waves occur simultaneously, the resultant signal can be expressed as

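$$
x(t) = \sin(\omega_1 t) + \sin(\omega_2 t + \varphi) = 2\cos\!\left(\Delta\omega\cdot t + \frac{\varphi}{2}\right)\sin\!\left(\omega t + \frac{\varphi}{2}\right), \tag{65}
$$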

where $\Delta\omega = (\omega_2 - \omega_1)/2$ and $\omega = (\omega_2 + \omega_1)/2$. Once the frequencies of the two sine waves are close, i.e. $\Delta\omega$ is small, a slow periodic fluctuation is generated because the very low frequency cosine curve shapes the envelope of the higher-frequency sine wave. SBR risks generating the beat artifact because the tones patched from the low bands or the compensated tones have inaccurate positions.

For example, as shown in Figure 4.14 (c), after band replication (see also Figure 4.15), one replicated tone is close to another tone in the low band. Figure 4.14 (d) shows that the cosine envelope is imposed on the signal waveform. The fluctuation is clearly perceptible.
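The sum-to-product identity behind the beat phenomenon can be verified numerically. The sketch below uses a hypothetical sampling rate, tone frequencies, and phase offset; it compares the sum of two sine waves with the product of the slow cosine envelope and the fast sine carrier.

```python
import numpy as np

fs = 44100.0                         # sampling rate in Hz (hypothetical)
f1, f2, phi = 5000.0, 5020.0, 0.4    # two close tones and a phase offset (hypothetical)
t = np.arange(int(fs)) / fs          # one second of samples

w1, w2 = 2 * np.pi * f1, 2 * np.pi * f2
delta_w = (w2 - w1) / 2
mean_w = (w2 + w1) / 2

two_tones = np.sin(w1 * t) + np.sin(w2 * t + phi)
envelope = 2 * np.cos(delta_w * t + phi / 2)          # slow cosine term
product = envelope * np.sin(mean_w * t + phi / 2)     # sum-to-product form

beat_rate_hz = abs(f2 - f1)     # |envelope| repeats this many times per second
```

For these values the magnitude of the envelope repeats 20 times per second, which is slow enough to be heard as a fluctuation in loudness rather than as a separate pitch.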


Figure 4.14. Beat artifact: (a) original spectrum containing two tones with large distance; (b) time-domain waveform for (a); (c) decoded spectrum containing two tones with small distance; (d) time-domain waveform for (c).

Figure 4.15. Explication of the beat artifact in Figure 4.14: (a) original spectrum; (b) decoded AAC LF spectrum; (c) HF generation by SBR; (d) HF adjustment by SBR.

4.7. Linear Predictive Bias on CEMFB Subbands

Rather than the cosine modulated filterbank (CMFB) commonly employed in audio coding, SBR utilizes the comparatively high-complexity complex-exponential modulated filterbank (CEMFB) [8] to eliminate the main alias terms and thus avoid the alias artifact introduced by spectral adjustment or equalization. In this section, however, we demonstrate that, when applied to the CEMFB subbands, the conventional LP method defined in the SBR standard has an inherent predictive bias that affects the whitening effect and the noise-to-signal ratio (NSR) measure. We demonstrate the predictive bias through first-order and second-order autoregressive (AR) modeling on analytic signals together with empirical verification on the CEMFB subbands. Subsequently, a new filter, named the decimation-whitening filter, is proposed to remove the bias for the SBR algorithm.

4.7.1. CEMFB Subbands and Analytic Signals

The discrete-time analytic signal $x_+(n)$ corresponding to a real signal $x(n)$ [52] is defined as $x(n) + j\hat{x}(n)$, where $\hat{x}(n)$ denotes the discrete-time Hilbert transform of $x(n)$:

$$
\hat{x}(n) = \frac{2}{\pi}\sum_{\substack{k=-\infty \\ k\neq 0}}^{\infty} x(n-k)\cdot\frac{\sin^{2}(\pi k/2)}{k}. \tag{66}
$$
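The discrete-time Hilbert transform can be approximated by truncating the infinite sum to a finite number of taps. The sketch below assumes the standard kernel $h(k) = (2/\pi)\sin^{2}(\pi k/2)/k$ with $h(0) = 0$; the truncation length, test frequency, and the names `hilbert_kernel` and `max_err` are illustrative assumptions.

```python
import numpy as np

def hilbert_kernel(K):
    """Taps h(k) = (2/pi) * sin^2(pi*k/2) / k for k = -K..K, with h(0) = 0."""
    k = np.arange(-K, K + 1)
    h = np.zeros(k.size)
    nz = k != 0
    h[nz] = (2 / np.pi) * np.sin(np.pi * k[nz] / 2) ** 2 / k[nz]
    return h

K = 1000                                   # truncation of the infinite sum (assumption)
n = np.arange(4001)
w0 = 0.3 * np.pi                           # hypothetical test frequency
x = np.cos(w0 * n)

x_hat = np.convolve(x, hilbert_kernel(K), mode="same")
x_plus = x + 1j * x_hat                    # analytic signal x(n) + j*x_hat(n)

core = slice(K, n.size - K)                # samples where all 2K+1 taps overlap x
max_err = np.max(np.abs(x_plus[core] - np.exp(1j * w0 * n[core])))
```

For a cosine input, the transform is approximately a sine of the same frequency, so the analytic signal is close to a single complex exponential that occupies only the positive side of the spectrum. The residual error comes from the truncation of the kernel.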

In the frequency domain, the relation between the original and analytic signals is given by

Similarly, the analytic signal x−(n) containing merely the negative spectrum can be defined as [...] sine modulated versions of the same prototype filter, which can be interpreted as the Hilbert transforms of the real part. Accordingly, the resultant subbands decimated by M can be approximately regarded as the analytic signals of the real output obtained from the CMFB [8].

Moreover, the CEMFB subbands alternately consist of positive and negative analytic signals.

In the absence of either the positive or negative side band, the excitation noise for each CEMFB subband can be also regarded as the analytic signal that has flat power spectrum
