Chapter 6 Artifacts in SBR
6.2 Tone Shift
In tone-rich signals, the tonal components are usually distributed regularly, as Figure 26 shows. For this kind of signal, inverse filtering and the addition of extra tonal components are ineffective and bit-consuming. Furthermore, sinusoids in the frequency domain correspond to signals with constant magnitude in the time domain.
Therefore, the non-clipping method should perform better than the clipping method in either the time or the frequency domain, i.e., no time borders or additional components are needed. However, it causes a tone shift artifact in the reconstructed signal. The artifact is called “tone shift” because of the shape of its spectrum, as shown in Figure 27. The blue line represents the spectrum of the original signal, and the red line is the spectrum reconstructed by HE-AAC. In the SBR range, the tonal components are clearly offset from the original ones, although they remain regularly spaced.
Perceptually, however, this artifact is hard to notice.
Figure 26: An example for characteristics of tone-rich signals
6.3 Sawtooth
The limiter gain mechanism in the SBR decoder avoids unwanted noise substitution by restricting the maximum gain values. As mentioned in Chapter 2, these maximum gain values are calculated according to a limiter frequency band table, which has a coarser frequency resolution than the one used for the original gain values. Hence, the limiter gain mechanism tends to preserve the envelope of the replicated signal.
The limiter gain can be regarded as the maximum rescaling value for adjusting the replicated content toward the original. If the rescaling value exceeds the limiter gain, the adjustment of the replicated signal is excessive and may destroy the continuity of the spectrum. However, this protection mechanism may introduce an artifact when the envelope of the replicated signal differs greatly from the envelope of the original.
In Figure 28, the envelope of the high bands is flat, but the envelope of the low bands is sharp.
In such a situation, the envelope of the replicated signal must be revised to make the reconstructed content as similar as possible to the original. Therefore, if the limiter gain mechanism is always on, the adjustment is restricted, and the resulting envelope of the reconstructed signal may be discontinuous. This phenomenon is called the sawtooth artifact because of the discontinuous envelope, as illustrated in Figure 29.
Figure 28: The envelope comparison between high bands and low bands. The envelope of high bands is flat, and the envelope of low bands is sharp.
Figure 29: Sawtooth effect due to the limiter gain mechanism.
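The interaction between the limiter and a mismatched envelope can be illustrated with a small sketch. The following Python code is not the normative SBR procedure; the function names, the `lim_gain` headroom factor of 2.0, the two limiter bands, and the energy values are all assumptions chosen for illustration. It computes per-subband gains, caps them in each limiter band at a ceiling derived from the band-average energy ratio, and compares the resulting envelopes when the target is flat but the replicated envelope is sharp.

```python
import math


def limit_gains(e_orig, e_curr, limiter_bands, lim_gain=2.0, eps=1e-12):
    """Return (unlimited, limited) per-subband gains.

    e_orig, e_curr: target and replicated energies per subband.
    limiter_bands: (start, stop) subband index ranges, stop exclusive.
    lim_gain: headroom above the band-average gain (assumed value).
    """
    gains = [math.sqrt((eo + eps) / (ec + eps)) for eo, ec in zip(e_orig, e_curr)]
    limited = list(gains)
    for start, stop in limiter_bands:
        # The ceiling is based on the energy ratio over the coarse limiter band.
        g_max = lim_gain * math.sqrt(
            (sum(e_orig[start:stop]) + eps) / (sum(e_curr[start:stop]) + eps)
        )
        for k in range(start, stop):
            limited[k] = min(gains[k], g_max)
    return gains, limited


def adjusted_energy(e_curr, gains):
    """Energy of each subband after applying its gain."""
    return [ec * g * g for ec, g in zip(e_curr, gains)]


# Flat target envelope (high bands) and a sharp replicated envelope (low bands).
e_orig = [1.0] * 8
e_curr = [4.0, 0.01, 4.0, 0.01, 4.0, 0.01, 4.0, 0.01]
limiter_bands = [(0, 4), (4, 8)]  # hypothetical limiter band table

gains, limited = limit_gains(e_orig, e_curr, limiter_bands)
unlimited_env = adjusted_energy(e_curr, gains)
limited_env = adjusted_energy(e_curr, limited)

print("without limiter:", [round(e, 3) for e in unlimited_env])
print("with limiter:   ", [round(e, 3) for e in limited_env])
```

Without the limiter, every subband reaches the flat target. With the limiter, the weak subbands cannot be raised far enough, so the envelope alternates between high and low values, which is the discontinuity described above.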
6.4 Noise Floor Overflow
Noise floor overflow is a common artifact in the SBR codec, and it has two main causes. The first is missed tone detection in the T/F grid. After envelope adjustment in the decoder, the mismatch between the noise-like replicated low band and the tonal original high band causes a large boost of the noise floor in the replicated low band to compensate for the energy of the lost tones. The “noise-floor overflow” phenomenon, shown in Figure 30, produces a fizzy sound and results in serious quality degradation.
Figure 30: Noise floor overflow due to failure of detecting tones in high bands.
The accuracy of the tonality measurement is critical for this artifact, because either underestimating the tonal energy or overestimating the energy of the noise component leads to noise-floor overflow. However, the constraints of the SBR syntax restrict the location and number of additional tones, and hence the noise-floor overflow effect may remain unavoidable even with accurate
The second cause of noise floor overflow is the interpolation mode. As mentioned in Chapter 2, the envelope of the current SBR frame can be estimated in two modes, interpolation and non-interpolation. Comparing the resulting envelopes, the interpolation mode generates a flat envelope, whereas the non-interpolation mode maintains the original envelope structure of the replicated low bands. In other words, under the interpolation mode, the inherent flatness of the envelope does not agree with the sharp envelope of tonal bands. Hence, to avoid the noise-floor overflow effect, the envelope estimation mode should be switched between interpolation and non-interpolation. When the envelope of the high bands is flat, the interpolation mode is selected; when the envelope of the high bands is sharp, the non-interpolation mode should be used.
The tonality information reflects the characteristics of the corresponding envelope, so the estimation mode can be determined from the tonalities. As shown in Figure 31, there is serious noise-floor overflow around the first tone, which is replicated from the low bands and almost overwhelmed by the amplified noise. The last two tones, which are compensated by tone addition, do not show the artifact, because the tonality information is kept by the tone adding mechanism.
Without the mechanism, as in Figure 32, the artifact occurs again. This also demonstrates the immunity of the tone adding mechanism to the noise-floor overflow effect under the interpolation mode.
Figure 31: Noise floor overflow due to the interpolation mode. The circle indicates that the noise floor overflow results from averaging the energy together with the tone component.
Figure 32: A comparison to Figure 31, showing the result without the tone addition mechanism.
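The difference between the two envelope estimation modes discussed in this section can be sketched numerically. The code below is a simplified illustration and not the decoder implementation: the function names, the four-subband scalefactor band, and the energy values are assumptions. In the interpolation case the current energy is estimated per subband, and in the non-interpolation case one average energy is used for the whole band.

```python
import math


def envelope_gains(e_low, e_target, interpolate):
    """Gains for the subbands of one scalefactor band.

    e_low: energies of the replicated low-band subbands.
    e_target: target energy per subband for the band.
    interpolate: True  -> current energy estimated per subband,
                 False -> band-average energy used for every subband.
    """
    if interpolate:
        e_curr = list(e_low)
    else:
        mean = sum(e_low) / len(e_low)
        e_curr = [mean] * len(e_low)
    return [math.sqrt(e_target / ec) for ec in e_curr]


def apply_gains(e_low, gains):
    """Subband energies after gain adjustment."""
    return [e * g * g for e, g in zip(e_low, gains)]


# One tone on a low noise floor (hypothetical energies).
e_low = [0.1, 0.1, 8.0, 0.1]
# Target chosen so that the total band energy is unchanged.
e_target = sum(e_low) / len(e_low)

interp = apply_gains(e_low, envelope_gains(e_low, e_target, True))
non_interp = apply_gains(e_low, envelope_gains(e_low, e_target, False))

print("interpolation:    ", [round(e, 3) for e in interp])
print("non-interpolation:", [round(e, 3) for e in non_interp])
```

In this example the interpolation mode flattens the band, so the noise subbands next to the tone are raised to the same level as the tone, while the non-interpolation mode keeps the tone-to-noise structure of the replicated band.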
Chapter 7 Experiments
In this chapter, extensive experiments are conducted to verify the improvement of the proposed methods through objective and subjective measurements. NCTU-AAC [20] is adopted as the core encoder in our NCTU HE-AAC [21]. The MPEG test tracks are chosen as the default track database. Next, experiments on the PSPLAB audio database [22] are carried out to verify the robustness of our design. Through both objective and subjective tests, the efficiency and quality of the proposed methods are examined.
7.1 Measurement Tools Description
Objective Quality Measurement Tool
In the objective test, we choose EAQUAL as the tool to assess the audio quality. EAQUAL stands for Evaluation of Audio Quality. The objective difference grade (ODG) is the output variable of this measurement tool. The ODG ranges from 0 to -4, where 0 represents an imperceptible impairment and -4 corresponds to a very annoying impairment. An ODG improvement of about 0.1 is usually perceptually audible. The implementation of EAQUAL is based on ITU-R Recommendation BS.1387 [23].
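As a reading aid for the ODG values reported in the following tables, the sketch below maps an ODG to the nearest grade of the five-grade impairment scale and checks whether a difference between two ODGs reaches the 0.1 level mentioned above. The helper names and the rounding to the nearest grade are assumptions made for illustration; they are not part of EAQUAL.

```python
# Five-grade impairment scale associated with the ODG range 0 to -4.
ODG_GRADES = {
    0: "imperceptible",
    -1: "perceptible but not annoying",
    -2: "slightly annoying",
    -3: "annoying",
    -4: "very annoying",
}


def describe_odg(odg):
    """Return the label of the grade nearest to the given ODG."""
    clamped = max(-4.0, min(0.0, odg))
    nearest = -int(abs(clamped) + 0.5)
    return ODG_GRADES[nearest]


def audible_improvement(odg_before, odg_after, threshold=0.1):
    """True if the ODG rises by at least `threshold` (assumed to be 0.1)."""
    return (odg_after - odg_before) >= threshold


# Example with si02 at 96 kbps: uniform 7 cuts (-1.07) vs. proposed design (-0.79).
print(describe_odg(-1.07), "->", describe_odg(-0.79))
print("audible improvement:", audible_improvement(-1.07, -0.79))
```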
Subjective Quality Measurement Tool
We use MUSHRA to conduct the subjective test. MUSHRA (Multiple Stimulus with Hidden Reference and Anchors) is designed to give a reliable and repeatable measure of the audio quality of intermediate-quality signals. MUSHRA has the advantage that it provides an absolute measure of the audio quality of a codec, which can be compared directly with the reference, i.e., the original audio signal, as well as with the anchors [24]. MUSHRA follows the test method and impairment scale recommended by ITU-R BS.1116 [25].
7.2 Objective Quality Measurement in MPEG Test Tracks
The twelve tracks recommended by MPEG, which contain critical music covering percussion, string, and wind instruments as well as human vocals, are included in the assessment of audio quality. Table 3 shows the characteristics and details of these tracks. In this section, the quality improvement of the proposed methods at different bit rates is evaluated on these MPEG test tracks.
No.  Track  Signal Description          Mode    Time (sec)  Remark
1    es01   Vocal (Suzanne Vega)        Stereo  10          (c)
2    es02   German speech               Stereo  8           (c)
3    es03   English speech              Stereo  7           (c)
4    sc01   Trumpet solo and orchestra  Stereo  10          (b) (d)
5    sc02   Orchestral piece            Stereo  12          (d)
6    sc03   Contemporary pop music      Stereo  11          (d)
7    si01   Harpsichord                 Stereo  7           (b)
8    si02   Castanets                   Stereo  7           (a)
9    si03   Pitch pipe                  Stereo  27          (b)
10   sm01   Bagpipes                    Stereo  11          (b)
11   sm02   Glockenspiel                Stereo  10          (a) (b)
12   sm03   Plucked strings             Stereo  13          (a) (b)
Remarks:
(a) Transients: pre-echo sensitive, smearing of noise in temporal domain.
(b) Tonal/Harmonic structure: noise sensitive, roughness.
(c) Natural vocal (critical combination of tonal parts and attacks): distortion sensitive, smearing of attacks.
(d) Complex sound: stresses the Device Under Test.
Table 3: The twelve tracks recommended by MPEG
Six different methods are compared at bit rates from 48 kbps to 112 kbps. The first is NCTU-AAC; the second to fifth are NCTU HE-AAC with uniform “cuts” in the T/F grid, using 0, 1, 3, and 7 cuts respectively. The frequency table is the one suggested by the SBR standard [3]. The sixth is NCTU HE-AAC with the proposed T/F grid design.
Bit Rate: 112 kbps
Track Method 1 Method 2 Method 3 Method 4 Method 5 Method 6
es01 -0.35 -0.88 -0.67 -0.6 -0.56 -0.56
es02 -0.11 -0.86 -0.61 -0.57 -0.55 -0.49
es03 -0.19 -1.63 -0.87 -0.6 -0.57 -0.58
sc01 -0.61 -0.6 -0.59 -0.6 -0.6 -0.57
sc02 -1.13 -0.63 -0.59 -0.58 -0.58 -0.62
sc03 -0.92 -0.89 -0.76 -0.73 -0.72 -0.74
si01 -1.03 -1.28 -1.05 -1.03 -0.95 -1
si02 -0.92 -2.21 -1.38 -1.04 -0.99 -0.7
si03 -1.49 -1.06 -1.07 -1.07 -1.05 -1
sm01 -1.18 -1.19 -1.18 -1.18 -1.11 -1.07
sm02 -0.62 -1.72 -1.16 -1.16 -1.11 -1.21
sm03 -1.19 -1.75 -0.95 -0.91 -0.9 -0.89
Average -0.8117 -1.225 -0.9067 -0.8392 -0.8075 -0.7858
Sample Rate: 44100 Hz
Coding Methods:
1: NCTU-AAC; 2: NCTU HE-AAC without any cuts in T/F grid; 3: NCTU HE-AAC with uniform 1 cut; 4: NCTU HE-AAC with uniform 3 cuts; 5: NCTU HE-AAC with uniform 7 cuts; 6: Proposed DP design of T/F grid.
Table 4: Objective measurements through ODGs for different T/F grid design in HE-AAC at 112 kbps.
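The Average row of Table 4 is the arithmetic mean of the twelve per-track ODGs for each coding method. A minimal sketch that recomputes it from the values printed in the table is given below; the variable and function names are illustrative only.

```python
# Per-track ODGs from Table 4 (112 kbps), columns are coding methods 1 to 6.
TABLE4 = {
    "es01": [-0.35, -0.88, -0.67, -0.60, -0.56, -0.56],
    "es02": [-0.11, -0.86, -0.61, -0.57, -0.55, -0.49],
    "es03": [-0.19, -1.63, -0.87, -0.60, -0.57, -0.58],
    "sc01": [-0.61, -0.60, -0.59, -0.60, -0.60, -0.57],
    "sc02": [-1.13, -0.63, -0.59, -0.58, -0.58, -0.62],
    "sc03": [-0.92, -0.89, -0.76, -0.73, -0.72, -0.74],
    "si01": [-1.03, -1.28, -1.05, -1.03, -0.95, -1.00],
    "si02": [-0.92, -2.21, -1.38, -1.04, -0.99, -0.70],
    "si03": [-1.49, -1.06, -1.07, -1.07, -1.05, -1.00],
    "sm01": [-1.18, -1.19, -1.18, -1.18, -1.11, -1.07],
    "sm02": [-0.62, -1.72, -1.16, -1.16, -1.11, -1.21],
    "sm03": [-1.19, -1.75, -0.95, -0.91, -0.90, -0.89],
}

# Average row as printed in Table 4.
PRINTED_AVERAGE = [-0.8117, -1.225, -0.9067, -0.8392, -0.8075, -0.7858]


def column_averages(table):
    """Mean ODG per coding method over all tracks."""
    rows = list(table.values())
    return [sum(row[m] for row in rows) / len(rows) for m in range(len(rows[0]))]


averages = column_averages(TABLE4)
# ODG is at most 0, so the largest (closest to 0) average is the best.
best_method = max(range(len(averages)), key=lambda m: averages[m]) + 1

for m, avg in enumerate(averages, start=1):
    print(f"method {m}: {avg:.4f}")
print("best average ODG: method", best_method)
```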
Figure 33: The ODG variance comparison of Table 4.
Bit Rate: 96 kbps
Track Method 1 Method 2 Method 3 Method 4 Method 5 Method 6
es01 -0.54 -0.91 -0.7 -0.64 -0.6 -0.6
es02 -0.23 -0.86 -0.62 -0.58 -0.57 -0.51
es03 -0.37 -1.65 -0.9 -0.62 -0.61 -0.6
sc01 -1.01 -0.74 -0.71 -0.72 -0.74 -0.68
sc02 -1.7 -0.75 -0.72 -0.71 -0.72 -0.74
sc03 -1.5 -1.01 -0.88 -0.85 -0.84 -0.86
si01 -1.77 -1.5 -1.22 -1.23 -1.19 -1.18
si02 -1.27 -2.31 -1.51 -1.13 -1.07 -0.79
si03 -2.56 -1.28 -1.31 -1.34 -1.34 -1.22
sm01 -2.22 -1.36 -1.34 -1.34 -1.3 -1.23
sm02 -1.05 -1.84 -1.3 -1.29 -1.26 -1.3
sm03 -1.87 -1.97 -1.15 -1.13 -1.14 -1.08
Average -1.3408 -1.3483 -1.03 -0.965 -0.9483 -0.8992
Sample Rate: 44100 Hz
Coding Methods:
1: NCTU-AAC; 2: NCTU HE-AAC without any cuts in T/F grid; 3: NCTU HE-AAC with uniform 1 cut; 4: NCTU HE-AAC with uniform 3 cuts; 5: NCTU HE-AAC with uniform 7 cuts; 6: Proposed DP design of T/F grid.
Table 5: Objective measurements through ODGs for different T/F grid design in HE-AAC at 96 kbps.
Figure 34: The ODG variance comparison of Table 5.
Bit Rate: 80 kbps
Track Method 1 Method 2 Method 3 Method 4 Method 5 Method 6
es01 -0.8 -0.97 -0.77 -0.71 -0.69 -0.67
es02 -0.49 -0.92 -0.68 -0.66 -0.64 -0.59
es03 -0.74 -1.71 -0.96 -0.71 -0.71 -0.68
sc01 -1.61 -1 -1 -1.02 -1.08 -0.96
sc02 -2.49 -1.07 -1.05 -1.05 -1.1 -1.08
sc03 -2.47 -1.27 -1.13 -1.1 -1.11 -1.11
si01 -2.78 -1.97 -1.63 -1.61 -1.57 -1.59
si02 -1.94 -2.43 -1.65 -1.31 -1.27 -1
si03 -3.66 -1.65 -1.67 -1.71 -1.71 -1.6
sm01 -3.38 -1.63 -1.62 -1.65 -1.66 -1.55
sm02 -1.82 -2.02 -1.58 -1.56 -1.55 -1.52
sm03 -2.6 -2.21 -1.41 -1.37 -1.44 -1.3
Average -2.065 -1.5708 -1.2625 -1.205 -1.2108 -1.1375
Sample Rate: 44100 Hz
Coding Methods:
1: NCTU-AAC; 2: NCTU HE-AAC without any cuts in T/F grid; 3: NCTU HE-AAC with uniform 1 cut; 4: NCTU HE-AAC with uniform 3 cuts; 5: NCTU HE-AAC with uniform 7 cuts; 6: Proposed DP design of T/F grid.
Table 6: Objective measurements through ODGs for different T/F grid design in HE-AAC at 80 kbps.
Figure 35: The ODG variance comparison of Table 6.
Bit Rate: 64 kbps
Track Method 1 Method 2 Method 3 Method 4 Method 5 Method 6
es01 -1.55 -1.58 -1.09 -0.98 -0.98 -0.93
es02 -1.16 -1.45 -0.97 -0.89 -0.93 -0.82
es03 -1.53 -2.49 -1.46 -1 -1 -0.96
sc01 -1.96 -1.79 -1.76 -1.84 -1.98 -1.56
sc02 -3 -1.68 -1.67 -1.69 -1.82 -1.65
sc03 -3.38 -1.79 -1.63 -1.55 -1.62 -1.58
si01 -3.54 -2.31 -1.95 -1.95 -2.05 -1.93
si02 -2.85 -2.77 -1.95 -1.68 -1.72 -1.35
si03 -3.85 -2.07 -2.12 -2.2 -2.33 -2.04
sm01 -3.78 -2.18 -2.19 -2.21 -2.3 -2.12
sm02 -2.77 -3.03 -2.17 -2.09 -2.36 -2.18
sm03 -3.2 -2.64 -1.78 -1.77 -1.86 -1.64
Average -2.7142 -2.1483 -1.7283 -1.6542 -1.7458 -1.5633
Sample Rate: 44100 Hz
Coding Methods:
1: NCTU-AAC; 2: NCTU HE-AAC without any cuts in T/F grid; 3: NCTU HE-AAC with uniform 1 cut; 4: NCTU HE-AAC with uniform 3 cuts; 5: NCTU HE-AAC with uniform 7 cuts; 6: Proposed DP design of T/F grid.
Table 7: Objective measurements through ODGs for different T/F grid design in HE-AAC at 64 kbps.
Figure 36: The ODG variance comparison of Table 7.
Bit Rate: 48 kbps
Track Method 1 Method 2 Method 3 Method 4 Method 5 Method 6
es01 -3.12 -2.02 -1.58 -1.58 -1.74 -1.48
es02 -2.49 -1.81 -1.42 -1.47 -1.8 -1.28
es03 -3.23 -2.85 -1.97 -1.71 -2.08 -1.67
sc01 -2.5 -2.73 -2.71 -2.73 -2.92 -2.34
sc02 -3.14 -2.55 -2.53 -2.51 -2.54 -2.47
sc03 -3.63 -2.54 -2.35 -2.32 -2.41 -2.29
si01 -3.77 -3.03 -2.75 -2.75 -2.9 -2.68
si02 -3.58 -3.26 -2.67 -2.61 -2.66 -2.26
si03 -3.88 -3.15 -3.21 -3.29 -3.42 -3.15
sm01 -3.88 -3.18 -3.19 -3.25 -3.41 -3.17
sm02 -3.33 -3.63 -3.21 -3.26 -3.29 -3.25
sm03 -3.51 -3.09 -2.41 -2.41 -2.54 -2.2
Average -3.3383 -2.82 -2.5 -2.4908 -2.6425 -2.3533
Sample Rate: 44100 Hz
Coding Methods:
1: NCTU-AAC; 2: NCTU HE-AAC without any cuts in T/F grid; 3: NCTU HE-AAC with uniform 1 cut; 4: NCTU HE-AAC with uniform 3 cuts; 5: NCTU HE-AAC with uniform 7 cuts; 6: Proposed DP design of T/F grid.
Table 8: Objective measurements through ODGs for different T/F grid design in HE-AAC at 48 kbps.
Figure 37: The ODG variance comparison of Table 8.
Summary
In the above experiments, method 3 represents low bit consumption in the T/F grid, and methods 4 and 5 represent medium and high bit consumption respectively. Therefore, at high bit rates, such as 112 kbps and 96 kbps, method 5 is better than methods 3 and 4. Conversely, at low bit rates, such as 64 kbps and 48 kbps, method 5 becomes worse than methods 3 and 4. However, the average ODG of our proposed design is the best among these coding methods at every bit rate. The ODG versus bit rate curves are illustrated in Figure 38, which shows how the ODG degrades as the bit rate decreases. From Figure 38, the quality of AAC drops rapidly as the bit rate decreases. In contrast, the curves of the SBR codecs are smoother than that of the AAC codec. Furthermore, the curve of our design lies above the others, which shows the efficiency of the proposed method.
Figure 38: ODG-bit rate curve comparison among different T/F grid design.
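The observations in this summary can be checked directly against the Average rows of Tables 4 to 8. The sketch below uses only those printed averages; the function and variable names are illustrative.

```python
# Average ODG per coding method (1 to 6), taken from Tables 4 to 8.
AVERAGE_ODG = {
    112: [-0.8117, -1.2250, -0.9067, -0.8392, -0.8075, -0.7858],
    96: [-1.3408, -1.3483, -1.0300, -0.9650, -0.9483, -0.8992],
    80: [-2.0650, -1.5708, -1.2625, -1.2050, -1.2108, -1.1375],
    64: [-2.7142, -2.1483, -1.7283, -1.6542, -1.7458, -1.5633],
    48: [-3.3383, -2.8200, -2.5000, -2.4908, -2.6425, -2.3533],
}


def best_method(averages):
    """Method number (1-based) with the average ODG closest to 0."""
    return max(range(len(averages)), key=lambda m: averages[m]) + 1


def method5_beats_3_and_4(averages):
    """True if uniform 7 cuts outperforms both uniform 1 cut and 3 cuts."""
    return averages[4] > max(averages[2], averages[3])


best = {rate: best_method(avg) for rate, avg in AVERAGE_ODG.items()}
m5_wins = {rate: method5_beats_3_and_4(avg) for rate, avg in AVERAGE_ODG.items()}

# ODG lost by each method when going from 112 kbps down to 48 kbps.
degradation = {
    m + 1: AVERAGE_ODG[112][m] - AVERAGE_ODG[48][m] for m in range(6)
}

for rate in sorted(AVERAGE_ODG, reverse=True):
    print(f"{rate} kbps: best = method {best[rate]}, "
          f"method 5 beats 3 and 4 = {m5_wins[rate]}")
print("ODG drop from 112 to 48 kbps:",
      {m: round(d, 4) for m, d in degradation.items()})
```

The printed drop is largest for method 1 (NCTU-AAC), which corresponds to the steep AAC curve described above.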
7.3 Objective Quality Measurement in Music Database
risk for a variety of audio categories, the PSPLAB audio database [22] is adopted as the test material. The database includes 327 tracks, which are separated into 16 sets with different signal properties, as shown in Table 9.
No.  Bitstream Category    Number of Tracks  Remark
1    FF123                 103               Killer bitstream collection from ff123.
2    Gpsycho               24                LAME quality test bitstream.
3    HA64KTest             39                64 kbps test bitstream for multi-format in HA forum.
4    HA128KTestV2          12                128 kbps test bitstream for multi-format in HA forum.
5    Horrible_song         16                Collections of critical songs among all bitstreams in PSPLab.
6    Ingets1               5                 Bitstream collection from the test of OGG Vorbis pre 1.0 listening test.
7    Mono                  3                 Mono test bitstream.
8    MPEG                  12                MPEG test bitstream set for 48 kHz.
9    MPEG44100             12                MPEG test bitstream set for 44100 Hz.
10   Phong                 8                 Test bitstream collection from Phong.
11   PSPLab                37                Collections of bitstreams from the early age of PSPLab. Some are good as killer.
12   Sjeng                 3                 Small bitstream collection by sjeng.
13   SQAM                  16                Sound quality assessment material recordings for subjective tests.
14   TestingSong14         14                Test bitstream collection from rshong.
15   TonalSignals          15                Artificial bitstreams that contain sine waves etc.
16   VORBIS_TESTS_Samples  8                 First 8 Vorbis testing samples from HA.
Total                      327
Table 9: The PSPLAB audio database.
In this section, two coding methods are compared with our design. The first is NCTU HE-AAC with a uniform 1 cut in the T/F grid, and the second is NCTU HE-AAC with uniform 7 cuts. The two methods represent different degrees of bit consumption for the T/F grid, and both adopt the frequency table suggested in the SBR standard.
Figure 39: The average ODG of three coding methods in PSPLAB audio database at bit rate 80 kbps and sampling rate 44100 Hz. M1 is uniform 1 cut in T/F grid and M2 is uniform 7 cuts in T/F grid. M3 is our design.
Figure 40: The average ODG of three coding methods in PSPLAB audio database at bit rate 64 kbps and sampling rate 44100 Hz. M1 is uniform 1 cut in T/F grid and M2 is uniform 7 cuts in T/F grid. M3 is our design.
Figure 41: The average ODG of three coding methods in PSPLAB audio database at bit rate 48 kbps and sampling rate 44100 Hz. M1 is uniform 1 cut in T/F grid and M2 is uniform 7 cuts in T/F grid. M3 is our design.
Summary
As mentioned above, at 80 kbps M2 is better than M1, whereas M1 is better than M2 at 48 kbps. However, our T/F grid design is better than the other two coding methods for most sets in the audio database, which demonstrates the robustness and flexibility of our design.
Two sets need closer examination, FF123 and TonalSignals, because the ODG of our design is worse in both. Analysis of the signals in these two sets shows that the problem tracks fall into three categories: frequency table selection, tone vanishing, and noise floor overflow. A higher-resolution frequency table is needed in two situations: when the envelope of the high bands changes rapidly, and when tone components are added. Figure 42 and Figure 43 show two examples of unstable high band envelopes. In Figure 42, the envelope changes greatly over time, and in Figure 43, the envelope is very sharp.
Figure 44 shows another situation that needs a higher-resolution frequency table. When additional tone components are added, a detailed frequency table can increase the coding precision. Therefore, the policy of frequency table selection should be revised. Most of the time, the frequency table should be coarse to save bits; when the high band envelope is unstable, the higher-resolution table should be used.
Figure 42: An example of a signal whose high band envelope changes rapidly.
Figure 43: An example of a signal that has a sharp high band envelope.
Figure 44: The error of noise floor due to tone addition.
Another problem occurs in the audio track shown in Figure 45. Through the SBR codec, the spectrum of the reconstructed signal breaks into discontinuous segments. Moreover, compared with M2, which uses uniform 7 cuts in the T/F grid, our T/F grid design makes the tone vanish in some frames of the reconstructed signal. The tone-vanishing phenomenon is evident in Figure 46 and Figure 47. Four tracks have this property: sweep, halfsweep, halfsweepinvert and 20k-20. In addition, we also tested these tracks with aacPlus [12], and the result is shown in Figure 48. The tone component is clearly missing and replaced by noise. Figure 49 shows a similar spectrum; the difference is that the tone component is not continuous in the frequency domain. The signals reconstructed by NCTU HE-AAC and aacPlus are illustrated in Figure 50 and Figure 51 respectively. Besides the tone-vanishing phenomenon, the energy of the tone added by NCTU HE-AAC is lower than that of the original.
Figure 45: The spectrum of “sweep”, which has a continuous tone from 10 kHz to 22 kHz.
Figure 46: The reconstructed spectrum of Figure 45 through uniform 7 cuts in T/F grid.
Figure 47: The reconstructed spectrum of Figure 45 through DP T/F grid design.
Figure 48: The reconstructed spectrum of Figure 45 through aacPlus.
Figure 49: The spectrum of “sin_300_625_1k_5k_10k_15K_20k_m20db”, which has interrupted tones from low frequency bands to high frequency bands.
Figure 50: The reconstructed spectrum of Figure 49 through DP T/F grid design.
Figure 51: The reconstructed spectrum of Figure 49 through aacPlus.
The last problem is noise floor overflow due to the interpolation mode. Two tracks in the TonalSignals set show serious noise floor overflow: sin_600_19800_9div_m20_0db and sin_9kind_valious. Taking sin_9kind_valious as an example, the frequency analysis is illustrated in Figure 52, and the signal reconstructed by our design is shown in Figure 53. Therefore, for harmonic signals, the non-interpolation mode should be used.
Figure 52: The frequency envelope of “sin_9kind_valious”.
Figure 53: Noise floor overflow due to interpolation mode.
7.4 Subjective Quality Measurement
After the objective quality measurement, we perform a subjective listening test to verify the quality improvement and possible risks of the methods proposed in this thesis.
The subjective quality measurement is based on the MPEG test tracks and uses MUSHRA as the test tool. Three coding methods are compared. The first is NCTU HE-AAC without any cuts in the T/F grid, the second is NCTU HE-AAC with uniform 7 cuts in the T/F grid, and the last is our proposed design.
The test bit rate is 80 kbps, and the result is illustrated in Figure 54.
Figure 54: The result of subjective test at 80 kbps.
Summary
behind M2 and M3, because this kind of signal needs a finer T/F grid resolution.
On the other hand, the three methods have almost the same quality for sc01, sc02 and sc03, because in such signals the content of the low bands is very similar to that of the high bands, and consequently a coarse resolution is enough. In particular, in sc02, M2 is worse than the other two methods because the excessive cutting consumes too many bits. si02 is a critical signal that demonstrates the advantages of our design. In si02, there are many “attacks” in the time domain; therefore, the number and location of the time borders in the T/F grid are very important. Among the three coding methods, M1 does not have enough resolution, and M2 lacks precision in its time borders, whereas our proposed DP design handles this signal very well. si03 is a harmonic signal, and therefore M3 is worse than M2 because of the noise floor overflow caused by interpolation. To summarize, the proposed method performs well in subjective quality, consistent with the objective measurements.
7.5 Objective Quality Measurement with Existing Codecs
In this section, we compare NCTU HE-AAC with Coding Technologies 7.0.5 [12]