Chapter 3 T/F Stereo Parameter Extraction

3.4 T/F Stereo Parameter Extraction Summary

The preceding sections reviewed the existing method, which uses fixed stereo parameter sets, and proposed an adaptive T/F stereo parameter extraction that overcomes its shortcomings. The adaptive method has already been published in [19]. Chapter 5 presents quality measurements that assess the improvement it brings.

Chapter 4 Downmix Method

4.1 Downmix Method Concept

The purpose of PS coding is to represent a stereo signal as a monaural signal plus a small set of stereo parameters, which greatly reduces the bit rate.

Moreover, at such low bit rates only a few bits are spent on the PS parameters; most of the bits are spent on encoding the monaural signal in order to maintain a certain coding quality.

Therefore, the monaural downmix signal is the essential source of quality for the reconstructed stereo signal. The design issue of the downmix method is how to preserve the information of the original stereo signal. Furthermore, the mixing procedure described in Chapter 2 imposes some restrictions on the downmix signal, and the downmix method must also respect them. The following sections first discuss the averaging approach and then propose an approach that avoids its problems.

4.2 Averaging Approach

In the PS draft [10], the monaural downmix signal is generated according to

$$M = \frac{L + R}{2}. \qquad (14)$$

This approach simply takes the average of the two input channels. As a result, energy can cancel in the downmix signal, as illustrated in Figure 17. If the two channels are in anti-phase and have the same magnitude, the monaural downmix signal is cancelled completely. After such cancellation the PS tool cannot reconstruct the original signal; in other words, the stereo information carried by the two channels may be lost. The cancellation also violates the restriction of the mixing procedure, which requires the energy of the downmix signal to equal the average energy of the stereo signal so that energy is conserved.

Figure 17: Energy cancellation problem in averaging approach

The method adopted in 3GPP [13] is based on the averaging approach and adds a post-processing step to meet the requirement of the mixing procedure: an energy-adjusting scale factor is applied to the downmix signal. This method appears to satisfy the mixing restriction, but it cannot restore the stereo information that has already disappeared from the average signal. Moreover, scaling does not repair a spectral structure damaged by cancellation. The averaging approach is therefore easy to implement, but because it does not adapt to the signal content it can destroy the signal structure.
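The cancellation problem and the limit of energy scaling can be reproduced numerically. The following is a minimal Python sketch, not the 3GPP implementation: the 32-sample frame, the test tones, and the gain rule in `energy_scaled_downmix` (restore the average channel energy) are assumptions chosen for illustration.

```python
import math

def average_downmix(left, right):
    """Equation (14): sample-wise average of the two channels."""
    return [(l + r) / 2.0 for l, r in zip(left, right)]

def energy(x):
    """Sum of squared samples."""
    return sum(v * v for v in x)

def energy_scaled_downmix(left, right, eps=1e-12):
    """Averaging followed by a gain that targets the average channel energy.

    The gain rule is an illustrative assumption, not the 3GPP formula.
    """
    mono = average_downmix(left, right)
    target = 0.5 * (energy(left) + energy(right))
    gain = math.sqrt(target / max(energy(mono), eps))
    return [gain * m for m in mono]

n = 32  # assumed frame length
tone = [math.sin(2 * math.pi * 3 * k / n) for k in range(n)]
anti = [-v for v in tone]  # anti-phase, same magnitude
partly = [math.sin(2 * math.pi * 3 * k / n + 2.8) for k in range(n)]  # nearly anti-phase

cancelled = average_downmix(tone, anti)
print(energy(cancelled))                                   # 0.0
print(max(abs(v) for v in energy_scaled_downmix(tone, anti)))  # 0.0
print(energy(energy_scaled_downmix(tone, partly)))         # close to the target energy
```

For exactly anti-phase input the scaled downmix is still all zeros, since no gain can rescale a signal that has vanished. For nearly anti-phase input the target energy is reached, but only by amplifying whatever survived the cancellation.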

4.3 Downmix Method based on Karhunen-Loève Transform

As mentioned above, the design issue of the downmix method is to preserve as much information of the original stereo signal as possible. The following sections introduce an approach based on the Karhunen-Loève Transform.

4.3.1 KLT-based Approach

Our goal here is to generate features through a linear transform of the stereo signal. The basic concept is to transform a given set of samples into a new set of features. The Karhunen-Loève Transform (KLT) [20] is such an approach. Its transform-domain features exhibit high information-packing properties: most of the classification-related information is concentrated in a relatively small number of features, which reduces the required dimension of the feature space.

To use the KL transform as the downmix coefficient vector, we first define the following symbols:

U (2×32): sample matrix containing the samples of two corresponding subbands
V (2×32): transformed matrix
R_U, R_V: auto-correlation matrices of the sample matrix and the transformed matrix
Φ (2×2): KL transform matrix

With these definitions, the transform can be written as

$$V = \Phi^{T} U. \qquad (15)$$

The KL transform matrix Φ consists of the orthonormal eigenvectors of R_U. After the KL transform, R_V is diagonal, that is, the rows of V are uncorrelated. Because the transform is used for downmixing, the resulting sample set is the row of V associated with the eigenvector of the larger eigenvalue. In other words, the eigenvector with the larger eigenvalue is the downmix coefficient vector that generates the monaural signal.

The KLT is therefore the optimal transform in terms of energy compaction, because its basis vectors are orthogonal and its outputs are uncorrelated. Figure 18 gives a simple view of the KLT flow. It first builds the auto-correlation matrix from the two corresponding subbands and then calculates the desired eigenvector. The eigenvector is delivered to the downmix method to generate the monaural signal. The block diagram of the PS encoder with the KLT module is shown in Figure 19.
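The flow of building the auto-correlation matrix, picking the dominant eigenvector, and downmixing can be sketched in a few lines of Python. This is a minimal sketch under assumptions: the subband samples are taken as real-valued, the function names are hypothetical, and the eigenvector of the 2×2 symmetric matrix is obtained from its closed-form principal-axis angle rather than from any routine named in the thesis.

```python
import math

def autocorrelation(left, right):
    """Entries (a, b, c) of the symmetric 2x2 matrix R_U = U U^T / N."""
    n = len(left)
    a = sum(l * l for l in left) / n
    b = sum(l * r for l, r in zip(left, right)) / n
    c = sum(r * r for r in right) / n
    return a, b, c

def dominant_eigenvector(a, b, c):
    """Unit eigenvector of [[a, b], [b, c]] for the larger eigenvalue."""
    theta = 0.5 * math.atan2(2.0 * b, a - c)  # principal-axis angle
    return math.cos(theta), math.sin(theta)

def klt_downmix(left, right):
    """Project both channels onto the dominant eigenvector of R_U."""
    w_l, w_r = dominant_eigenvector(*autocorrelation(left, right))
    mono = [w_l * l + w_r * r for l, r in zip(left, right)]
    return (w_l, w_r), mono

def energy(x):
    return sum(v * v for v in x)

n = 32  # two subbands of 32 samples each, as in the 2x32 sample matrix U
tone = [math.sin(2 * math.pi * 3 * k / n) for k in range(n)]
anti = [-v for v in tone]  # the case that cancels under averaging

weights, mono = klt_downmix(tone, anti)
print(weights)                                    # equal magnitude, opposite sign
print(energy(mono), energy(tone) + energy(anti))  # downmix keeps the total energy
```

For the anti-phase pair that the averaging approach cancels completely, the dominant eigenvector has weights of opposite sign, so the two channels add constructively and the downmix carries the full energy of both.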

Figure 18: Flow chart of Karhunen-Loève Transform (blocks: Begin, Build Auto-correlation Matrix, Eigenvector Calculation, Target Eigenvector Calculation & Delivery, Done)

Figure 19: Block diagram of the PS encoder including KLT module

4.3.2 Artifacts under KLT-based Approach

In exchange for energy compaction, the KLT method carries more risk than the simple averaging method. Two properties are the main causes of its unwanted artifacts: the weaker signal component is usually discarded, and the signal combination coefficients are adaptive. The following subsections focus on two critical artifacts, named the “tone leakage” effect and the “tone modulation” effect.

4.3.2.1 Tone Leakage Effect

Two types of “tone leakage” effect are defined here to describe two different phenomena caused by the downmixing process. The type-I tone leakage effect is the phenomenon in which a tone in one channel leaks into the other channel after the upmixing procedure. It is an inherent artifact of downmix coding, so both the averaging and the KLT methods unavoidably suffer from it. The type-II tone leakage effect is the situation in which a tone disappears entirely or almost entirely from both channels. Both methods can produce this artifact, though for different reasons; however, it is usually inseparable from the KLT method. To illustrate the tone leakage effect, a simple example is introduced and the severity of the result is compared between the two methods.

Figure 20: Simple example for tone leakage

In Figure 20, the left and right channels each contain one tone, at different frequencies, and the magnitudes of the two tones differ by 12 dB. With the averaging method, the monaural signal retains both tones and only decreases their magnitudes. As shown in Figure 21, the type-I tone leakage effect occurs for both tones: although each channel keeps its own tone component, the tone of the other channel is also introduced into it after the upmixing procedure.

Figure 21: Reconstructed signal under averaging method for tone leakage

In contrast, because the KLT method maximizes the variance retained after dimension reduction, the transform may favor one channel over the other. In particular, when the energies of the two channels differ greatly, the coefficient vector tends to keep the dominant channel for the sake of energy compaction. In other words, the weaker channel is inevitably sacrificed and loses its spectral structure in the extracted downmix signal. Therefore, as shown in Figure 22, the reconstructed stereo signal keeps only the dominant tone, and the weaker tone is suppressed until it nearly disappears. This is an example of the type-II tone leakage effect for the KLT method.

Figure 22: Reconstructed signal under KLT method for tone leakage

In conclusion, both the averaging method and the KLT method can exhibit the type-I and type-II tone leakage effects. Because any component of the monaural signal is necessarily reproduced in both channels of the reconstructed stereo signal, any downmixing method without additional auxiliary information suffers from the type-I tone leakage effect. On the other hand, although the KLT method preserves the dominant tone better than the averaging method, the weaker channel is sacrificed because of the biased combination ratio.

Unlike in the KLT method, the type-II tone leakage effect occurs infrequently in the averaging method; it appears only when the two tones cancel each other exactly or nearly so.

However, the energy loss of the averaging method greatly degrades its quality. The conflict between preserving the stereo image and compacting the energy is the main design issue, and the main compromise, of the downmixing policy.

Therefore, the KLT method needs some amendments to avoid the type-II tone leakage effect.
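The type-II tone leakage of the KLT method can be illustrated with the two-tone example of Figure 20. The following Python sketch is only an illustration under assumptions: the frame length of 32 samples and the tone frequencies are chosen so that the two tones are orthogonal within the frame, the signals are real-valued, and `klt_weights` is a hypothetical helper that returns the dominant eigenvector of the 2×2 auto-correlation matrix.

```python
import math

def klt_weights(left, right):
    """Dominant unit eigenvector of the 2x2 auto-correlation matrix."""
    a = sum(l * l for l in left)
    b = sum(l * r for l, r in zip(left, right))
    c = sum(r * r for r in right)
    theta = 0.5 * math.atan2(2.0 * b, a - c)
    return math.cos(theta), math.sin(theta)

n = 32                    # assumed frame length
gain = 10 ** (-12 / 20)   # the weaker tone is 12 dB below the stronger one
strong = [math.sin(2 * math.pi * 3 * k / n) for k in range(n)]       # left channel
weak = [gain * math.sin(2 * math.pi * 7 * k / n) for k in range(n)]  # right channel

w_strong, w_weak = klt_weights(strong, weak)
avg_strong, avg_weak = 0.5, 0.5  # fixed weights of the averaging approach

print("KLT weights      :", w_strong, w_weak)
print("averaging weights:", avg_strong, avg_weak)
```

Because the two tones are uncorrelated and differ in energy, the dominant eigenvector points almost entirely at the stronger channel, so the weaker tone receives a weight close to zero and is missing from the downmix. The averaging approach keeps both tones at half magnitude instead.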

4.3.2.2 Tone Modulation Effect

Unlike the fixed combination coefficients of the averaging method, the coefficient vectors of the KLT must adapt frame by frame to preserve the energy optimally. However, the adaptation causes discontinuities between adjacent spectra of the monaural signal and produces an annoying effect that sounds like a “click”.

Figure 23: Example of tone modulation

Figure 23 illustrates a series of reconstructed spectra under the KLT method, covering five successive frames. The red line indicates the original spectrum and the blue line the reconstructed spectrum. An unusual phenomenon, named the “tone modulation” effect, is visible: the shape of the tone expands and contracts over time. To understand the cause analytically, the downmix subband signal is represented as a linear combination of the left and right subband signals as

$$d[n] = \lambda_1[n]\exp(i\theta_1[n])\,l[n] + \lambda_2[n]\exp(i\theta_2[n])\,r[n], \qquad (16)$$

where λ_k[n] exp(iθ_k[n]), k = 1, 2, is the polar form of the combination coefficient, and l[n] and r[n] are the left and right subband signals respectively. The multiplier λ_k[n] exp(iθ_k[n]) modulates both the amplitude and the phase.

For example, consider a sinusoidal signal

$$s[n] = A\exp\big(i(\omega n + \Theta)\big), \qquad (17)$$

and the modulated signal

$$\hat{s}[n] = \big(A \cdot \lambda[n]\big)\exp\big(i(\omega n + \Theta + \theta[n])\big). \qquad (18)$$

The multiplier λ[n] exp(iθ[n]) can be viewed as a step function of the time index n: it is constant within each frame but may jump sharply at the frame boundaries.

Assume s[n] is a tone component coupled into the monaural channel. Its amplitude and phase then change abruptly from frame to frame, which also disturbs its instantaneous frequency at the frame boundaries, and hence its spectral structure shows the modulation effect. In other words, the downmixing procedure of the KLT method is equivalent to combining the two signals with mixed modulation in both amplitude and frequency, which results in the annoying “tone modulation” effect.
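The effect of a frame-wise constant multiplier on a tone can be checked with a small numerical experiment. The Python sketch below follows the form of equations (17) and (18); the frame length, the tone frequency, and the per-frame multiplier values are hypothetical and chosen only to make the jump visible.

```python
import cmath

FRAME = 32  # assumed frame length in subband samples

def tone(n, amp=1.0, omega=0.2, phase=0.0):
    """Equation (17): complex sinusoid A exp(i(omega n + Theta))."""
    return amp * cmath.exp(1j * (omega * n + phase))

def modulated(n, multipliers):
    """Equation (18): the tone times a multiplier that is constant per frame."""
    lam, theta = multipliers[n // FRAME]
    return lam * cmath.exp(1j * theta) * tone(n)

# Hypothetical (lambda, theta) pairs for two successive frames.
multipliers = [(1.0, 0.0), (0.6, 1.2)]

# Magnitude of the sample-to-sample change; steps[i] belongs to n = i + 1.
steps = [abs(modulated(n, multipliers) - modulated(n - 1, multipliers))
         for n in range(1, 2 * FRAME)]

inside = steps[0]             # a step inside the first frame
boundary = steps[FRAME - 1]   # the step across the frame boundary
print(inside, boundary)
```

Inside a frame the tone advances by the same small step at every sample, while the step across the boundary is several times larger. This abrupt change is what is heard as a click and seen as the expanding and contracting tone shape.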

4.3.2.3 Pre- and Post-processing for KLT-based Approach

The previous subsections described the risks of the KLT method. However, the energy cancellation of the simple averaging method greatly degrades the quality. Consequently, the main issue to be solved is how to keep the advantage of the KLT method while reducing its unwanted artifacts.

Energy normalization during pre-processing:

Energy normalization is a pre-processing step for the KLT method. As discussed for the tone leakage effect, the KLT method usually discards the weaker signal component entirely or almost entirely, so it should be revised to avoid such imbalance. To prevent the KLT method from neglecting the weaker signal components, each channel scales its samples according to its energy before the transform. This gives the two channels equal priority in the transform calculation, so that the coefficient weights of the two channels are decided by their signal variation rather than by their energy difference.
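A small numerical sketch can show what the normalization changes. The exact scaling rule is not given in this section, so the Python code below assumes that each channel is scaled to unit energy before the auto-correlation matrix is built; the helper names and test signals are hypothetical, and the sketch stops at the coefficient vector because the mapping back to the unnormalized samples is not specified here.

```python
import math

def klt_weights(left, right):
    """Dominant unit eigenvector of the 2x2 auto-correlation matrix."""
    a = sum(l * l for l in left)
    b = sum(l * r for l, r in zip(left, right))
    c = sum(r * r for r in right)
    theta = 0.5 * math.atan2(2.0 * b, a - c)
    return math.cos(theta), math.sin(theta)

def normalize_energy(x, eps=1e-12):
    """Scale a channel to unit energy (assumed normalization rule)."""
    scale = math.sqrt(max(sum(v * v for v in x), eps))
    return [v / scale for v in x]

n = 32
s1 = [math.sin(2 * math.pi * 3 * k / n) for k in range(n)]
s2 = [math.sin(2 * math.pi * 7 * k / n) for k in range(n)]
left = s1                                                    # dominant channel
right = [0.25 * (0.6 * u + 0.8 * v) for u, v in zip(s1, s2)]  # 12 dB weaker, partly correlated

raw = klt_weights(left, right)
normalized = klt_weights(normalize_energy(left), normalize_energy(right))
print("without normalization:", raw)
print("with normalization   :", normalized)
```

Without normalization the weaker channel receives only a small weight. After normalization the two diagonal entries of the auto-correlation matrix are equal, so the weights have equal magnitude and their sign follows the correlation between the channels.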

Coefficient vector smoothing during post-processing:

Any adaptive channel-coupling mechanism, such as the KLT method, causes spectral discontinuity problems such as the tone modulation effect. An enhancement is to smooth the coupling coefficients so as to avoid transient spectral discontinuities in the monaural signal. Similar to the PSOLA method commonly used for waveform synthesis in speech processing, the coefficients of adjacent frames can be joined by a smoothing function. The smoothing methods for the KLT method are introduced below as a post-processing step.

Figure 24: The smooth methods of KLT coefficient vector (frames F_i, F_i+1, F_i+2; two-way method: needs the previous and next transform vectors; backward method: needs only the previous transform vector)

Figure 24 shows the two smoothing methods, the “two-way method” and the “backward method”. The two-way method refers to the previous and next coefficient vectors, whereas the backward method refers only to the previous one. The thesis suggests using the backward method, because the two-way method must look ahead to the next frame, which makes it inconvenient to implement. In addition, the smoothing length is the same in both methods. Thus, the backward method can handle the smoothing procedure and is easy to implement.

The shape of the smoothing curve is another issue. The simplest idea is linear smoothing, but a linear ramp has a discontinuity in slope at its beginning and end points. Figure 25 shows this problem, where λ_i and λ_{i+1} denote the previous and current coefficient vectors and the smoothing area extends from time index 0 to k.

Figure 25: Linear smoothness of coefficient vectors (horizontal axis: time index, starting at 0; reference curve labelled “No smooth”)

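Because of this disadvantage, a cosine curve gives a smoother transition. Let the continuous smoothing function Ψ(x) be

$$\Psi(x) = A\cos\!\left(\frac{\pi x}{k}\right) + B. \qquad (19)$$

A and B are determined by the following equations, where γ_i and γ_{i+1} denote the coefficients of the previous and current frames. Ψ[n] must satisfy the restriction

$$\Psi[0] = \gamma_i, \qquad \Psi[k] = \gamma_{i+1}, \qquad (20)$$

which gives a continuous connection at both the beginning and end points. By substitution, the equations become

$$\begin{cases} A + B = \gamma_i \\ -A + B = \gamma_{i+1} \end{cases} \qquad (21)$$

so that

$$A = \frac{\gamma_i - \gamma_{i+1}}{2}, \qquad B = \frac{\gamma_i + \gamma_{i+1}}{2}. \qquad (22)$$

Finally, Ψ[n] is defined as

$$\Psi[n] = \frac{\gamma_i - \gamma_{i+1}}{2}\cos\!\left(\frac{\pi n}{k}\right) + \frac{\gamma_i + \gamma_{i+1}}{2}. \qquad (23)$$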

The cosine smoothing provides a smoother curve, as shown in Figure 26.

Figure 26: Cosine smoothness of coefficient vector (horizontal axis: time index, from frame F_i through 0 and k into frame F_i+1; vertical axis: γ; curve: Ψ[n])

As a result, the coefficient for the frame F_{i+1} becomes a continuous function:

$$\hat{\gamma}_{i+1}[n] = \begin{cases} \Psi[n], & \forall\, 0 \le n \le k \\ \gamma_{i+1}, & \forall\, n > k \end{cases} \qquad (24)$$

According to the subjective test, the smoothing process largely suppresses the “click” noise caused by the discontinuity, and the quality is improved.
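The backward cosine smoothing can be sketched in Python as follows. The sketch treats one scalar component of the coefficient vector at a time, which is an assumption, and the values of `k`, the frame length, and the two coefficients are hypothetical. A linear ramp is included only for comparison.

```python
import math

def cosine_transition(prev, curr, k):
    """Psi[n] for n = 0..k: cosine curve from prev (n = 0) to curr (n = k)."""
    half_diff = (prev - curr) / 2.0
    half_sum = (prev + curr) / 2.0
    return [half_diff * math.cos(math.pi * n / k) + half_sum for n in range(k + 1)]

def linear_transition(prev, curr, k):
    """Straight ramp from prev to curr, for comparison only."""
    return [prev + (curr - prev) * n / k for n in range(k + 1)]

def smoothed_coefficient(prev, curr, k, frame_len):
    """Psi[n] inside the smoothing area, the current coefficient afterwards."""
    ramp = cosine_transition(prev, curr, k)
    return [ramp[n] if n <= k else curr for n in range(frame_len)]

prev, curr, k, frame_len = 0.2, 0.9, 8, 32  # hypothetical values
coeff = smoothed_coefficient(prev, curr, k, frame_len)
cos_ramp = cosine_transition(prev, curr, k)
lin_ramp = linear_transition(prev, curr, k)

print(coeff[0], coeff[k], coeff[-1])   # starts at prev, reaches curr at n = k
print(abs(cos_ramp[1] - cos_ramp[0]))  # first step of the cosine curve
print(abs(lin_ramp[1] - lin_ramp[0]))  # first step of the linear ramp
```

The cosine curve leaves the previous coefficient and arrives at the current one with a much smaller first step than the linear ramp, so the slope changes gradually at both ends of the smoothing area.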

4.4 Downmix Method Summary

This chapter has discussed the downmix procedure in PS coding and proposed a downmix method based on the Karhunen-Loève Transform that naturally avoids the weakness of the averaging approach. The chapter has also considered the perceptual artifacts that arise in PS coding, referred to as the tone leakage effect and the tone modulation effect, and has suggested pre- and post-processing for the KLT to reduce them. Figure 27 illustrates the block diagram of the KLT-based downmix approach.

Figure 27: Block diagram of the PS encoder with KLT-based downmix method

Chapter 5 Experiments

In this chapter, a large number of tracks are used to verify the proposed approaches. The tracks come from the MPEG test tracks and from the music database collected in our lab. The experiments include both objective and subjective quality measurements.

5.1 Experiment Environment

Computer Status:

Platform          Personal Computer
Operating System  Windows XP
CPU               Intel Pentium 4 2.4 GHz
Memory            256 MB DDR400 × 2
Motherboard       ASUS P4P800
Sound Card        ADI AD1985 AC'97
Headphone         ALESSANDRO MUSIC SERIES PRO

Objective Quality Measurement Tool:

For objective quality evaluation, the thesis mainly adopts the PEAQ (perceptual evaluation of audio quality) system [21], which is the system recommended by ITU-R Task Group 10/4. The system includes an elaborate perceptual model to measure the difference between two tracks. The objective difference grade (ODG) is the output variable of the objective measurement method. ODG values range from 0 to −4, where 0 corresponds to an imperceptible impairment and −4 to an impairment judged as very annoying. An improvement of about 0.1 is usually perceptually audible. PEAQ has been widely used to evaluate compression techniques because it can detect the perceptual differences to which the human hearing system is sensitive.

Subjective Quality Measurement Tool:

For subjective quality evaluation, the thesis mainly adopts the MUSHRA system [22], which allows the blind comparison of multiple audio files. The multi-stimulus test with hidden reference and anchors (MUSHRA) was designed to give a reliable and repeatable measure of the audio quality of intermediate-quality signals.

MUSHRA has the advantage that it provides an absolute measure of the audio quality of a codec, which can be compared directly with the reference. MUSHRA follows the test method and impairment scale recommended by ITU-R BS.1116 [23].

5.2 Objective Quality Measurement in MPEG Test Tracks

The twelve test tracks recommended by MPEG are listed in Table 3. These tracks contain critical music material covering percussion, string, and wind instruments as well as human vocals. In this section, the quality enhancement of the proposed methods at different bit rates is verified on these MPEG test tracks, with NCTU-HEAAC [24] as the platform.

Table 3: The twelve tracks recommended by MPEG

No.  Track  Signal description          Mode    Time (sec)  Remark
1    es01   Vocal (Suzanne Vega)        stereo  10          (c)
2    es02   German speech               stereo  8           (c)
3    es03   English speech              stereo  7           (c)
4    sc01   Trumpet solo and orchestra  stereo  10          (b) (d)
5    sc02   Orchestral piece            stereo  12          (d)
6    sc03   Contemporary pop music      stereo  11          (d)
7    si01   Harpsichord                 stereo  7           (b)
8    si02   Castanets                   stereo  7           (a)
9    si03   Pitch pipe                  stereo  27          (b)
10   sm01   Bagpipes                    stereo  11          (b)
11   sm02   Glockenspiel                stereo  10          (a) (b)
12   sm03   Plucked strings             stereo  13          (a) (b)

Remarks:
(a) Transients: pre-echo sensitive, smearing of noise in temporal domain.
(b) Tonal/Harmonic structure: noise sensitive, roughness.
(d) Complex sound: stresses the device under test.

Table 4: Objective measurements through the ODGs for proposed methods at 48 kbps

Codec: NCTU-HEAAC
Bit Rate: 48 kbps

Tracks   M0       M1       M2       M3
es01     -1.54    -1.34    -1.52    -1.34
es02     -1.44    -1.43    -1.42    -1.41
es03     -1.63    -1.63    -1.63    -1.62
sc01     -3.37    -3.24    -3.29    -3.10
sc02     -3.12    -2.92    -3.06    -2.90
sc03     -2.58    -2.35    -2.59    -2.42
si01     -2.74    -2.56    -2.75    -2.58
si02     -2.51    -2.33    -2.45    -2.31
si03     -1.74    -1.66    -1.92    -1.69
sm01     -2.87    -2.66    -2.92    -2.70
sm02     -3.06    -3.06    -2.86    -2.76
sm03     -2.68    -2.46    -2.64    -2.43
Max      -1.44    -1.34    -1.42    -1.34
Min      -3.37    -3.24    -3.29    -3.10
Average  -2.4400  -2.3033  -2.4208  -2.2717

M0: Fixed stereo parameter sets with averaging downmix approach
M1: Adaptive T/F stereo parameter extraction with averaging downmix approach
M2: Fixed stereo parameter sets with KLT-based downmix approach
M3: Adaptive T/F stereo parameter extraction with KLT-based downmix approach

Figure 28: The variance in the ODGs of proposed methods at 48 kbps

Table 5: Objective measurements through the ODGs for proposed methods at 36 kbps

Codec: NCTU-HEAAC
Bit Rate: 36 kbps

Tracks   M0       M1       M2       M3
es01     -2.37    -2.18    -2.32    -2.16
es02     -2.71    -2.71    -2.72    -2.71
es03     -2.81    -2.82    -2.82    -2.83
sc01     -3.41    -3.29    -3.34    -3.22
sc02     -3.27    -3.04    -3.28    -3.09
sc03     -2.86    -2.67    -2.88    -2.70
si01     -2.90    -2.73    -3.14    -3.00
si02     -2.70    -2.58    -2.65    -2.57
si03     -2.31    -2.11    -2.60    -2.36
sm01     -3.14    -2.90    -3.32    -3.04
sm02     -3.29    -3.29    -3.20    -3.14
sm03     -2.77    -2.55    -2.77    -2.56
Max      -2.31    -2.11    -2.32    -2.16
Min      -3.41    -3.29    -3.34    -3.22
Average  -2.8783  -2.7392  -2.9200  -2.7817

M0: Fixed stereo parameter sets with averaging downmix approach
M1: Adaptive T/F stereo parameter extraction with averaging downmix approach
M2: Fixed stereo parameter sets with KLT-based downmix approach
M3: Adaptive T/F stereo parameter extraction with KLT-based downmix approach

Figure 29: The variance in the ODGs of proposed methods at 36 kbps

Table 6: Objective measurements through the ODGs for proposed methods at 24 kbps

Codec: NCTU-HEAAC
Bit Rate: 24 kbps

Tracks   M0       M1       M2       M3
es01     -3.43    -3.19    -3.45    -3.21
es02     -3.56    -3.47    -3.59    -3.47
es03     -3.74    -3.68    -3.73    -3.70
sc01     -3.48    -3.40    -3.43    -3.32
sc02     -3.30    -3.26    -3.28    -3.26
sc03     -3.17    -3.02    -3.28    -3.08
si01     -3.36    -3.00    -3.56    -3.26
si02     -3.35    -3.26    -3.30    -3.24
si03     -3.64    -3.03    -3.79    -3.38
sm01     -3.79    -3.46    -3.84    -3.62
sm02     -3.63    -3.59    -3.66    -3.65
sm03     -3.10    -2.86    -3.12    -2.89
Max      -3.10    -2.86    -3.12    -2.89
Min      -3.79    -3.68    -3.84    -3.70
Average  -3.4625  -3.2683  -3.5025  -3.3400

M0: Fixed stereo parameter sets with averaging downmix approach
M1: Adaptive T/F stereo parameter extraction with averaging downmix approach
M2: Fixed stereo parameter sets with KLT-based downmix approach
M3: Adaptive T/F stereo parameter extraction with KLT-based downmix approach

Figure 30: The variance in the ODGs of proposed methods at 24 kbps

The proposed methods are verified at the target bit rates of PS coding.

At each bit rate, the thesis compares the quality of four conditions: fixed stereo parameter sets with the averaging downmix approach (M0), adaptive T/F stereo parameter extraction with the averaging downmix approach (M1), fixed stereo parameter sets with the KLT-based downmix approach (M2), and adaptive T/F stereo parameter extraction with the KLT-based downmix approach (M3).
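The contribution of each technique can be separated by comparing the average ODGs of Tables 4 to 6 pairwise. The short Python sketch below does this arithmetic; the numbers are the Average rows of the three tables, while the function names are only illustrative.

```python
# Average ODG per bit rate (kbps), copied from the Average rows of Tables 4-6.
averages = {
    48: {"M0": -2.4400, "M1": -2.3033, "M2": -2.4208, "M3": -2.2717},
    36: {"M0": -2.8783, "M1": -2.7392, "M2": -2.9200, "M3": -2.7817},
    24: {"M0": -3.4625, "M1": -3.2683, "M2": -3.5025, "M3": -3.3400},
}

def klt_gain(row):
    """ODG change from replacing averaging with KLT (fixed, adaptive parameters)."""
    return row["M2"] - row["M0"], row["M3"] - row["M1"]

def adaptive_gain(row):
    """ODG change from adaptive T/F extraction (averaging, KLT downmix)."""
    return row["M1"] - row["M0"], row["M3"] - row["M2"]

for rate, row in averages.items():
    klt = tuple(round(g, 4) for g in klt_gain(row))
    adaptive = tuple(round(g, 4) for g in adaptive_gain(row))
    print(f"{rate} kbps: KLT gain {klt}, adaptive gain {adaptive}")
```

A positive value means a higher (better) ODG. The adaptive extraction gains are positive at all three bit rates, whereas the KLT gains are positive only at 48 kbps and negative at 36 and 24 kbps.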

As the results show, the adaptive T/F stereo parameter extraction improves the average audio quality at every tested bit rate. However, the KLT-based downmix approach improves the average quality only at 48 kbps; at the lower bit rates its benefit is not confirmed. From discussion of PS coding, the monaural downmix signal
