MPEG Surround - Design of an MPEG Surround Downmixer and Decorrelator Based on Gram-Schmidt Orthogonalization

Chapter 2 Background

2.2 MPEG Surround

Though Figure 1 and Figure 2 show only one of the coding tree structures provided by MPEG Surround, the components inside all tree structures are the same. Each component is therefore introduced below as background knowledge.

Time/Frequency Transformation

The applied filterbank is a hybrid complex‐modulated quadrature mirror filterbank (QMF). As shown in Figure 1 and Figure 2, the input signals of the encoder and decoder are passed into the time/frequency analysis filterbank, which is composed of a QMF analysis filterbank, subfilters, and a delay, as shown in the left panel of Figure 4. The signals are first fed into the QMF analysis filterbank. Since every subband of the QMF analysis filterbank has the same bandwidth, its frequency resolution cannot match that of human hearing, which requires higher resolution at lower frequencies.

Therefore, subfilters are used to provide non‐uniform resolution at low frequencies, while the high‐frequency part is delayed by the delay module so that it stays time‐aligned with the subfiltered bands. In Figure 4, the input signal X is fed into the QMF analysis filterbank and converted into 64 QMF subband signals, Xqi with 0 ≤ i < 64.

Then the 64 QMF subbands are split into 71 hybrid subbands, Xmj with 0 ≤ j < 71.

When the spatial parameters are calculated at the mixing boxes, the 71 hybrid subbands are grouped into at most 28 parameter bands, which are the basic unit of MPS. The inverse time/frequency transformation is composed of the sum modules and a QMF synthesis filterbank, shown in the right panel of Figure 4. The sum modules and the QMF synthesis filterbank correspond to the subfilters and the QMF analysis filterbank, respectively. Since the filterbank has the same structure as the one applied in Parametric Stereo, detailed information can be found in [2]‐[4].

Figure 4 Structure of the time‐frequency transformation and its inverse

Framing

The time resolution is subject to some restrictions. One frame contains 32 time slots and at most 8 parameter sets, which are the time slots that carry the calculated spatial parameters. With variable framing, the parameter sets can be placed at any time slot; with fixed framing, they are placed at equally spaced time slots. Whichever framing is used, at least one parameter set is placed in the last time slot of a frame. For the time slots that carry no spatial parameters, the Ra matrix coefficients are obtained by interpolation; the Ra matrix is introduced in the next subsection.

Figure 5 illustrates an example frame with two parameter sets for one parameter band. Each column indicates a time slot; the columns with a gray background are the time slots with spatial parameters; the red line indicates the matrix coefficient value; the bold solid line indicates the frame borders; and the blue line illustrates the interpolation of the matrix coefficients.
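The interpolation between parameter sets can be sketched as follows. This is a minimal sketch that assumes linear interpolation between the coefficient held at the last slot of the previous frame and the coefficients at the parameter-set slots; the function name, its arguments, and the example values are hypothetical and not taken from the standard.

```python
def interpolate_coefficients(prev_value, param_slots, param_values, num_slots=32):
    """Linearly interpolate one matrix coefficient over the time slots of a frame.

    prev_value   -- coefficient at the last slot of the previous frame
    param_slots  -- increasing slot indices that carry a parameter set
    param_values -- coefficient value at each of those slots
    """
    if len(param_slots) > 8:
        raise ValueError("a frame holds at most 8 parameter sets")
    if param_slots[-1] != num_slots - 1:
        raise ValueError("the last time slot of a frame must carry a parameter set")

    out = []
    # Slot -1 stands for the last slot of the previous frame.
    left_slot, left_value = -1, prev_value
    for slot, value in zip(param_slots, param_values):
        span = slot - left_slot
        for t in range(left_slot + 1, slot + 1):
            w = (t - left_slot) / span          # 0 < w <= 1
            out.append((1.0 - w) * left_value + w * value)
        left_slot, left_value = slot, value
    return out


# Two parameter sets in one frame, as in the Figure 5 example.
coeffs = interpolate_coefficients(0.0, [15, 31], [1.6, 0.0])
```

Because the last slot of every frame carries a parameter set, the interpolation never needs to look ahead into the next frame; only the previous frame's final coefficient has to be kept.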

Figure 5 An MPS frame with two parameter sets of a parameter band

Elementary Building Blocks

The common coding blocks for MPEG Surround are One‐To‐Two (OTT) and Two‐To‐Three (TTT) at the decoder side, and the corresponding blocks used on the encoder side are Two‐To‐One (TTO) and Reverse TTT (R‐TTT).

Encoder

TTO converts a stereo input signal to a mono signal and extracts the spatial parameters that describe the relation between the two input signals. Since [1] does not specify in detail how downmixing is realized in the encoder, we refer to the MPS reference software provided by MPEG [5], in which the downmix is realized by the direct summation of the input signals as

$$Y_m[n] = X_{1,m}[n] + X_{2,m}[n], \qquad (1)$$

where $X_{1,m}$ and $X_{2,m}$ denote the input signals on the hybrid subband m; $Y_m$ denotes the output signal; and n represents the time slot index.

In the appendix of [1], the power ratio of corresponding time/frequency tiles of the input signals, denoted as “Channel Level Difference” or CLD, is defined as

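$$\mathrm{CLD}_b = 10\log_{10}\left(\frac{\sum_{n}\sum_{m=m_b}^{m_{b+1}-1} X_{1,m}[n]\,X_{1,m}^{*}[n]}{\sum_{n}\sum_{m=m_b}^{m_{b+1}-1} X_{2,m}[n]\,X_{2,m}^{*}[n]}\right),$$

where $m_b$ is the hybrid subband boundary of the b-th parameter band, so that the band spans the hybrid subbands $m_b \le m < m_{b+1}$.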

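Also, a similarity measure of the corresponding time/frequency tiles of the input signals, denoted as “Inter‐Channel Correlation” or ICC, is given by the normalized cross‐correlation

$$\mathrm{ICC}_b = \mathrm{Re}\left\{\frac{\sum_{n}\sum_{m=m_b}^{m_{b+1}-1} X_{1,m}[n]\,X_{2,m}^{*}[n]}{\sqrt{\sum_{n}\sum_{m=m_b}^{m_{b+1}-1} X_{1,m}[n]\,X_{1,m}^{*}[n]\;\sum_{n}\sum_{m=m_b}^{m_{b+1}-1} X_{2,m}[n]\,X_{2,m}^{*}[n]}}\right\}.$$

On the decoder side, the CLD and ICC parameters determine the Ra upmix matrix.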
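The encoder-side quantities described above can be sketched in a few lines of Python. This is a minimal sketch, not the reference software: it assumes the commonly cited definitions (CLD as the power ratio in dB, ICC as the real part of the normalized cross-correlation), and the function names, the flat-list data layout, and the `eps` guard are hypothetical.

```python
import math


def tto_downmix(x1, x2):
    """Direct-sum downmix of two hybrid subband signals (lists of complex samples)."""
    return [a + b for a, b in zip(x1, x2)]


def cld_icc(x1_band, x2_band, eps=1e-12):
    """Return (CLD in dB, ICC) for one parameter band.

    x1_band, x2_band -- flat lists with every complex sample of the band's
                        time/frequency tile, in the same order for both inputs
    eps              -- hypothetical guard against division by zero
    """
    p1 = sum(abs(a) ** 2 for a in x1_band)            # power of input 1
    p2 = sum(abs(b) ** 2 for b in x2_band)            # power of input 2
    cross = sum(a * b.conjugate() for a, b in zip(x1_band, x2_band))
    cld = 10.0 * math.log10((p1 + eps) / (p2 + eps))
    icc = cross.real / math.sqrt((p1 + eps) * (p2 + eps))
    return cld, icc


# Input 1 is a scaled copy of input 2: the level differs, the correlation is full.
x1 = [2 + 0j, 0 + 2j]
x2 = [1 + 0j, 0 + 1j]
cld, icc = cld_icc(x1, x2)
mono = tto_downmix(x1, x2)
```

In this example the first input carries four times the power of the second, so the CLD is about 6 dB, and the two inputs are fully coherent, so the ICC is close to 1.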

In the literature, there are many decorrelation methods [6]‐[12]. In [7], decorrelation is based on simple nonlinear functions, which are computationally cheap. [8] uses interleaving comb filters and frequency shifts to realize decorrelation. [9] and [10] use time‐varying all‐pass filters. Boueri et al. [11] propose a random time‐shifting method on critical bands to lower the cross‐correlation. A Karhunen–Loève transform based method is also mentioned in [12]. Although there are many approaches to decorrelation, only the approach used for the MPS decorrelator is discussed in this thesis.

Decorrelators are used to generate an uncorrelated signal from the input to simulate the missing channel information for the upmix matrix operations. Each decorrelator comprises a delay, a lattice all‐pass filter, and an energy adjustment stage, as shown in Figure 6. The configurations of the delay and the all‐pass filter are controlled by the decorrelator configuration transmitted from the encoder. In MPEG Surround, the OTT box and the prediction‐mode TTT box each contain one decorrelator.

Figure 6 Diagram of the original decorrelator on hybrid QMF domain signals

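The original decorrelator operates as follows. The delayed hybrid subband domain samples $D_m^{\mathrm{delay}}[n]$ are obtained as

$$D_m^{\mathrm{delay}}[n] = M_m[n - d_m], \qquad 0 \le n < 32,$$

where $d_m$ denotes the delay, in time slots, configured for hybrid subband m, and $M_m[n]$ for n < 0 contains the buffered values of the last frame at position 32 + n. Then the filtered hybrid subband domain samples $D_m^{\mathrm{filt}}[n]$ are obtained as

$$D_m^{\mathrm{filt}}[n] = \frac{1}{a_m[0]}\left(\sum_{i=0}^{S} b_m[i]\,D_m^{\mathrm{delay}}[n-i] - \sum_{i=1}^{S} a_m[i]\,D_m^{\mathrm{filt}}[n-i]\right),$$

where S is the length of the lattice coefficient vector s[n], and the filter coefficients $a_m[n]$ and $b_m[n]$ are derived from s[n] as defined in Section 6.6 of [1].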
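The delay and all-pass stages can be illustrated with a small Python sketch. It is only a sketch under stated assumptions: the conversion from lattice coefficients to direct-form coefficients uses the textbook step-up recursion for real coefficients, which may differ from the derivation in Section 6.6 of [1]; the filter starts from a zero state rather than carrying state across frames; and all function names and coefficient values are hypothetical.

```python
import cmath


def delay_line(frame, history, d):
    """Delay `frame` by d slots; `history` holds the previous frame's samples."""
    if d > len(history):
        raise ValueError("history is too short for the requested delay")
    past = list(history)[len(history) - d:]      # the d most recent old samples
    return (past + list(frame))[:len(frame)]


def lattice_to_direct(refl):
    """Step-up recursion: reflection coefficients -> denominator a[0..S], a[0] = 1."""
    a = [1.0]
    for k in refl:
        prev = a + [0.0]
        a = [prev[j] + k * prev[len(prev) - 1 - j] for j in range(len(prev))]
    return a


def allpass_filter(x, refl):
    """Direct-form IIR all-pass: the numerator is the reversed denominator."""
    a = lattice_to_direct(refl)
    b = a[::-1]
    order = len(a) - 1
    y = []
    for n in range(len(x)):
        acc = sum(b[i] * x[n - i] for i in range(order + 1) if n - i >= 0)
        acc -= sum(a[i] * y[n - i] for i in range(1, order + 1) if n - i >= 0)
        y.append(acc / a[0])
    return y


def magnitude_response(refl, omega):
    """|H(e^{j omega})| of the all-pass filter; flat for any valid coefficients."""
    a = lattice_to_direct(refl)
    b = a[::-1]
    z_inv = cmath.exp(-1j * omega)
    num = sum(c * z_inv ** i for i, c in enumerate(b))
    den = sum(c * z_inv ** i for i, c in enumerate(a))
    return abs(num / den)


refl = [0.5, -0.3, 0.2]                          # hypothetical lattice coefficients
delayed = delay_line([1, 2, 3, 4], history=[9, 8], d=2)
impulse_response = allpass_filter([1.0] + [0.0] * 499, refl)
```

The flat magnitude response is what makes this structure attractive for decorrelation: the filter scrambles phase without coloring the spectrum, so the energy adjustment stage only has to handle temporal effects.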

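After the $D_m^{\mathrm{filt}}[n]$ have been obtained for 0 ≤ n < 32, the following energy adjustment procedure is applied. The powers per parameter band k of the input samples, $E_k^{M}[n]$, and of the filtered samples, $E_k^{D}[n]$, are calculated and then smoothed over time to give $E_k^{M,\mathrm{Smooth}}[n]$ and $E_k^{D,\mathrm{Smooth}}[n]$. For the first frame, $E_k^{M,\mathrm{Smooth}}$ and $E_k^{D,\mathrm{Smooth}}$ with n = 0 are initialized as zero vectors. For the first slots of all other frames, both $E_k^{M,\mathrm{Smooth}}[n-1]$ and $E_k^{D,\mathrm{Smooth}}[n-1]$ are taken from the last slot of the previous frame. The energy adjustment uses the constants γ = 1.5 and ε = 1e‐9. Finally, the decorrelator outputs $D_m[n]$ are constructed from the energy‐adjusted filtered samples.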
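The energy adjustment stage can be sketched as below. The exact smoothing rule and gain formula of the standard are not reproduced here, so this is an illustration under assumptions: it uses a first-order recursive smoother with a hypothetical coefficient `alpha`, and a gain that keeps the ratio of the smoothed powers within a factor of γ. Only γ = 1.5 and ε = 1e-9 come from the text; every function name is hypothetical.

```python
import math

GAMMA = 1.5       # limit on the power ratio (value from the text)
EPSILON = 1e-9    # guard against division by zero (value from the text)


def band_power(samples):
    """Power of one parameter band in one time slot."""
    return sum(abs(s) ** 2 for s in samples)


def smooth(previous, current, alpha=0.5):
    """First-order recursive smoothing; alpha is an assumed coefficient."""
    return alpha * previous + (1.0 - alpha) * current


def adjustment_gain(e_m, e_d, gamma=GAMMA, eps=EPSILON):
    """Assumed rule: scale only when the two powers differ by more than gamma."""
    if e_d > gamma * e_m:                        # filtered signal too strong
        return math.sqrt(gamma * e_m / (e_d + eps))
    if e_m > gamma * e_d:                        # filtered signal too weak
        return math.sqrt(e_m / (gamma * e_d + eps))
    return 1.0


def adjust_band(m_slots, d_slots, state=(0.0, 0.0)):
    """Adjust the filtered samples of one parameter band, slot by slot.

    m_slots, d_slots -- per-slot lists of input / filtered samples of the band
    state            -- smoothed powers carried over from the previous frame;
                        zeros for the first frame
    """
    e_m_smooth, e_d_smooth = state
    out = []
    for m, d in zip(m_slots, d_slots):
        e_m_smooth = smooth(e_m_smooth, band_power(m))
        e_d_smooth = smooth(e_d_smooth, band_power(d))
        g = adjustment_gain(e_m_smooth, e_d_smooth)
        out.append([g * s for s in d])
    return out, (e_m_smooth, e_d_smooth)


adjusted, new_state = adjust_band([[1.0], [1.0]], [[1.0], [1.0]])
```

Returning the smoothed powers as `new_state` mirrors the frame-to-frame carry-over described above: the next frame starts from the last slot's smoothed values instead of from zero.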

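The OTT matrix operation in Figure 2 is represented as

$$\begin{bmatrix} X'_{1,m}[n] \\ X'_{2,m}[n] \end{bmatrix} = \mathbf{R}_a \begin{bmatrix} M_m[n] \\ D_m[n] \end{bmatrix},$$

where $M_m[n]$ is the downmixed hybrid subband signal; $D_m[n]$ is the output signal of the decorrelator for $M_m[n]$; and $X'_{1,m}[n]$ and $X'_{2,m}[n]$ are the upmixed signals. The MPS standard [1] specifies the entries of Ra as functions of the parameters ICC and CLD.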
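How a 2×2 upmix matrix can restore both parameters is illustrated by the following sketch. It assumes the commonly cited form of mixing procedure Ra known from Parametric Stereo, written here with gains c1, c2 and angles α, β; these symbols, the sign convention, and the function names are assumptions and may differ from the notation of [1].

```python
import math


def ra_matrix(cld_db, icc):
    """2x2 upmix matrix built from CLD (dB) and ICC (assumed Ra form)."""
    r = 10.0 ** (cld_db / 10.0)                  # linear power ratio
    c1 = math.sqrt(r / (1.0 + r))                # gain of output 1
    c2 = math.sqrt(1.0 / (1.0 + r))              # gain of output 2
    alpha = 0.5 * math.acos(icc)                 # half the angle between outputs
    beta = math.atan(math.tan(alpha) * (c2 - c1) / (c2 + c1))
    return [
        [c1 * math.cos(alpha + beta), c1 * math.sin(alpha + beta)],
        [c2 * math.cos(beta - alpha), c2 * math.sin(beta - alpha)],
    ]


def ott_upmix(m, d, ra):
    """Apply the matrix to one downmix sample m and one decorrelated sample d."""
    x1 = ra[0][0] * m + ra[0][1] * d
    x2 = ra[1][0] * m + ra[1][1] * d
    return x1, x2


ra = ra_matrix(6.0, 0.5)
# With m and d uncorrelated and of equal power, the output statistics follow
# directly from the rows of the matrix.
p1 = ra[0][0] ** 2 + ra[0][1] ** 2
p2 = ra[1][0] ** 2 + ra[1][1] ** 2
cross = ra[0][0] * ra[1][0] + ra[0][1] * ra[1][1]
```

Under that assumption, the ratio `p1 / p2` reproduces the CLD and `cross / sqrt(p1 * p2)` reproduces the ICC, which is why the decorrelator output must be uncorrelated with the downmix and carry matching energy.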

The matrix operations used for TTT have two modes, a prediction mode and an energy reconstruction mode, which are described in detail in [3].

Tree Structures

The common coding blocks for MPEG Surround mentioned in the previous subsection can be cascaded to form the desired spatial coding tree, depending on the specified numbers of inputs and outputs and on additional features. Though MPS can handle sequences with 7.1 channels, only the most common tree structures for 5.1‐channel input are described in this subsection.

The first type of tree structure for 5.1‐channel input comprises 5151 and 5152, which support a mono downmixed signal; they are shown in Figure 7a and Figure 7b, respectively. The six input channels, left front (L), right front (R), left surround (Ls), right surround (Rs), center (C), and low frequency enhancement (LFE), are fed into TTO boxes pairwise until a mono downmixed signal is obtained.

To downmix the inputs to a stereo signal, a second type of tree structure, commonly known as 525, is provided. Figure 8 shows the preferred tree structure for stereo downmix. L and Ls, R and Rs, and C and LFE are each processed by a separate TTO box, and the outputs of the three TTO boxes are then fed into an R‐TTT box to generate a stereo downmixed signal.
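The cascade for the stereo downmix can be sketched as follows. The sketch covers the signal path only and omits parameter extraction; the TTO box is modeled as the direct summation used earlier, while the R-TTT weights (center distributed equally to both sides with weight 1/√2) are a hypothetical choice for illustration, not taken from [1]. All function names are hypothetical.

```python
import math


def tto(a, b):
    """Two-To-One box: direct-sum downmix of two channels."""
    return [x + y for x, y in zip(a, b)]


def r_ttt(left, right, center, w=1.0 / math.sqrt(2.0)):
    """Reverse TTT box (assumed weights): fold the center into left and right."""
    out_left = [l + w * c for l, c in zip(left, center)]
    out_right = [r + w * c for r, c in zip(right, center)]
    return out_left, out_right


def encode_525(ch):
    """Stereo downmix of a 5.1 input given as a dict of channel sample lists."""
    left = tto(ch["L"], ch["Ls"])        # front and surround, left side
    right = tto(ch["R"], ch["Rs"])       # front and surround, right side
    center = tto(ch["C"], ch["LFE"])     # center and LFE
    return r_ttt(left, right, center)


channels = {
    "L": [1.0], "Ls": [2.0],
    "R": [3.0], "Rs": [4.0],
    "C": [2.0], "LFE": [0.0],
}
down_left, down_right = encode_525(channels)
```

Each box in the real encoder would also emit its spatial parameters (such as CLD and ICC for a TTO box), which is what lets the decoder run the tree in the opposite direction.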

To upmix the downmixed signal, the decoding tree structure is the inverse of the encoding one, which gives the results most similar to the original input. For detailed information, please refer to [1]‐[3].

Figure 7 Tree configurations for mono downmix [2]: (a) 5151 configuration; (b) 5152 configuration

Figure 8 Preferred tree configuration for stereo downmix [2]

From the document: Design of an MPEG Surround Downmixer and Decorrelator Based on Gram-Schmidt Orthogonalization (pages 14-23)