Chapter 2 Basic Components
2.1 Parametric Stereo Coding Concept
2.1.4 Stereo Processing
The stereo processing module, also called the mixing procedure in PS coding, reconstructs the stereo signal from the downmixed monaural signal, the de-correlation signal, and the delivered stereo parameters. For each sub-subband, the two channels are reconstructed as:
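\[
\begin{bmatrix} l(k,n) \\ r(k,n) \end{bmatrix}
= H \begin{bmatrix} m(k,n) \\ d(k,n) \end{bmatrix},
\qquad
H = \begin{bmatrix} H_{11} & H_{21} \\ H_{12} & H_{22} \end{bmatrix}
\tag{1}
\]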
In (1), m(k,n) is the n-th sample of the downmixed monaural signal in sub-subband k, and d denotes the de-correlation signal. H is the mixing matrix, which the mixing procedure calculates from the delivered stereo parameters.
Owing to the de-correlation procedure, the correlation between m and d is nearly zero, and their energies are equal because the de-correlator is an all-pass filter. Thus, to preserve the energy and the correlation in the reconstructed signal, H must satisfy the restrictions:
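\[
\begin{cases}
H_{11}^2 + H_{21}^2 = \dfrac{2\,l^2}{l^2 + r^2} \\[2mm]
H_{12}^2 + H_{22}^2 = \dfrac{2\,r^2}{l^2 + r^2} \\[2mm]
\dfrac{H_{11}H_{12} + H_{21}H_{22}}{\sqrt{(H_{11}^2 + H_{21}^2)(H_{12}^2 + H_{22}^2)}} = \rho
\end{cases}
\tag{2}
\]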
where l^2 and r^2 are the energies of the left and right channels, and ρ is the correlation of the stereo signal. If the energy of the monaural signal equals the average energy of the two channels, the first two restrictions on H guarantee that the energy of each channel is conserved.
The PS draft [10] introduces two mixing procedures, named Ra and Rb. Both approaches comply with the restrictions on the mixing matrix and reconstruct a signal that keeps the original energy ratio and correlation between the stereo channels. They differ in how they use the de-correlation signal: comparing H21 and H22 under the two approaches shows a sign difference, so the effect of the de-correlation signal on the two channels is reversed between them. Figure 9 illustrates this difference. The x axis denotes the energy ratio, the y axis the correlation of the signal, and the z axis the value of the mixing-matrix entry.
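The restrictions can be checked numerically. The sketch below assumes the convention l = H11·m + H21·d and r = H12·m + H22·d, and assumes the restrictions take the form "gain of each channel relative to the mono energy, plus a target correlation ρ". The function `example_mixing_matrix` is a hypothetical construction that merely satisfies these restrictions; it is not claimed to be Ra or Rb from the PS draft.

```python
import math

def example_mixing_matrix(energy_l, energy_r, rho):
    """Return (h11, h12, h21, h22) for one illustrative matrix.

    Assumption: a symmetric rotation-style construction chosen only because
    it satisfies the energy and correlation restrictions.
    """
    total = energy_l + energy_r
    gain_l = math.sqrt(2.0 * energy_l / total)  # left gain relative to mono
    gain_r = math.sqrt(2.0 * energy_r / total)  # right gain relative to mono
    alpha = 0.5 * math.acos(rho)                # half of the inter-channel angle
    return (gain_l * math.cos(alpha), gain_r * math.cos(alpha),
            gain_l * math.sin(alpha), -gain_r * math.sin(alpha))

def mix(m, d, h):
    """Apply l = h11*m + h21*d and r = h12*m + h22*d sample by sample."""
    h11, h12, h21, h22 = h
    left = [h11 * mi + h21 * di for mi, di in zip(m, d)]
    right = [h12 * mi + h22 * di for mi, di in zip(m, d)]
    return left, right

def energy(x):
    """Mean energy per sample."""
    return sum(v * v for v in x) / len(x)

def correlation(x, y):
    """Normalized cross-correlation at lag zero."""
    num = sum(a * b for a, b in zip(x, y))
    return num / math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))

# Deterministic stand-ins for m and d: orthogonal, with equal energy.
n = 64
energy_l, energy_r, rho = 4.0, 1.0, 0.6
amp = math.sqrt(energy_l + energy_r)  # so that mean energy = (l^2 + r^2) / 2
m = [amp * math.cos(2.0 * math.pi * i / n) for i in range(n)]
d = [amp * math.sin(2.0 * math.pi * i / n) for i in range(n)]

h = example_mixing_matrix(energy_l, energy_r, rho)
left, right = mix(m, d, h)
print(energy(left), energy(right), correlation(left, right))
```

With these inputs the reconstructed channels recover the target energies and correlation up to rounding, which is the property the restrictions are meant to guarantee.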
[Figure 9 panels: surface plots of H21 and H22 under Ra and under Rb; plotted values range from -1 to 1]
Figure 9: Difference between two mixing procedures
Chapter 3
T/F Stereo Parameter Extraction
3.1 T/F Stereo Parameter Extraction Concept
As mentioned above, the PS tool reconstructs the stereo signal from the monaural signal according to the delivered stereo parameters. Each frame contains 32 time slots in the time domain and many stereo bands in the frequency domain. This T/F resolution is too high for stereo parameters to be delivered at every point of a frame. Consequently, the PS draft [10] limits the number of stereo parameter sets delivered, as discussed in Chapter 2. In each frame there can be at most four time borders in the time domain, and one of three types of stereo band resolution can be selected in the frequency domain. In other words, the number of stereo parameters sent in one frame ranges from zero, when the frame has no time border, to 136, with four time borders and 34 stereo bands. The design issue of the T/F stereo parameter extraction module is how to decide the time borders and the stereo band resolution.
3.2 Existing Approaches in Other Codecs
In the literature, there are few publications on stereo parameter extraction. The 3GPP code [13] uses an essentially fixed stereo parameter configuration: a fixed stereo band resolution with two time regions per frame, each covering half of the frame. Similarly, dumping tracks encoded by NERO 7.0.8.2 [14] or Coding Technologies 7.0.5 [15] shows that these tracks always use a fixed stereo band resolution with only one time region per frame. None of these codecs makes an adaptive decision on T/F stereo parameter extraction; they simply update the stereo parameter sets at regular intervals. Because this approach does not reflect the signal content, it has some disadvantages. For a stationary signal, the stereo parameters in a time slot should be similar to the previous ones. In this situation no time border needs to be assigned to such frames, and each frame can follow the stereo setting of the previous frame. The fixed stereo parameter extraction nevertheless updates the stereo parameters of a stationary signal too often, so the bits spent on them are wasted. Conversely, a transient or rapidly varying signal might need many stereo parameter sets to record its variation.
The merit of the fixed approach is its low complexity, but the method does not reflect the signal content.
3.3 Adaptive T/F Stereo Parameter Extraction
As illustrated in Figure 10, the thesis divides “T/F Stereo Parameter Extraction” into two modules that adapt to the signal content: “Region Decision” (RD) for the time domain and “Stereo Band Decision” (SBD) for the frequency domain.
Both region decision and stereo band decision are control modules. In the block diagram, their outputs are drawn as dotted lines into the original PS encoder, where they control the related modules. Region decision finds the time borders at which the stereo parameter sets are updated. Stereo band decision decides the frequency resolution and informs the “Hybrid Filterbank Analysis” module to split the lower QMF subbands to achieve a higher frequency resolution. Together, these two modules decide the T/F resolution in a frame.
[Figure 10 block labels: AAC Encoder, Bit-stream Packing]
Figure 10: Block diagram of the PS encoder including RD and SBD modules
3.3.1 Region Decision
First, we consider how the PS decoder reconstructs the stereo signal from the monaural signal and the delivered parameter sets. In each stereo band of a time region, only one stereo parameter set is delivered; it is assigned to the last time slot of the region, and the parameter sets of the other time slots in the region are obtained by interpolation. Our algorithm finds the time regions and their border positions from the content of the stereo signal while keeping the reconstruction error well controlled.
3.3.1.1 Stereo Parameter Estimation and Error Calculation
Before searching for the time regions in a frame, a stereo parameter must be estimated for each data sample, that is, for each subband of each time slot. Because a stereo parameter is a statistical characteristic, it cannot be extracted from a single sample. The estimation method used here windows the nearby data samples to obtain approximate stereo parameters. For example, the approximate inter-channel intensity difference (IID) is calculated as
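\[
\mathrm{IID}_{t,b} = 10 \log_{10}
\frac{\sum_{i} w(i)\,\lvert l_{t+i,b} \rvert^{2}}{\sum_{i} w(i)\,\lvert r_{t+i,b} \rvert^{2}}
\tag{3}
\]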
Here t is the time index, b is the frequency index, w is the window sequence of length L, and l and r are the left and right channels. Because there are different types of stereo parameters, the estimated parameters of each type are first normalized. Each type of stereo parameter then has zero mean and the same standard deviation, so that different types of parameters can be averaged together. The normalization is
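\[
\hat{x}_{t,b} = \frac{x_{t,b} - \mu(x_{t,b})}{\sigma(x_{t,b})}
\tag{4}
\]
where x denotes one type of stereo parameter, and μ(x_{t,b}) and σ(x_{t,b}) are the mean and the standard deviation of x_{t,b}. With these estimates, the reconstruction error can be calculated for each possible region.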
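The estimation and normalization steps can be sketched as follows. This is a minimal illustration under stated assumptions: the window is centered on the time index, the small constant `eps` guards against a zero denominator, and the mean and standard deviation are taken over whatever list of values is passed in. None of these details are specified in the text, and the function names are hypothetical.

```python
import math

def windowed_iid_db(left, right, center, window):
    """Approximate IID in dB around time index `center` for one frequency index.

    Assumption: the window is centered on `center`; taps that fall outside
    the signal are skipped.
    """
    half = len(window) // 2
    num = 0.0
    den = 0.0
    for i, w in enumerate(window):
        t = center + i - half
        if 0 <= t < len(left):
            num += w * abs(left[t]) ** 2
            den += w * abs(right[t]) ** 2
    eps = 1e-12  # assumption: avoids log of zero and division by zero
    return 10.0 * math.log10((num + eps) / (den + eps))

def normalize(values):
    """Shift to zero mean and scale to unit standard deviation."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    if std == 0.0:
        return [0.0] * len(values)
    return [(v - mean) / std for v in values]

# Left channel has twice the amplitude of the right channel.
left = [2.0] * 8
right = [1.0] * 8
iid = windowed_iid_db(left, right, center=4, window=[0.25, 0.5, 0.25])
normalized = normalize([1.0, 2.0, 3.0])
print(iid, normalized)
```

After normalization, parameters of different types share the same scale, so they can be averaged into a single error measure.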
Parameters for the unassigned time slots in a region are generated in one of two ways. If a time region is bounded by time borders on both sides, the parameters are generated by linear interpolation. A region can also be the span between a time border and the frame boundary; in that case the parameters of the unassigned time slots keep the value of the preceding time border until the end of the frame. These two region types are listed in Table 1, where Δi,j(b) indicates the linear interpolation slope for Type-I.
Table 1: Two region types for error calculation
              Type-I: Linear interpolation       Type-II: Constant interpolation
Description   Region between two time borders    Region between time border and frame boundary
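The two region types can be sketched for a single stereo band as below. The squared-error measure is an assumption made for illustration, since the error formulas of the table are not reproduced here; `type1_error` and `type2_error` are hypothetical names. In both functions the region covers time slots `start` to `end` inclusive, and slot `start - 1` carries the preceding time border.

```python
def type1_error(params, start, end):
    """Type-I region: linear interpolation between two time borders.

    The values at slot start-1 (previous border) and slot end (this
    region's border) are kept; slots in between lie on a straight line.
    """
    anchor = params[start - 1]
    slope = (params[end] - anchor) / (end - (start - 1))
    err = 0.0
    for t in range(start, end + 1):
        predicted = anchor + slope * (t - (start - 1))
        err += (params[t] - predicted) ** 2  # assumption: squared error
    return err

def type2_error(params, start, end):
    """Type-II region: hold the previous border value until the frame ends."""
    held = params[start - 1]
    return sum((params[t] - held) ** 2 for t in range(start, end + 1))

# A linearly rising parameter is matched exactly by Type-I,
# while holding the previous value (Type-II) accumulates error.
ramp = [0.0, 1.0, 2.0, 3.0, 4.0]
print(type1_error(ramp, 1, 4))
print(type2_error(ramp, 1, 4))
```

For a stationary parameter track both types give zero error, which is why a frame without any time border can be sufficient for stationary signals.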
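From the reconstruction error of each possible region, the reconstruction error of any given region set can be calculated. First, the search space is defined as
\[
\mathcal{S} = \{ (s_1, s_2, \ldots, s_k) \mid 0 \le k \le 4,\; 0 \le s_1 < s_2 < \cdots < s_k \le 31 \}.
\tag{5}
\]
Any region set that conforms to this space is a possible solution for region decision. The objective function for a region set S is given as
\[
E(S) = \sum_{R \in \mathcal{R}(S)} e(R),
\tag{6}
\]
where \mathcal{R}(S) is the set of regions delimited by the borders in S and e(R) is the error of region R; E(S) is thus the sum of the errors of all regions in this region set. With the above formulas, the optimal solution is
\[
S^{*} = \arg\min_{S \in \mathcal{S}} \{ E(S) \}.
\tag{7}
\]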
Unfortunately, searching for the optimal solution with the brute-force method is too complex. The total number of possible region sets is
\[
1 + C^{32}_{1} + C^{32}_{2} + C^{32}_{3} + C^{32}_{4} = 41449.
\tag{8}
\]
Therefore, an efficient search algorithm should be adopted to solve this problem.
The thesis introduces a method based on dynamic programming that substantially decreases the complexity.
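The size of the brute-force search space in (8) can be verified with a few lines. The sketch assumes, as the count implies, that up to four border positions are chosen among the 32 time slots of a frame; the function name is hypothetical.

```python
from math import comb

def count_region_sets(time_slots=32, max_borders=4):
    """Number of ways to place 0..max_borders time borders on the time slots."""
    return sum(comb(time_slots, k) for k in range(max_borders + 1))

total = count_region_sets()
print(total)  # 41449
```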
3.3.1.2 Dynamic Programming
Dynamic programming (DP) is a subset of the general theory concerned with discrete sequential decisions. Its varied history and an extensive bibliography can be found in the article by Silverman and Morgan [16]. The prolific application of DP to various fields has been credited to Professor Richard Bellman [17]. The basic principle of DP is to break an optimization problem down into stages of decisions that follow a criterion leading to a recurrence relation. For audio compression, we have already applied the DP algorithm to efficiently design the MS coding and the Huffman codebook search [18]. This thesis applies the DP algorithm to search for the time borders efficiently.
3.3.1.3 Region Decision based on Dynamic Programming
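To apply dynamic programming to region decision, we need to formulate the problem as a decision problem and break it into stages of recursive decisions on sub-problems. First, we define the symbols as follows:
s_i: the i-th inner time border.
e_{i,j}^{(k)}(s_1, ..., s_k): the reconstruction error of the parameters in the range from time slot i to time slot j under the time region set with k inner time borders s_1, ..., s_k, given that there is a border at time slot i-1.
E_{i,j}^{(k)}: the minimum reconstruction error of the parameters in the range from time slot i to time slot j among all possible time region sets with k inner borders, given that there is a border at time slot i-1.
With the above symbol definitions, the minimum reconstruction error can be written as
\[
E_{i,j}^{(k)} = \min_{s_1 < s_2 < \cdots < s_k} e_{i,j}^{(k)}(s_1, \ldots, s_k).
\]
The optimal sub-structure of E_{i,j}^{(k)} can be explored as follows. Assume the optimum k borders are s_1', s_2', ..., s_k'; we have
\[
E_{i,j}^{(k)} = E_{i,s_1'}^{(0)} + E_{s_1'+1,\,j}^{(k-1)}
= \min_{i \le s < j} \left\{ E_{i,s}^{(0)} + E_{s+1,\,j}^{(k-1)} \right\}.
\]
Figure 11 illustrates this condition.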
[Figure 11 diagram: time-frequency plane from time slot i to time slot j with k borders; the first border is followed by the remaining k-1 borders]
Figure 11: Dynamic programming in Region Decision
[Figure 12 diagram: time-frequency plane showing E_{i,j}^{(0)} over a region of length L, from x_i(b) to x_j(b)]
Figure 12: E_{i,j}^{(0)}, the time region from time slot i to time slot j
Each E_{i,j}^{(k)} can be constructed recursively for all k > 0. Therefore, E_{i,j}^{(0)} must be calculated for all i and j > i at the initialization of dynamic programming. Figure 12 illustrates E_{i,j}^{(0)}, where b is the QMF subband index, x is the variable holding the stereo parameter value, and L is the length of the region covered by E_{i,j}^{(0)}.
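The recursion can be sketched as below. This is an illustrative implementation under assumptions, not the thesis code: `region_error(i, j)` is a caller-supplied, hypothetical function returning the error of one region covering slots i..j without inner borders, a border at slot s ends a region at s, and the demo uses the squared deviation from the region mean as a stand-in error measure.

```python
import math

def region_decision_dp(region_error, num_slots, max_borders=4):
    """Return {k: (min_error, borders)} for k = 0..max_borders inner borders."""
    last = num_slots - 1
    # Initialization: error of every single region i..j (the k = 0 case).
    e0 = [[region_error(i, j) if j >= i else math.inf
           for j in range(num_slots)] for i in range(num_slots)]
    # best[k][i]: minimum error for slots i..last using k inner borders.
    best = [[math.inf] * (num_slots + 1) for _ in range(max_borders + 1)]
    choice = [[None] * (num_slots + 1) for _ in range(max_borders + 1)]
    for i in range(num_slots):
        best[0][i] = e0[i][last]
    for k in range(1, max_borders + 1):
        for i in range(num_slots):
            for s in range(i, last):  # first inner border placed at slot s
                candidate = e0[i][s] + best[k - 1][s + 1]
                if candidate < best[k][i]:
                    best[k][i] = candidate
                    choice[k][i] = s
    results = {}
    for k in range(max_borders + 1):
        borders, i = [], 0
        for remaining in range(k, 0, -1):  # backtrack the chosen borders
            s = choice[remaining][i]
            if s is None:
                break
            borders.append(s)
            i = s + 1
        results[k] = (best[k][0], borders)
    return results

# Demo with a piecewise-constant parameter track (hypothetical error measure).
track = [0.0, 0.0, 0.0, 5.0, 5.0, 5.0, 9.0, 9.0]

def mean_deviation_error(i, j):
    segment = track[i:j + 1]
    mean = sum(segment) / len(segment)
    return sum((v - mean) ** 2 for v in segment)

results = region_decision_dp(mean_deviation_error, len(track), max_borders=2)
print(results[2])
```

The table for k borders is built only from the k-1 table and the single-region errors, so the search costs on the order of K·N² additions instead of enumerating all 41449 region sets.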
[Figure 13 flow chart: Begin; test "Error ≤ Threshold"; otherwise k++ and repeat]
Figure 13: Flow chart of DP in Region Decision
The flow chart of the dynamic programming is illustrated in Figure 13. The algorithm calculates the minimum errors starting from the case without any inner border, up to the case with four inner borders in the frame. Figure 13 also shows a termination threshold. As the number of regions increases, the reconstruction error decreases, but the bits used in a frame increase. For this reason, the solution with the minimum reconstruction error might lead to too much bit overhead. That is to say, although searching for the minimum reconstruction error gives the optimal region set, the overall coding quality might suffer. The threshold balances the quality requirement against the limited available bits.
Therefore, instead of finding the regions with the minimum reconstruction error, the algorithm uses a threshold to stop the DP search. It selects the smallest number of regions for which the error first falls below the threshold and, among all region sets with that number of regions, the one with the minimum reconstruction error. Moreover, this threshold must depend on the bit-rate. Based on our tuning results, the thesis suggests three thresholds for the target bit-rates 48, 36, and 24 kbps. For other bit-rates, the threshold is obtained by interpolation.
Table 2: The relative threshold for DP under three target bit-rates
Bit-rate (kbps)      48    36    24
Relative Threshold   1.4   1.5   1.6
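The threshold lookup and the stopping rule can be sketched as follows. The values come from Table 2; linear interpolation and clamping outside the 24 to 48 kbps range are assumptions, since the text only states that interpolation is used. Both function names are hypothetical.

```python
TUNED_THRESHOLDS = {24: 1.6, 36: 1.5, 48: 1.4}  # values from Table 2

def relative_threshold(bitrate_kbps):
    """Interpolate the threshold linearly between the tuned bit-rates.

    Assumption: bit-rates outside 24..48 kbps are clamped to the nearest
    tuned value.
    """
    rates = sorted(TUNED_THRESHOLDS)
    if bitrate_kbps <= rates[0]:
        return TUNED_THRESHOLDS[rates[0]]
    if bitrate_kbps >= rates[-1]:
        return TUNED_THRESHOLDS[rates[-1]]
    for low, high in zip(rates, rates[1:]):
        if low <= bitrate_kbps <= high:
            frac = (bitrate_kbps - low) / (high - low)
            return TUNED_THRESHOLDS[low] + frac * (
                TUNED_THRESHOLDS[high] - TUNED_THRESHOLDS[low])

def choose_border_count(min_errors, threshold, max_borders=4):
    """Smallest k whose minimum error is within the threshold, else the cap.

    min_errors[k] is the minimum reconstruction error with k inner borders.
    """
    limit = min(max_borders, len(min_errors) - 1)
    for k in range(limit + 1):
        if min_errors[k] <= threshold:
            return k
    return limit

threshold_42 = relative_threshold(42)
chosen = choose_border_count([3.0, 2.0, 1.4, 0.9, 0.2], threshold_42)
print(threshold_42, chosen)
```

In the example the search stops at two borders even though three or four borders would lower the error further, which is the intended trade between reconstruction error and bit overhead.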
However, this threshold needs to be conservative, which can be risky for some tracks. If successive frames have high reconstruction errors but still do not meet the threshold condition, our algorithm will not update the stereo parameters in these frames. To avoid this risk, the thesis suggests an additional aggressive threshold, which forces an update of the stereo parameters in this situation. Another method is to update the parameters regularly, which also solves the seek problem during playback. Figure 14 illustrates the complete flow chart of region decision based on dynamic programming.
[Figure 14 flow chart: stereo parameter estimation and error calculation; test "Error ≤ Threshold or k = 4"; if No, k++ and error calculation for the next loop]
Figure 14: Complete flow chart of Region Decision
3.3.2 Stereo Band Decision
The stereo band decision module controls the hybrid analysis filterbank to decide the resolution in the frequency domain. The 64 QMF subbands can be divided into 71 or 91 sub-subbands and combined into 10, 20, or 34 stereo bands. Since only one stereo parameter set can be updated for a stereo band in a time region, the QMF subbands in the same stereo band share the same stereo parameter set. That is, these QMF subbands should have similar stereo characteristics. From this viewpoint, the similarity of the subbands should be measured.
[Figure 15 diagram: stereo band k spanning QMF subbands i, i+1, ..., j across the time regions Region 1, Region 2, and Region 3]
Figure 15: Stereo Band Decision in stereo band k
Figure 15 illustrates a stereo band k, which contains the QMF subbands indexed i, i+1, ..., j. In this example, there are three time regions in stereo band k. The module first estimates a stereo parameter set for each time region T and QMF subband b. As in region decision, the stereo parameters here also need to be normalized, using the same normalization as in (4).
Using these normalized parameters, the SBD module calculates their variance in each stereo band that contains more than one QMF subband. If the variance in a stereo band is small enough, the QMF subbands in this stereo band have similar stereo parameters and can be combined. These variances therefore serve as the measure of similarity. Finally, the variances are summed for each type of stereo band resolution, and the smallest sum indicates the frequency resolution most suitable for the signal content. Figure 16 illustrates the flow chart of the SBD module.
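The variance-based decision can be sketched for a single time region as below. The band groupings in the demo are hypothetical toy layouts, not the 10, 20, or 34 band tables of the PS draft, and the function names are introduced only for illustration.

```python
def variance(values):
    """Population variance of a list of numbers."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def resolution_score(params_per_subband, band_groups):
    """Sum of within-band variances for one candidate resolution.

    Stereo bands holding a single QMF subband contribute nothing.
    """
    total = 0.0
    for group in band_groups:
        if len(group) > 1:
            total += variance([params_per_subband[b] for b in group])
    return total

def decide_resolution(params_per_subband, candidates):
    """Pick the candidate resolution with the smallest variance sum."""
    scores = {name: resolution_score(params_per_subband, groups)
              for name, groups in candidates.items()}
    return min(scores, key=scores.get), scores

# Normalized parameter of six QMF subbands in one time region (toy data).
params = [0.0, 0.0, 1.0, 1.0, 1.0, 1.0]
candidates = {
    "coarse": [[0, 1, 2], [3, 4, 5]],   # hypothetical grouping
    "fine": [[0, 1], [2, 3], [4, 5]],   # hypothetical grouping
}
best_name, scores = decide_resolution(params, candidates)
print(best_name, scores)
```

Here the coarse grouping merges subbands with different parameters into one stereo band, so its variance sum is larger and the finer grouping is selected.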
[Figure 16 flow chart blocks: Begin, Stereo Parameter Estimation, Stereo Parameter Normalization, Variance Calculation, Variances Summation under Same Resolution, Resolution Decision, Done]
Figure 16: Flow chart of Stereo Band Decision
3.4 T/F Stereo Parameter Extraction Summary
The sections above reviewed the existing method, which uses fixed stereo parameter sets, and proposed an adaptive T/F stereo parameter extraction to overcome its shortcomings. This adaptive method has already been published in [19]. The quality measurements in Chapter 5 assess the improvement of this method.
Chapter 4
Downmix Method
4.1 Downmix Method Concept
The purpose of PS coding is to combine the stereo signal into a monaural signal accompanied by a few stereo parameters, which greatly reduces the bit-rate. Moreover, at such low bit-rates only a few bits are available for PS; most bits support the encoding of the monaural signal to maintain a certain coding quality. Therefore, the monaural downmix signal is the essential source of quality for the reconstructed stereo signal. The design issue of the downmix method is how to preserve the information of the original stereo signal. Furthermore, the mixing procedure in Chapter 2 places some restrictions on the downmix signal, which the downmix method should also take into account. The following sections first discuss the averaging approach and then suggest an approach that avoids its problems.
4.2 Averaging Approach
In the PS draft [10], the monaural downmix signal is generated according to
\[
M = \frac{L + R}{2}.
\tag{14}
\]
This approach simply takes the average of the two input channels. Therefore, the downmix signal may suffer from the energy cancellation problem illustrated in Figure 17. If the two channels are in anti-phase and have the same magnitude, the monaural downmix signal is cancelled completely. Under this energy cancellation, the PS tool cannot reconstruct the original signal; in other words, the stereo information between the two channels might be lost. The problem also violates the restriction of the mixing procedure, which requires the energy of the downmix signal to be the average energy of the stereo signal so that energy is conserved.
[Figure 17 diagram: averaging downmix approach]
Figure 17: Energy cancellation problem in averaging approach
The method adopted in 3GPP [13] is based on the averaging approach. It adds a post-processing step with an energy-adjusting scale to meet the demand of the mixing procedure. This method appears to satisfy the mixing restriction, but it cannot recover the stereo information that has disappeared from the average signal. Besides, the spectral structure destroyed by cancellation remains destroyed after scaling. Therefore, although the averaging approach is easy to implement, it does not adapt to the signal content and can wreck the signal structure.
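The cancellation problem and the limit of energy scaling can be reproduced in a few lines. The scaling rule below (a single gain that makes the mono energy equal to the average channel energy) is an assumed, simplified form of such a post-processing step and is not taken from the 3GPP code; the function names are hypothetical.

```python
import math

def average_downmix(left, right):
    """Averaging downmix M = (L + R) / 2, as in (14)."""
    return [(l + r) / 2.0 for l, r in zip(left, right)]

def energy(x):
    """Total energy of a block of samples."""
    return sum(v * v for v in x)

def energy_scaled_downmix(left, right):
    """Averaging downmix followed by an energy-adjusting gain.

    Assumption: the gain targets the average energy of the two channels.
    """
    mono = average_downmix(left, right)
    target = (energy(left) + energy(right)) / 2.0
    current = energy(mono)
    if current == 0.0:
        return mono  # fully cancelled: no gain can restore the signal
    gain = math.sqrt(target / current)
    return [gain * v for v in mono]

# Anti-phase channels with equal magnitude cancel completely.
left = [1.0, -1.0, 1.0, -1.0]
right = [-1.0, 1.0, -1.0, 1.0]
cancelled = average_downmix(left, right)
rescaled = energy_scaled_downmix(left, right)

# Partially correlated channels: scaling restores the energy only.
partial = energy_scaled_downmix([1.0, 0.0], [0.0, 1.0])
print(cancelled, rescaled, partial)
```

In the anti-phase case the downmix is all zeros before and after scaling; in the partial case the energy is restored, but the waveform is still the plain average, so any structure lost in the sum stays lost.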
4.3 Downmix Method based on Karhunen-Loève Transform
As mentioned above, the design issue of the downmix method is to preserve as much information of the original stereo signal as possible. The following sections introduce an approach based on the Karhunen-Loève Transform.
4.3.1 KLT-based Approach
Our goal here is to generate features via linear transforms of the stereo signal. The basic concept is to transform a given set of samples into a new set of features. The Karhunen-Loève Transform (KLT) [20] is such an approach. Its transform-domain features exhibit high information-packing properties: most of the classification-related information is squeezed into a relatively small number of features, which reduces the necessary dimension of the feature space.
To adopt the KL transform for the downmix coefficient vector, we first define some symbols as follows:
U (2×32): sample matrix, which contains the samples in the two corresponding subbands
V (2×32): transformed matrix
R_U, R_V: auto-correlation matrices of the sample matrix and the transformed matrix
Φ (2×2): KL transform matrix
With the above symbol definitions, the transform equation can be written as:
\[
V = \Phi^{T} U.
\tag{15}
\]
The KL transform matrix Φ consists of the orthonormal eigenvectors of R_U. After the KL transform, R_V is diagonal, that is, the rows of V are uncorrelated. Because the transform is used for downmixing, the resulting sample set is the row of V associated with the eigenvector of the largest eigenvalue. In other words, the eigenvector with the largest eigenvalue is the downmix coefficient vector that generates the monaural signal.
The KLT is the optimal transform in terms of energy compaction because it makes the transformed components uncorrelated with orthogonal basis vectors. Figure 18 shows a simple view of the KLT flow path: the auto-correlation matrix is first built from the two corresponding subbands, and then the wanted eigenvector is calculated. The eigenvector is delivered to the downmix method to generate the monaural signal. The block diagram of the PS encoder with the KLT module is shown in Figure 19.
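The flow path can be sketched for one pair of subbands as below. This is a simplified illustration under assumptions: the samples are real-valued lists, the 2×2 auto-correlation matrix is estimated by plain averaging, and the principal eigenvector is obtained in closed form from the rotation angle of a symmetric 2×2 matrix. The function name `klt_downmix` is hypothetical.

```python
import math

def klt_downmix(left, right):
    """Project two channels onto the principal eigenvector of R_U.

    Returns the downmix coefficient vector (w_l, w_r) and the mono signal.
    """
    n = len(left)
    r_ll = sum(a * a for a in left) / n
    r_rr = sum(b * b for b in right) / n
    r_lr = sum(a * b for a, b in zip(left, right)) / n
    # Principal-axis angle of the symmetric matrix [[r_ll, r_lr], [r_lr, r_rr]].
    theta = 0.5 * math.atan2(2.0 * r_lr, r_ll - r_rr)
    w_l, w_r = math.cos(theta), math.sin(theta)  # unit-norm eigenvector
    mono = [w_l * a + w_r * b for a, b in zip(left, right)]
    return (w_l, w_r), mono

# Anti-phase channels: averaging would cancel, the KLT projection does not.
left = [1.0, -1.0, 1.0, -1.0]
right = [-1.0, 1.0, -1.0, 1.0]
weights, mono = klt_downmix(left, right)
mono_energy = sum(v * v for v in mono) / len(mono)
print(weights, mono_energy)
```

For the anti-phase input the coefficient vector has opposite signs for the two channels, so the projection adds the channels constructively and the mono signal carries the energy of the largest eigenvalue.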
[Flow chart block: Build Auto-correlation Matrix]