Chapter 2 Spatial Hearing
2.4 Conclusions
The main point emphasized here is that the perceived direction of a sound source is determined by the ITD, ILD, ICTD, and ICLD. The other parameters, IC and ICC, are used to measure the width (spatial diffuseness) of the perceived auditory spatial image. These parameters also play an important role in capturing and generating sound in spatial audio systems, such as stereo or multi-channel audio playback.
Chapter 3
MPEG Surround
In this chapter, we briefly review several stereo and multi-channel algorithms related to MPEG Surround coding, including Intensity Stereo Coding (ISC), Parametric Stereo coding (PS), and Binaural Cue Coding (BCC). We then introduce the basic concepts and major modules of the MPEG Surround coder, which compresses a multi-channel audio signal at a very low bitrate and provides an extremely efficient method for coding multi-channel sound. Finally, the interconnection of MPEG Surround with the MPEG-4 HE-AAC structure is described.
3.1 Related Techniques
3.1.1 Intensity Stereo Coding [3]
Intensity stereo coding (ISC) is a joint-channel coding technique that is part of the ISO/IEC MPEG family of standards [4]. By removing perceptually irrelevant information between audio channel pairs, it reduces the bit-rate needed for encoding stereo or multi-channel signals and is more efficient than coding each channel separately. ISC exploits the fact that the human auditory system is sensitive to both the amplitude and the phase of low-frequency signals, whereas at high frequencies it remains sensitive to amplitude but is much less sensitive to phase.
Thus, at high frequencies, the original left and right subband signals are replaced by a sum signal and a direction angle (azimuth) which controls the intensity stereo position of the auditory event created at the decoder.
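The idea can be illustrated with a short sketch. It is not the MPEG algorithm: it assumes a single high-frequency subband given as real-valued arrays, derives the azimuth from the two band energies, and rescales the sum signal so that it carries the total energy of both channels. The function names `isc_encode` and `isc_decode` are hypothetical.

```python
import numpy as np

def isc_encode(left, right):
    """Return (scaled sum signal, azimuth in radians) for one subband."""
    s = left + right
    e_l = np.sum(left ** 2)
    e_r = np.sum(right ** 2)
    # Azimuth 0 means all energy on the left, pi/2 all energy on the right.
    azimuth = np.arctan2(np.sqrt(e_r), np.sqrt(e_l))
    # Rescale the sum so that its energy equals the energy of both inputs.
    e_s = np.sum(s ** 2)
    gain = np.sqrt((e_l + e_r) / e_s) if e_s > 0 else 0.0
    return gain * s, azimuth

def isc_decode(s, azimuth):
    """Pan the sum signal back to a left/right pair."""
    return np.cos(azimuth) * s, np.sin(azimuth) * s

# Example: the right channel is an attenuated copy of the left channel.
x = np.linspace(-1.0, 1.0, 16)
s, az = isc_encode(2.0 * x, x)
l_hat, r_hat = isc_decode(s, az)
```

For in-phase channels such as these, the level relation is recovered exactly; any phase difference between the original channels is lost, which is why the technique is restricted to high frequencies.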
3.1.2 Parametric Stereo Coding [5] [6]
Since ISC is prone to aliasing artifacts and is typically applied only to higher frequency bands, Parametric Stereo (PS) technology was proposed to overcome these limitations. PS is standardized in MPEG-4 and is the next major step in improving the efficiency of audio compression for low bit-rate stereo signals. It is used in conjunction with the MPEG-4 HE-AAC (aacPlus) codec; the combination is known as HE-AAC v2 or Enhanced aacPlus.
PS employs a dedicated (complex-valued, over-sampled) filter bank to avoid aliasing artifacts resulting from the spectral modification applied when generating the output channels. In addition, it synthesizes not only the intensities but also the phase differences and coherence between the output channels. Due to these improvements, the PS tool can operate on the full audio bandwidth. In such a system, the stereo signal (a pair of signals) is reconstructed from the transmitted mono signal with the help of the stereo parameters.
3.1.3 Binaural Cue Coding [7] [8]
The Binaural Cue Coding (BCC) approach can be viewed as a generalization of the parametric stereo idea, delivering multi-channel output (with an arbitrary number of channels) from a single audio channel plus some side-information. Figure 3.1 shows the generalized block diagram of BCC encoder and decoder.
Figure 3.1: Generic scheme for binaural cue coding (BCC)
In the encoder, the multi-channel input signals are combined into a single sum signal by a downmix process. At the same time, the multi-channel sound image is extracted and parameterized as BCC side-information. The decoder reproduces the multi-channel output signals from these data. Because BCC requires only a very low bit-rate (2 kb/s) to encode the side-information, the total bit-rate is only slightly higher than what is required to represent a mono audio signal. Another advantage of this scheme is its backward compatibility with non-multi-channel audio systems: receivers that do not support multi-channel audio simply ignore the side-information and decode the sum signal.
3.1.3.1 Estimation of BCC parameters
As shown in Figure 3.2(a), the BCC parameters including the inter-channel level difference (ICLD), the inter-channel time difference (ICTD), and the inter-channel coherence (ICC) are estimated in the subband domain. The estimation process is applied independently to each subband.
Figure 3.2: BCC parameter estimation: (a) estimation in the subband domain, (b) 5-channel example
Figure 3.2(b) shows an example for a 5-channel environment. The ICTD and ICLD between a reference channel (e.g., the left channel) and each of the other channels are estimated. A single ICC is estimated between the channel pair with the largest power, to describe the overall coherence among all audio channels.
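A minimal sketch of such an estimation for one subband and one channel pair is given below. It is an illustration under stated assumptions rather than the estimator of [7] [8]: the subband signals are taken as real-valued arrays, the ICTD is the lag that maximizes the cross-correlation, and the ICC is the normalized cross-correlation at that lag. The function name `bcc_cues` and the search range `max_lag` are hypothetical.

```python
import numpy as np

def bcc_cues(ref, other, max_lag=32):
    """Estimate (ICLD in dB, ICTD in samples, ICC) of `other` relative to `ref`."""
    p_ref = np.sum(ref ** 2)
    p_oth = np.sum(other ** 2)
    icld = 10.0 * np.log10(p_oth / p_ref)
    best_lag, best_val = 0, -np.inf
    for lag in range(-max_lag, max_lag + 1):
        # Correlate ref[n] with other[n + lag]; a positive lag means
        # that `other` is a delayed version of `ref`.
        if lag >= 0:
            c = np.dot(ref[:len(ref) - lag], other[lag:])
        else:
            c = np.dot(ref[-lag:], other[:len(other) + lag])
        if c > best_val:
            best_lag, best_val = lag, c
    icc = best_val / np.sqrt(p_ref * p_oth)
    return icld, best_lag, icc

# Example: `other` is `ref` delayed by 5 samples and halved in amplitude.
rng = np.random.default_rng(0)
ref = rng.standard_normal(1024)
other = np.zeros_like(ref)
other[5:] = 0.5 * ref[:-5]
icld, ictd, icc = bcc_cues(ref, other)
```

In a multi-channel setting this function would be called once per non-reference channel and per subband for the ICLD and ICTD, and once for the strongest channel pair for the ICC.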
3.1.3.2 Synthesis of BCC parameters
The BCC synthesis scheme is shown in Figure 3.3. First, the downmixed sum signal is converted into the frequency domain by a filter bank. For each output channel, individual time delays and scale factors are imposed on the spectral coefficients to re-synthesize the ICTD and ICLD, respectively. A coherence synthesis process then re-synthesizes the ICC. Finally, all output channels are converted back into time-domain signals.
Figure 3.3: BCC synthesis
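The delay and scaling stages can be sketched as follows. This is a simplified illustration, not the scheme of [7] [8]: a DFT stands in for the filter bank, one delay and one gain are used per output channel instead of per subband, delays are circular, and the coherence synthesis stage is omitted. The function name `bcc_synthesize` is hypothetical.

```python
import numpy as np

def bcc_synthesize(s, delays, gains_db):
    """Create one output channel per (delay, gain) pair from the sum signal s."""
    n = len(s)
    S = np.fft.rfft(s)                         # stand-in for the filter bank (FB)
    omega = 2.0 * np.pi * np.fft.rfftfreq(n)   # bin frequencies in rad/sample
    outputs = []
    for d, g_db in zip(delays, gains_db):
        a = 10.0 ** (g_db / 20.0)              # ICLD: level in dB to scale factor
        X = a * S * np.exp(-1j * omega * d)    # ICTD: delay as a linear phase
        outputs.append(np.fft.irfft(X, n))     # stand-in for the inverse bank (IFB)
    return outputs

# Example: second channel delayed by 3 samples and attenuated by about 6 dB.
rng = np.random.default_rng(1)
s = rng.standard_normal(64)
y0, y1 = bcc_synthesize(s, delays=[0, 3], gains_db=[0.0, 20.0 * np.log10(0.5)])
```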
3.2 MPEG Surround
MPEG Surround can be viewed as an enhancement of the techniques mentioned previously, for example as a multi-channel extension of Parametric Stereo coding or as a generalized version of BCC. In the following sections, we describe the standardization process of MPEG Surround and its structure.
3.2.1 MPEG Surround Standardization Process
Motivated by the demonstrated potential of what was then called the Spatial Audio Coding approach, ISO/MPEG started a new work item on the parametric coding of multi-channel audio signals by issuing a Call for Proposals (CfP) on Spatial Audio Coding in March 2004 [9]. Four responses were received and evaluated with a number of performance measures, including the subjective quality of the decoded multi-channel audio signal, the subjective quality of the downmix signals, the spatial cue side-information bitrate, and other criteria such as additional functionality and computational complexity.
As a result of these extensive evaluations, the MPEG committee decided that the starting point for the standardization process, called Reference Model 0 (RM0), would be a combination of the submissions from two proponents: Fraunhofer IIS/Agere Systems and Coding Technologies/Philips. These systems not only outperformed the other submissions but also showed complementary performance in terms of per-item quality, bitrate, and complexity. Consequently, the merged RM0 (now called MPEG Surround) was designed to combine the best features of both individual systems and was found to fully meet the performance expectations. RM0 provides sound quality substantially surpassing that of existing matrixed surround solutions, even when a mono downmix signal is transmitted or when the spatial cue bitrate is as low as 6 kbit/s. It serves as the basis for further technical development within MPEG-4 Audio. An extended description of the technology can be found in [10] and [11].
3.2.2 MPEG Surround Reference Model 0 Scheme
Rather than performing a discrete coding of the individual audio input channels, Spatial Audio Coding efficiently codes a multi-channel audio signal as a stereo (or even monaural) signal plus a small amount of side information describing the multi-channel spatial image. Figures 3.4 and 3.5 show the block diagrams of the MPEG Surround RM0 encoder and decoder, respectively. The input signals are first decomposed into separate frequency bands by the analysis filter banks. The frequency selectivity of these filter banks is tailored specifically towards mimicking the frequency resolution of the human auditory system. The MPEG Surround encoder then captures the spatial image of the multi-channel audio signal and condenses it into a compact set of parameters. These parameters typically include level/intensity differences and measures of correlation/coherence between the audio channels. In parallel, a stereo (or monaural) downmix signal of the sound material is created. The downmix signal is transformed back to the time domain by the synthesis filter banks and is transmitted to the decoder together with the spatial information. On the decoder side, the transmitted downmix signal is expanded into high-quality multi-channel output based on the received spatial parameters.
Figure 3.4: MPEG Surround Encoder Overview [15]
Figure 3.5: MPEG Surround Decoder Overview [15]
Moreover, to achieve a higher compression rate, MPEG Surround coding can be combined with a conventional state-of-the-art coder (Audio Encoder and Audio Decoder in Figures 3.4 and 3.5). The downmix signal is encoded with a core coder such as MPEG-1 Layer III (mp3), MPEG-2/4 AAC, or MPEG-4 High Efficiency AAC, or it could even be PCM.
In this way, the MPEG Surround coder acts as a pre-processor to the core encoder and as a post-processor to the core decoder. MPEG Surround coding is thus able to provide complete backward compatibility with non-multi-channel audio systems through the downmix signal: a receiver without an MPEG Surround decoder will simply decode and present the downmix signal.
3.2.3 Time to Frequency Transform
In the human auditory system, the processing of binaural cues is performed on a non-uniform frequency scale. Since the spatial parameters are estimated (at the encoder side) and applied (at the decoder side) as a function of time and frequency, both the encoder and decoder require a transform or filter bank that resembles this non-uniform scale. Furthermore, the transform or filter bank should be over-sampled, because the time- and frequency-dependent modifications made to the signals would lead to audible aliasing distortion in a critically sampled system.
MPEG Surround employs a two-stage filter bank to satisfy the above requirements. Figures 3.6 and 3.7 show the structure of the hybrid QMF analysis and synthesis filter banks, respectively.
The first-stage filter bank is a complex-modulated Quadrature Mirror Filter (QMF) bank to obtain a uniform, over-sampled, frequency representation of the audio signal. The signals of the lowest QMF subbands are subsequently fed through a second complex-modulated filter bank to provide a higher resolution of low frequencies.
Figure 3.6: Hybrid QMF analysis filter bank providing 71 output bands. The input is fed into a 64-band analysis QMF bank (dashed box). The three lower QMF subbands are further split to increase low frequency resolution (see shadowed box).
Figure 3.7: Hybrid QMF synthesis filter bank using 71 input bands. The low frequency coefficients are simply added (see shadowed box) prior to the synthesis with the QMF.
3.2.4 Analysis Quadrature Mirror Filter (AQMF) Bank
The first filter bank is compatible with the filter bank used in the SBR algorithm [17]. The subband signals generated by this filter bank are obtained by convolving the input signal with a set of analysis filter impulse responses h_k[n] given by:
$$h_k[n] = p[n] \exp\left\{ j \frac{\pi}{M} \left(k + \frac{1}{2}\right) \left(n - \frac{1}{2}\right) \right\},$$
where p[n] represents the impulse response of the low-pass prototype filter of length 640, M the number of frequency bands (M = 64), and k the subband index (k = 0, …, M−1). The filtered outputs are subsequently down-sampled by a factor M, resulting in the down-sampled QMF outputs X_k[n] = (x ∗ h_k)[Mn].
The equation above is purely analytical. In practice, the computational complexity can be reduced by using the polyphase decomposition described in the following steps, in which an array x consisting of 640 time-domain input samples is assumed. Higher indices in the array correspond to older samples. Figure 3.8 shows the QMF analysis window.
Figure 3.8: QMF analysis windowing [17]. Indices 0 to 31 represent the 32 windows.
The QMF process is as follows.
1. Shift the samples in the array x by 64 positions. The oldest 64 samples are discarded and the 64 new samples are stored in positions 0 to 63.
2. Multiply the samples of array x by the window c to obtain array Z: Z[n] = x[n] × c[n], for n = 0 to 639. The 640 window coefficients c are shown in Figure 3.9.
3. Sum the samples to form the 128-element array u according to the formula
$$u[n] = \sum_{j=0}^{4} Z[n + 128j], \quad n = 0, \ldots, 127.$$
4. Calculate 64 new subband samples by the matrix operation X = Mu, where the entries of the 64 × 128 matrix M are complex exponentials; exp() denotes the complex exponential function and i is the imaginary unit.
Figure 3.9: Coefficients of the QMF bank window
Every loop produces 64 complex-valued subband samples, each representing the output of one subband. For every frame, the filter bank produces 32 subband samples per subband, corresponding to a time-domain signal of 2048 samples.
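The four steps can be sketched in Python and checked against the direct form. The sketch rests on several assumptions that are not taken from [17]: the modulation is h_k[n] = p[n] exp{jπ/M (k + 1/2)(n − 1/2)}, the prototype `p` is a hypothetical windowed sinc rather than the standardized 640 coefficients, the window `c` is the prototype with the sign of every second block of 128 coefficients inverted (the modulation term changes sign every 128 samples), and the matrix `Mmat` is derived from the same modulation term.

```python
import numpy as np

M = 64    # number of QMF bands
L = 640   # prototype filter length

n = np.arange(L)
# Hypothetical low-pass prototype with cutoff pi/(2M); NOT the standard table.
p = np.sinc((n - (L - 1) / 2) / (2 * M)) * np.hanning(L)
k = np.arange(M)[:, None]

def direct_slot(x):
    """Direct form: X_k = sum_n x[n] p[n] exp(j*pi/M*(k+1/2)*(n-1/2))."""
    E = np.exp(1j * np.pi / M * (k + 0.5) * (n[None, :] - 0.5))
    return E @ (x * p)

# Assumed window: prototype with alternating block signs folded in.
c = p * (-1.0) ** (n // (2 * M))
# Assumed 64 x 128 modulation matrix.
Mmat = np.exp(1j * np.pi / M * (k + 0.5) * (np.arange(2 * M)[None, :] - 0.5))

def polyphase_slot(x):
    z = x * c                              # step 2: windowing
    u = z.reshape(5, 2 * M).sum(axis=0)    # step 3: u[n] = sum_j z[n + 128 j]
    return Mmat @ u                        # step 4: X = M u

def analyze_frame(samples):
    """Return an array of shape (len(samples) // 64, 64) of subband samples."""
    x = np.zeros(L)
    out = []
    for start in range(0, len(samples), M):
        x[M:] = x[:-M].copy()                    # step 1: discard oldest 64
        x[:M] = samples[start:start + M][::-1]   # newest sample at index 0
        out.append(polyphase_slot(x))
    return np.array(out)

frame = np.random.default_rng(2).standard_normal(2048)
X = analyze_frame(frame)   # 32 time slots of 64 subband samples each
```

The polyphase form replaces a 64 × 640 complex matrix product per time slot by 640 real multiplications, a folding step, and a 64 × 128 product. A frame of 2048 samples yields 2048/64 = 32 slots, in agreement with the figure quoted above.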
The magnitude responses of the first 4 frequency bands (k=0… 3) of the QMF analysis bank are illustrated by Figure 3.10.
Figure 3.10: Magnitude responses of the first 4 bands of the QMF analysis filter bank (the magnitude for k=0 is highlighted).
3.2.5 Hybrid Filterbank for improved frequency resolution
At a sampling rate of 44.1 kHz, the 64-band analysis filter bank results in an effective bandwidth of approximately 344 Hz per band, which is considerably wider than the required spectral resolution at low frequencies. To further improve the frequency resolution, the lower QMF subbands are split by an additional filter bank based on oddly-modulated Qth-band filter banks. Depending on the QMF subband, two types of filter have been defined.
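As a quick check of this figure, assuming the M = 64 bands share the range from 0 to f_s/2 equally, the width of one band is

$$\Delta f = \frac{f_s}{2M} = \frac{44100}{2 \times 64} \approx 344.5\ \text{Hz}.$$

For comparison, the ERB formula used in Section 3.2.6 gives 24.7 (0.00437 × 100 + 1) ≈ 35.5 Hz at f = 100 Hz, roughly a tenth of the QMF band width.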
Table 3.1: Overview of low frequency split
QMF subband p Number of bands Qp Filter
where g^p represents the prototype filter of order 12 in QMF subband p, Q_p the total number of sub-subbands in subband p, q the sub-subband index in QMF channel p, and n the time index.
According to Table 3.1, the first three QMF bands perform sub-subband filtering with Q0=8, Q1=2, and Q2=2. The remaining QMF subbands, which are not filtered, are delay compensated. This delay amounts to 6 QMF subband samples (i.e., $G_0^k(z) = z^{-6}$ for k = 3, …, 63). In addition, to further reduce the complexity of this configuration, some of the filter bank outputs are summed, resulting in a total of 71 output bands. As an example, the magnitude response of the 8-band sub-filter bank is given in Figure 3.11.
Figure 3.11: Magnitude response of the 8-band sub-filter bank. (subband q=0 is highlighted)
3.2.6 Subband Partition
To reduce the complexity and the bitrate of the spatial parameters, the 71 subband signals are grouped into fewer parameter bands. The spatial parameters vary over time and frequency, and psychoacoustic data indicate that a Bark-like or Equivalent Rectangular Bandwidth (ERB)-like frequency scale is appropriate for them. The ERB is given by the following equation:
$$\mathrm{ERB}(f) = 24.7\,(0.00437 f + 1),$$
where f is the center frequency in Hz. Hence, the subband coefficients are non-uniformly divided into 20 individual parameter bands according to such a perceptual frequency scale. In each parameter band, one set of spatial parameters is separately estimated and transmitted to the decoder.
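The following sketch evaluates the ERB formula and shows one way of grouping subbands on such a scale. It is an illustration only: the actual band table of the standard is not reproduced, the grouping uses equal steps on the ERB-rate scale 21.4 log10(0.00437 f + 1) (the Glasberg and Moore form, assumed here), and the function names are hypothetical. For simplicity the example groups the 64 uniform QMF bands at 44.1 kHz rather than the 71 hybrid bands.

```python
import numpy as np

def erb_bandwidth(f_hz):
    """ERB(f) = 24.7 * (0.00437 f + 1), in Hz."""
    return 24.7 * (0.00437 * np.asarray(f_hz, dtype=float) + 1.0)

def erb_rate(f_hz):
    """Assumed ERB-rate scale: approximate number of ERBs below f."""
    return 21.4 * np.log10(0.00437 * np.asarray(f_hz, dtype=float) + 1.0)

def group_bands(centers_hz, num_param_bands=20):
    """Map each subband centre frequency to a parameter-band index."""
    rate = erb_rate(centers_hz)
    edges = np.linspace(0.0, rate.max(), num_param_bands + 1)
    idx = np.searchsorted(edges, rate, side="right") - 1
    return np.clip(idx, 0, num_param_bands - 1)

# Centre frequencies of 64 uniform bands at a 44.1 kHz sampling rate.
centers = (np.arange(64) + 0.5) * 44100.0 / 128.0
param_band = group_bands(centers)
```

With uniform 344 Hz bands, the lowest parameter bands receive no subband at all while many high-frequency subbands share one parameter band, which illustrates why the lowest QMF bands are split further by the hybrid filter bank.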
3.2.7 Spatial Audio Parameters
The Spatial Audio Coding system employs two conceptual modules with which any mapping from M to N channels and back, with N<M, can be described. The system divides the input channels into pairs that are coded by modules which take two input channels and produce one output channel (a Reverse One-To-Two module, R-OTT). In addition, modules that take three input channels and produce two output channels are available (a Reverse Two-To-Three module, R-TTT).
3.2.7.1 R-OTT module
The purpose of the R-OTT module is to create a mono downmix from a stereo input and extract the relevant spatial parameters. For each frequency band, two parameters are computed (assuming input signals x1 and x2).
Channel Level Differences (CLD) – They represent the power ratio of corresponding time/frequency tiles of the input signals, given by:
.
Inter-channel coherence/cross-correlation (ICC) – They represent the similarity measure of the corresponding time/frequency tiles of the input signals, given by:
.
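A minimal sketch of an R-OTT module for one time/frequency tile is given below. It assumes commonly used definitions that are stated here as assumptions: the CLD is 10 log10 of the power ratio of x1 to x2, and the ICC is the real part of the cross-correlation normalized by the geometric mean of the two powers. The averaging downmix and the function name `r_ott` are hypothetical.

```python
import numpy as np

def r_ott(x1, x2):
    """Return (mono downmix, CLD in dB, ICC) for one time/frequency tile."""
    p1 = np.sum(np.abs(x1) ** 2)
    p2 = np.sum(np.abs(x2) ** 2)
    cross = np.sum(x1 * np.conj(x2))
    cld = 10.0 * np.log10(p1 / p2)             # assumed CLD definition
    icc = np.real(cross) / np.sqrt(p1 * p2)    # assumed ICC definition
    downmix = 0.5 * (x1 + x2)                  # hypothetical downmix rule
    return downmix, cld, icc

# Example: x2 is a copy of x1 at half the amplitude.
rng = np.random.default_rng(5)
x1 = rng.standard_normal(128) + 1j * rng.standard_normal(128)
downmix, cld, icc = r_ott(x1, 0.5 * x1)
```

In this example the two inputs are fully coherent, so the ICC equals 1, and the CLD is 10 log10(4), about 6.02 dB.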
3.2.7.2 R-TTT module
The R-TTT module performs a downmix from three (l, c, r) to two (l0, r0) downmix channels, combined with the generation of the associated spatial parameters. It is appropriate for modeling the symmetrically downmixed center from a stereo downmix pair. The TTT module has two modes of operation.
TTT Energy Mode
The first mode uses an energy-based parameterization with channel level difference (CLD) parameters, represented by the energy ratios CLD1 and CLD2.
TTT Prediction Mode
A second mode of operation for the R-TTT module is based on transmitting the up-mix matrix directly. It makes use of the following parameters.
Channel Prediction Coefficients (CPC) – The general purpose of the TTT module at the decoder side is to generate three signals from the transmitted stereo downmix signal pair.
If the up-mix process from the stereo pair (l0, r0) to the signal triplet (l’, r’, c’) is written in matrix notation, the up-mix matrix MCPC is given by:
Here, c1 and c2 represent the transmitted CPC parameters. The CPC parameters aim at an optimum reconstruction of the spatial parameters of the signals l’, r’, and c’ (compared to the spatial parameters of the corresponding signals at the encoder side). In other words, given the up-mix matrix MCPC, the parameters c1 and c2 can be optimized according to several optimization criteria. However, the recovered signal will in general consist of only partially correlated signals. Therefore, there will be a prediction loss.
Inter-channel coherence/cross-correlation (ICC) – Unlike in the R-OTT module, the ICC parameter here describes the prediction loss caused by the CPC parameters. It is the power ratio between the R-TTT module input and the reconstructed signals:
'*
Here, <·> denotes the expected value operator, and * denotes complex conjugation.
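To make the notions of prediction coefficients and prediction loss concrete, the sketch below fits two coefficients by least squares, which is only one of the possible optimization criteria and is not claimed to be the one used in MPEG Surround. It predicts a single target signal from the downmix pair (l0, r0) and reports the residual energy relative to the target energy. The function name `fit_cpc` and the downmix rule in the example are hypothetical.

```python
import numpy as np

def fit_cpc(l0, r0, target):
    """Least-squares c1, c2 with target ~= c1*l0 + c2*r0, plus relative loss."""
    A = np.stack([l0, r0], axis=1)
    coeffs, *_ = np.linalg.lstsq(A, target, rcond=None)
    predicted = A @ coeffs
    loss = np.sum(np.abs(target - predicted) ** 2) / np.sum(np.abs(target) ** 2)
    return coeffs[0], coeffs[1], loss

# Example with three independent signals and an assumed downmix rule.
rng = np.random.default_rng(3)
l, c, r = rng.standard_normal((3, 4096))
l0 = l + c / np.sqrt(2.0)
r0 = r + c / np.sqrt(2.0)
c1, c2, loss = fit_cpc(l0, r0, c)
```

Because the centre signal here is only partially correlated with the downmix pair, the loss stays well above zero; it vanishes only when the target is an exact linear combination of l0 and r0.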
3.2.7.3 Hierarchical Configuration
Encoding 5.1 surround sound into two-channel stereo is particularly attractive in view of its backward compatibility with existing stereo consumer devices. Figure 3.12(a) shows a block diagram of a 5.1-to-2 encoder for such a typical system, consisting of three R-OTT modules and one R-TTT module. The signals L, Ls, C, LFE, R, and Rs denote the left front, left back, center, LFE, right front, and right back channels, respectively. Another example is illustrated in Figure 3.12(b), which shows how R-OTT modules can be connected in a tree structure to form a 5.1-to-1 encoder.
Figure 3.12: (a) 5.1-to-2 MPEG Surround encoder (b) 5.1-to-1 MPEG Surround encoder
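A tree of R-OTT modules for the 5.1-to-1 case can be sketched as follows. The pairing order used here (L with Ls, R with Rs, C with LFE, then left with right, then front with centre) is an assumption for illustration and may differ from the figure; the simplified `r_ott`, which only sums its inputs and reports a CLD, and the parameter labels are likewise hypothetical.

```python
import numpy as np

def r_ott(x1, x2):
    """Hypothetical R-OTT: sum downmix plus the CLD (dB) of the two inputs."""
    eps = 1e-12   # guards against silent channels
    p1 = np.sum(np.abs(x1) ** 2) + eps
    p2 = np.sum(np.abs(x2) ** 2) + eps
    return x1 + x2, 10.0 * np.log10(p1 / p2)

def encode_51_to_1(ch):
    """ch maps 'L', 'Ls', 'R', 'Rs', 'C', 'LFE' to equal-length arrays."""
    params = {}
    left, params["L/Ls"] = r_ott(ch["L"], ch["Ls"])
    right, params["R/Rs"] = r_ott(ch["R"], ch["Rs"])
    centre, params["C/LFE"] = r_ott(ch["C"], ch["LFE"])
    front, params["l/r"] = r_ott(left, right)
    mono, params["lr/c"] = r_ott(front, centre)
    return mono, params

rng = np.random.default_rng(4)
channels = {name: rng.standard_normal(256)
            for name in ["L", "Ls", "R", "Rs", "C", "LFE"]}
mono, params = encode_51_to_1(channels)
```

Five modules reduce six channels to one, and each module contributes one set of spatial parameters per parameter band to the side information.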
3.3 Combination of MPEG Surround and HE-AAC
As mentioned previously, the MPEG Surround coder can be connected to a state-of-the-art core coder. The MPEG Surround coder interfaces to the downmix channels by means of a QMF-domain representation identical to that standardized in the SBR tool of MPEG-4 HE-AAC. When the spatial coder is combined with HE-AAC, this QMF representation is therefore directly available as an intermediate signal in the HE-AAC coder. Figure 3.13 shows the block diagram of the MPEG Surround/HE-AAC encoder. The output signals of the hybrid synthesis filter bank are fed directly to the SBR tool.
Figure 3.13: Interconnection of MPEG Surround and HE-AAC encoder (blocks: QMF analysis, hybrid analysis, downmix, spatial parameter estimation, hybrid synthesis, SBR encoder, QMF synthesis, AAC encoder)
Chapter 4
DSP Implementation Environment
We select the TI DSP platform to implement the MPEG Surround encoder. The DSP baseboard (SMT395) is made by Sundance and houses a Texas Instruments TMS320C6416T DSP chip and a Xilinx Virtex-II Pro FPGA. Because our implementation is mainly in software, the discussion in this chapter focuses on the DSP system environment, the DSP chip, and its features. Then the software development tool, Code Composer Studio (CCS), is introduced. Finally, some important acceleration techniques and features that can reduce stalls or hazards on the DSP system are described.
4.1 DSP Baseboard (SMT395)
The block diagram of the Sundance DSP baseboard system (SMT395) is shown in Figure 4.1 [18]. SMT395 utilizes the signal processing technology to provide extreme processing flexibility and high performance. Some important features of SMT395 are listed as follows.
1GHz TMS320C6416T fixed point DSP processor with L1, L2 cache and SDRAM.
8000MIPS peak performance.
Xilinx Virtex-II Pro FPGA. XC2V920-6 in FF896 package.
Two 32-bit-wide Sundance High Speed Bus (SHB) ports (100 MHz, 200 MHz).
Eight 2 Gbit/s Rocket Serial Links (RSL) for inter-module communication.
8 MB flash ROM for configuration and booting.
Six comm-ports, up to 20 MB per second, for inter-DSP communication.
JTAG diagnostics port.
Figure 4.1: The block diagram of the Sundance DSP Baseboard system [18] (main blocks: C6416T DSP, Virtex-II Pro FPGA, 256 MB SDRAM on the 133 MHz EMIFA, 8 MB flash, comm-ports, SHB and RSL connectors, JTAG header)
4.2 DSP chip
The TMS320C6416T DSP uses the VelociTI.2 architecture [19]. VelociTI.2 is a high-performance, advanced very long instruction word (VLIW) architecture, making it an