Overview of the thesis - 環繞MPEG編解碼器之增速及其在TI DSP平台上的實現

Chapter 1 Introduction

1.2 Overview of the thesis

This thesis focuses on developing fast methods for improving the encoding speed of the MPEG Surround codec on the DSP platform. It is organized as follows. In Chapter2, we introduce the spatial hearing phenomena perceived by human. Chapter 3 discusses the algorithm of the MPEG Surround encoder. In Chapter 4, we describe the DSP development environment and the acceleration methods for the TI C6416T DSP system. Chapter 5 discusses our proposed algorithms to accelerate the MPEG Surround encoder on the DSP platform. Finally, we give a conclusions and future work in Chapter 6.

Chapter 2 Spatial Hearing

The sense of hearing creates an auditory image of an external environment. Spatial hearing is the process by which the auditory image is perceived at a particular place. The physical source is called the “sound event”, and the perceived sound is the “auditory event” or

“auditory object”. Since our ability to distinguish and segregate the source is mainly based on the localization of sound source in space, this chapter deals with sound source localization on a horizontal, which provides the base for the Spatial Audio Coding. We will introduce the parameters that are relevant to sound perception and discuss how these phenomena related to the commonly used audio playback systems. The details of spatial hearing can be found in [1].

2.1 Spatial Hearing with One Sound Source

In order to understand how the auditory system distinguishes the direction of a source, the properties of the signals at the ear entrances have to be considered. Generally, the ear input signals can be viewed as being filtered versions of the source signal. The head-related transfer function (HRTF) describes the path of a given sound source to the ear entrances.

Figure 2.1 schematically shows a human head and a distant sound source x(n) at angleθ. The entrance signals of the left and right ears are xL(n) and xR(n) respectively and they are denoted by two filtered signals through the HRTFs with h_L(n) and h_R(n), respectively. The distances from the source to the left and the right ears are dL and dR.

θ

) (n

x_L x_R(n) ) (n h_R dL d_R

L d

d −

) (n h_L

) (n x

Figure 2.1: HRTFs for left and right ears

2.1.1 Interaural Time Difference and Interaural Level Difference

As a result of different path lengths to the ear entrances, dL-dR, there is a difference in the arrival time between both ear entrances, and denoted as interaural time difference (ITD). In addition, the wave fronts from the source impinge upon the left ear directly while the sound received by the right ear is diffracted around the head. This diffraction causes an intensity difference between the left and right ear entrance signals, denoted as interaural level difference (ILD).

In 1907, Lord Rayleigh proposed the duplex theory, which provides an explanation for the ability to localize sounds in the horizontal plane based on the notion of the ITD and ILD (when considering only the frontal directions, -90°≦θ≦90°).

Interaural Time Difference (ITD)

ITD is related to the phase of the HRTF ratio and is generally thought to be one of the most important localization cues. It is used to localize low frequency sounds (below 1.5 kHz). In such a frequency range, the wavelength of the sound source is greater than the time delay between the ears. Therefore, there is a phase difference between the sound waves to provide

acoustic localization cues. In contrast, at a high frequency, because the wavelength of the sound source is shorter than the distance between the ears, a localization error may occur. The ITD can be estimated by the normalized cross-correlation function as follows.

) ,

Interaural Level Difference (ILD)

ILD, derived from the amplitude of the HRTF ratio.



It is a complicated function of frequency because for any given source positions the peaks and valleys in the HRTF may appear at different frequencies in two ears. Moreover, the ILD is small at low frequencies, regardless of the source position, because the dimensions of the head and pinna are small when compared to the wavelengths of sound at frequencies below about 1.5 kHz. For these reasons, the ILD at individual frequency bands are more likely to be useful localization cues than the overall ILD.

When considering only the frontal direction, a specific ITD-ILD pair can be associated with the perceptual source direction. Figure 2.2 shows an experimental setup for generating left and right ear entrance signals, xL(n) and xR(n), with a single source signal x(n). Different ITD-ILD pairs determine the different locations of the auditory events which appear in the frontal sections of the upper head. The ITD is determined by the delays tL-tR and ILD is equal to 20log10(aL/aR) dB. When left and right signals have the same level and no delay (i.e. ILD=0, ITD=0), an auditory event occurs in a location central to the listener (Region 1). By increasing the intensity of the right side, the auditory event moves from Region 1 to Region 2.

When only the left signal is active, the auditory event appears at the left ear as shown in Region 3. The ITD can be used to control the perceptual position of the auditory event in a similar way.

Figure 2.2: Generating the location of auditory event with specific ITD and ILD [1]

2.1.2 Interaural Coherence (IC)

In a reverberant environment, additional effects such as reflection, diffraction and resonance may cause the signals between left and right ears to be incoherent. In order to measure the degree of similarity between the left and right ear entrance signals, another spatial parameter, interaural coherence (IC), is considered. This coherence is derived from the maximum absolute value of the normalized cross-correlation function:

) ( ) (

max 2 2

d n x n x

d n x n x IC

L R n

L R

d +

∑

^∞

−∞

= ,

where d corresponds to a small delay.

Figure 2.3 shows that the width (or spatial diffuseness) of the perceived auditory spatial image mostly depends on the IC cues. When the ear entrance signals are identical (IC=1), a compact auditory event is perceived, as illustrated in Region 1. On the other hand, the width of the auditory event increases as the IC decreases (Region 2 and 3). Finally, when the ear entrance signals are independent (IC=0), two distinct auditory events are perceived at the sides (Region 4).

Figure 2.3: Width of auditory event [2].

2.2 Spatial Hearing with Two Sound Sources

The most commonly used consumer playback system for spatial audio systems is the stereo loudspeaker setup. Thus, it is interesting to investigate the spatial hearing with two sound sources. The previous section shows the perceptual effects of ITD, ILD, and IC cues.

Similar to these interaural cues, there are three properties of the signals between two loudspeakers

Inter-channel time difference (ICTD)

Inter-channel level difference (ICLD)

Inter-channel coherence (ICC)

In the following paragraphs a few phenomena related to ICTD, ICLD and ICC are reviewed for two sources located in front of a listener.

Figure 2.4: ICTD and ICLD between a pair of coherent source signals [1]

Figure 2.4 illustrates different locations of the perceived auditory events controlled by the ICLD-ICTD pair. When the signals emitted from two loudspeakers are identical (ICTD=0 and ICLD=0), an auditory event appears in the center between the two sources as indicated as Region 1. By increasing the intensity of the right channel, the auditory event moves from Region 1 to Region 2. In the extreme case that only the signal on the left is active, the auditory event appears at the left source position (Region 3). As with the IC, the ICC between two loudspeakers is also related to the width of the auditory event, as shown in Figure 2.5.

When the ICC between the two loudspeakers decreases, the width of the auditory event increases.

Figure 2.5: ICC with width of the auditory event [1]

2.3 Multi-channel Environment

Multi-channel audio is the name for a variety of techniques used to expand and enrich the sound of audio playback by recording additional sound channels that can be reproduced through additional speakers. Such systems are known as the “home theater systems” for movies. Figure 2.6 illustrates the standard loudspeaker setup for a 5.1 five-to-one (5.1) surround audio system. In the front, three loudspeakers are located at angles -30°, 0°, and +30°. The two surround loudspeakers at the rear, offset by ±110°, are intended to provide the important lateral signal components related to spatial impression. Additionally, one low-frequency effects (LFE) channel is used to carry extremely low sub-bass cinematic effects.

Figure 2.6: Standard 5.1 surround audio system

2.3.1 Generating Sound for 5.1 Systems

Sound source location can often be reproduced successfully using multi-channel recordings. Techniques applied for recording or mixing a two-channel stereo can be applied to a specific channel pair of the five main loudspeakers of a 5.1 setup. For example, to obtain an auditory event from a specific direction, the loudspeaker pair enclosing the desired direction is selected and the corresponding signals are recorded or generated in a way similar to that of the stereo case (resulting in auditory events between the two selected loudspeakers).

2.4 Conclusions

The main point emphasized here is that the perceptual direction of a sound source is determined by ITD, ILD, ICTD, and ICLD. The other parameters, IC and ICC, are used to measure the width (spatial diffuseness) of perceived auditory spatial image. These parameters also play an important role for capturing and generating sound in spatial audio systems, such as stereo or multi-channel audio playback.

Chapter 3 MPEG Surround

In this chapter, we will briefly review several stereo and multi-channel algorithms which are related to the MPEG surround coding, including Intensity Stereo Coding (ISC), Parametric Stereo coding (PS), and Binaural Cue Coding (BCC). Then we will introduce the basic concepts and major modules of the MPEG Surround coder. It is used for compressing a multi-channel audio signal at very low bitrate. It provides an extremely efficient method for coding multi-channel sounds. Finally, an inter-connection of MPEG Surround and MPEG-4 HEAAC structure will be described.

3.1 Related Techniques

3.1.1 Intensity Stereo Coding [3]

Intensity stereo coding (ISC) is a joint-channel coding technique that is a part of the ISO/IEC MPEG family of standards [4]. By removing perceptually irrelevant information between audio channel pairs, it reduces the bit-rate needed for encoding stereo or multi-channel signals. It is more efficient than coding of each channel separately. ISC exploits the fact that the human hearing system is sensitive to low frequency signals at both amplitude and phase; it also sensitive to amplitude of high frequency signals, but low sensitive to phase.

Thus, at high frequencies, the original left and right subband signals are replaced by a sum signal and a direction angle (azimuth) which controls the intensity stereo position of the auditory event created at the decoder.

3.1.2 Parametric Stereo Coding [5] [6]

Since ISC is prone to aliasing artifacts and typically is only applied for higher frequency bands, Parametric Stereo (PS) technology is proposed to overcome these limitations. PS is standardized in MPEG-4 and is the next major step to enhance the efficiency of audio compression for low bit-rate stereo signals. It is in conjunction with the context of the MPEG-4 HE-AAC (aacPlus) codec, known as HE-AAC v2, or Enhanced aacPlus.

PS employs a dedicated (complex-valued, over-sampled) filter bank to avoid artifacts due to aliasing resulting from the spectral modification in generating the output channels. In additions, it synthesizes not only the intensities but also the phase differences and coherence between the output channels. Due to these improvements, the PS tool can operate on the full audio bandwidth. In such a system, the stereo signal (a pair of signal) is reconstructed from the transmitted mono signal with the help of the stereo parameters.

3.1.3 Binaural Cue Coding [7] [8]

The Binaural Cue Coding (BCC) approach can be viewed as a generalization of the parametric stereo idea, delivering multi-channel output (with an arbitrary number of channels) from a single audio channel plus some side-information. Figure 3.1 shows the generalized block diagram of BCC encoder and decoder.

ˆx1

ˆx2

xˆN

Figure 3.1: Generic Scheme for binaural cue coding (BCC)

In the encoder, multi-channel input channels are combined into a single sum signal by using a downmix process. At the same time, the multi-channel sound image is extracted and parameterized as BCC side-information. The decoder is able to reproduce multi-channel output signals by these data. Because BCC requires only a few bit-rates (2 kb/s) to encode the side-information, the total bit-rate is only slightly higher than what is required to represent a mono audio signal. Another advantage of this scheme is its backwards compatibility with non-multi-channel audio systems. For the receivers that do not support multi-channel sound audio, it simply ignores the side-information and decodes the sum signal.

3.1.3.1 Estimation of BCC parameters

As shown in Figure 3.2(a), the BCC parameters including the inter-channel level difference (ICLD), the inter-channel time difference (ICTD), and the inter-channel coherence (ICC) are estimated in the subband domain. The estimation process is applied independently to each subband.

)

1(n x

)

2(n x

) (n x_N

)

1(k X

)

2(k X

) (k X_N

(a) (b)

Figure 3.2: BCC parameters estimation

Figure 3.2(b) shows an example of 5-channel environment. The ICTD and ICLD between a reference channel (e.g. Left channel) and the other channels are estimated. One single ICC is estimated between the channel pair with the largest power, to describe the overall coherence among all audio channels.

3.1.3.2 Synthesis of BCC parameters

BCC synthesis scheme is shown in Figure 3.3. First, the downmixed sum signal is converted into the frequency domain via a filter bank. For each output channel, individual time delays and scale factors are imposed on the spectral coefficients to re-synthesis ICTD and ICLD respectively. Followed by a coherence synthesis process, ICC is synthesized.

Finally, all output channels are converted back into the time domain signals.

t₁(k) FB

IFB IFB IFB

h_N(k) h₂(k) h₁(k)

t_N(k) t₂(k)

a₁(k)

a₂(k)

a_N(k) )

s S(k)

) ˆ (

1 k X

) ˆ (

2 k X

) ˆ (k X_N

) ˆ₁(n x

) ˆ₂(n x

) ˆ (n x_N

Figure 3.3: BCC synthesis

3.2 MPEG Surround

MPEG Surround can be viewed as an enhancement of the techniques we previously mention, such as a multi-channel extension of Parametric Coding or a generalized version of BCC. In the following sections, we will describe the standardization process of MPEG Surround and its structure.

3.2.1 MPEG Surround Standardization Process

Motivated by the demonstrated potential of what was then called the Spatial Audio Coding approach, ISO/MPEG started a new work item on the parametric coding of multi-channel audio signals by issuing a CfP (Call for proposal) on Spatial Audio Coding in

March 2004 [9]. Four responses were received and evaluated with a number of performance measures including subjective quality of the decoded multi-channel audio signal, the subjective quality of the downmix signals, the spatial cue side information bitrate and the other parameters, such as additional functionality and computational complexity.

As a result of these extensive evaluations, MPEG committee decided that the technology that would be the starting point in the standardization process, called Reference Model 0 (RM0), would be a combination of the submissions from two proponents: Fraunhofer IIS/Agere Systems and Coding Technologies/Philips. These systems not only outperformed the other submissions but also showed complementary performance in terms of per-item quality, bitrate and complexity. Consequently, the merged RM0 (now called MPEG Surround) is designed to combine the best features of both individual systems and was found to fully meet the performance expectation. RM0 provides sound quality substantially the surpassing existing matrixed surround solutions, even for the transmission of a mono downmix signal or for the spatial cue bitrates as low as 6kbit/s. It serves as the basis for the further technical development within the MPEG-4 audio. An extended description of the technology can be found in [10] and [11].

3.2.2 MPEG Surround Reference Model 0 Scheme

Rather than performing a discrete coding of the individual audio input channels, Spatial Audio Coding is a technique to efficiently code a multi-channel audio signal as stereo (or even monaural) signal plus a small amount side information for multi-channel spatial image parameters. Figures 3.4 and 3.5 show the block diagram of the MPEG Surround RM0 encoder and decoder, respectively. The input signals are processed by the analysis filter banks to decompose the input signals into separate frequency bands. The frequency selectivity of these filter banks is tailored specifically towards mimicking the frequency resolution of the human auditory system. Then the MPEG Surround encoder captures the spatial image of a multi-channel audio signal and condenses it into a compact set of parameters. These

parameters typically include level/intensity differences and measures correlation/coherence between the audio channels. In parallel, a stereo (or monaural) downmix signal of the sound material is created. The downmix signal is transformed back to the time-domain signal by using the synthesis filter banks. And it is transmitted to the decoder together with the spatial information. On the decoder side the transmitted downmix signal is expanded into high quality multi-channel outputs based on the known spatial parameters.

Figure 3.4: MPEG Surround Encoder Overview [15]

ˆx1

ˆx2

xˆN

ˆs1

ˆs2

Figure 3.5: MPEG Surround Decoder Overview [15]

Moreover, to achieve a higher compression rate, a MPEG Surround Coding can be combined with a conventional state-of-the-art coder (Audio Encoder and Audio Decoder in Figures 3.4 and 3.5). The downmix signal is encoded with a core coder such as the MPEG-1

Layer III (mp3), MPEG-2/4 AAC or MPEG-4 High Efficiency AAC, or it could even be PCM.

In this way, MPEG Surround coder acts as a pre-process to the audio encoder, and as a post-process to the core decoder. Thus, the MPEG Surround Coding is able to provide complete backward compatibility with the non-multi-channel audio systems using the downmix signal: A receiver device without a MPEG Surround decoder will simply decode and present downmix signal.

3.2.3 Time to Frequency Transform

In the human auditory system, the processing of binaural cues is performed on a non-uniform frequency scale. Since the spatial parameters are estimated (at the encoder side) and applied (at the decoder side) as a function of time and frequency, both the encoder and decoder require a transform or filter bank that resemble this non-uniform scale. Furthermore, the transform or filter bank should be over-sampled, since time- and frequency-dependent signal modifications will be made to the signals which would lead to audible aliasing distortion in a critically-sampled system.

It employs a two-stage filter bank to satisfy the above requirments. Figure 3.6 and Figure 3.7 shows the structure of the hybrid QMF analysis and synthesis filter banks, respectively.

The first-stage filter bank is a complex-modulated Quadrature Mirror Filter (QMF) bank to obtain a uniform, over-sampled, frequency representation of the audio signal. The signals of the lowest QMF subbands are subsequently fed through a second complex-modulated filter bank to provide a higher resolution of low frequencies.

)

Figure 3.6: Hybrid QMF analysis filter bank providing 71 output bands. The input is fed into a 64-band analysis QMF bank (dashed box). The three lower QMF subbands are further split to increase low frequency resolution (see shadowed box).

Figure 3.7: Hybrid QMF synthesis filter bank using 71 input bands. The low frequency coefficients are simply added (see shadowed box) prior to the synthesis with the QMF.

3.2.4 Analysis Quadrature Mirror Filter (AQMF) Bank

The first filter bank is compatible with the filter bank used in the SBR algorithms [17].

The subband signals are generated by this filter bank are obtained by convolving the input signal with a set of analysis filter impulse response h_k[n] given by:







 



 



 −



 



 +

= 2

1 2

exp 1 ] [ ]

[ k n

M n j

p n

h_k π

where p[n] represents the low-pass prototype filter impulse with 640 filter length, M represents the number of frequency bands (M=64) and k, the subband index (k=0,…,M-1).

The filtered outputs are subsequently down sampled by a factor M resulting in the down-sampled QMF outputsX_k[n]=(x∗h_k)[Mn].

The equation above is purely analytical. In practice, the computational complexity can be reduced by using the poly-phase decomposition method as described in the following steps, in which an array x consisting of 640 time domain input samples are assumed. Higher indices in the array correspond to older samples. Figure 3.8 shows the QMF analysis window.

core coder samples

0 1024 2048

core coder signal 640 samples

2624 576

(frame size 2048)

29 30 2

Figure 3.8: QMF analysis windowing [17]. Index 0 to 31 represent 32 windows.

The QMF process is as follows.

1. Shift the samples in the array x by 64 positions. The oldest 64 samples are discarded and the 64 new samples are stored in positions 0 to 63.

在文檔中環繞MPEG編解碼器之增速及其在TI DSP平台上的實現 (頁 14-0)