MPEG Surround Codec Acceleration and Implementation on TI DSP Platform


National Chiao Tung University
Department of Electronics Engineering & Institute of Electronics
Master's Thesis

MPEG Surround Codec Acceleration and Implementation on TI DSP Platform

Student: Chih-Kang Han
Advisor: Dr. Hsueh-Ming Hang

June 2007

MPEG Surround Codec Acceleration and Implementation on TI DSP Platform

Student: Chih-Kang Han
Advisor: Dr. Hsueh-Ming Hang

A Thesis Submitted to the Department of Electronics Engineering & Institute of Electronics, College of Electrical and Computer Engineering, National Chiao Tung University, in Partial Fulfillment of the Requirements for the Degree of Master in Electronics Engineering.

June 2007
HsinChu, Taiwan, Republic of China

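MPEG Surround Codec Acceleration and Implementation on TI DSP Platform

Student: Chih-Kang Han. Advisor: Dr. Hsueh-Ming Hang.
Department of Electronics Engineering & Institute of Electronics, National Chiao Tung University

Abstract (Chinese version)

With the development of audio systems, multi-channel audio systems have come into wide use in consumer electronic products such as DVD and digital audio broadcasting services. MPEG Surround is a standard defined by ISO/IEC MPEG that can compress multi-channel audio signals at very low bit rates. In this thesis, we provide faster algorithms for the modules of the MPEG Surround encoder that are well suited to acceleration on a DSP system. We first analyze the computational complexity of the MPEG Surround encoder on the DSP platform and find that the filterbanks consume the largest share of the computation in the whole system. We therefore use the discrete cosine/sine transforms and the fast Fourier transform to implement the filterbank operations and reduce the amount of computation. In the low frequency effects (LFE) channel, we also avoid unnecessary computation. In addition, targeting the DSP architecture, we apply acceleration methods such as fixed-point arithmetic, the L2 cache, the TI DSP intrinsic functions, loop unrolling, and macros. With these methods, the MPEG Surround encoder is accelerated by a factor of about 500 on the TI DSP system. We also combine the MPEG Surround encoder with the MPEG-4 HEAAC encoder, implement the combination on the DSP platform, and simplify the structure by using a downsampled SQMF (synthesis quadrature mirror filter) bank. Compared with the original structure, the simplified structure raises the overall speed to about 1.8 times.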

MPEG Surround Codec Acceleration and Implementation on TI DSP Platform

Student: Chih-Kang Han. Advisor: Dr. Hsueh-Ming Hang.
Department of Electronics Engineering & Institute of Electronics, National Chiao Tung University

Abstract

With the fast development of audio systems, multi-channel audio systems are commonly used in consumer electronic products such as DVD audio and digital audio broadcasting services. MPEG Surround is an efficient audio coding standard defined by the ISO/IEC MPEG (Moving Picture Experts Group) committee. It is able to compress multi-channel audio signals at a very low bit-rate. In this thesis, we propose several methods to speed up MPEG Surround implemented on a DSP platform. We analyze the complexity of the MPEG Surround encoder and find that, among all modules, the filterbanks require the most operational cycles on the DSP. Hence, we adopt and modify several fast algorithms based on the type-IV DCT/DST and the FFT for implementing the filterbanks. We also eliminate the unnecessary computation in the LFE (low frequency effects) channel. For the DSP implementation, we use several DSP code acceleration techniques, such as fixed-point arithmetic, the L2 cache, intrinsic functions, loop unrolling, and DSP macro functions. The experimental results show that the modified MPEG Surround encoder is about 500 times faster than the original version. Furthermore, we implement the combination of the accelerated MPEG Surround encoder and the HEAAC encoder, and simplify the structure by using the downsampled SQMF bank. Compared with the original structure, the simplified encoder is about 1.8 times faster.

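Acknowledgments

For the successful completion of this thesis, I would first like to thank my advisor, Professor Hsueh-Ming Hang. Throughout these two years of graduate study, Professor Hang always encouraged us with a positive attitude. He not only guided our research but also cared about our daily lives, and his earnest and steady approach to research has benefited me greatly. I thank the Communication Electronics and Signal Processing Laboratory (commlab) for providing ample software and hardware resources and a good environment, so that I lacked nothing during my research. I also thank the senior students, classmates, and junior students of the laboratory. The senior students 繼大, 育彰, 鴻志, and 建志 often answered my questions about coursework, which kept my research going smoothly. My classmates 凱庭, 育成, 浩廷, 耀仚, 介遠, 柏昇, 政達, 耀鈞, and 錫祺, who worked hard alongside me in the laboratory, accompanied me during these two years with constant discussion and mutual encouragement. I also thank the other good friends who helped and supported me. Finally, I thank my parents and my older brother. With your care and support, I was able to complete my studies without distraction.

Thank you to all the teachers, family members, and friends who accompanied me through this period!

Chih-Kang, National Chiao Tung University, Hsinchu, June 2007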

Contents

Abstract (Chinese version)
Abstract
Acknowledgments
Contents
List of Tables
List of Figures
Chapter 1 Introduction
  1.1 Introduction and Motivation
  1.2 Overview of the Thesis
Chapter 2 Spatial Hearing
  2.1 Spatial Hearing with One Sound Source
    2.1.1 Interaural Time Difference and Interaural Level Difference
    2.1.2 Interaural Coherence (IC)
  2.2 Spatial Hearing with Two Sound Sources
  2.3 Multi-channel Environment
    2.3.1 Generating Sound for 5.1 Systems
  2.4 Conclusions
Chapter 3 MPEG Surround
  3.1 Related Techniques
    3.1.1 Intensity Stereo Coding [3]
    3.1.2 Parametric Stereo Coding [5] [6]
    3.1.3 Binaural Cue Coding [7] [8]
  3.2 MPEG Surround
    3.2.1 MPEG Surround Standardization Process
    3.2.2 MPEG Surround Reference Model 0 Scheme
    3.2.3 Time to Frequency Transform
    3.2.4 Analysis Quadrature Mirror Filter (AQMF) Bank
    3.2.5 Hybrid Filterbank for Improved Frequency Resolution
    3.2.6 Subband Partition
    3.2.7 Spatial Audio Parameters
  3.3 Combination of MPEG Surround and HE-AAC
Chapter 4 DSP Implementation Environment
  4.1 DSP Baseboard (SMT395)
  4.2 DSP Chip
    4.2.1 Central Processing Unit (CPU)
    4.2.2 Memory and Peripherals
  4.3 TI DSP Code Development Environment
    4.3.1 Code Composer Studio (CCS)
    4.3.2 Code Development Flow
  4.4 DSP Code Acceleration Methods
    4.4.1 Compiler Option Setting
    4.4.2 Fixed-point Coding
    4.4.3 Loop Unrolling
    4.4.4 Intrinsics and Packet Data Processing
    4.4.5 Register and Memory Arrangement
    4.4.6 Using Macros
    4.4.7 Linear Assembly
Chapter 5 Acceleration of MPEG Surround Encoder on DSP Platform
  5.1 Software Environment
    5.1.1 MPEG Surround RM0 Reference Software
    5.1.2 Command Line Switches
  5.2 MPEG Surround Encoder Complexity Analysis
  5.3 Memory System
  5.4 Floating-point to Fixed-point Conversion
    5.4.1 Word-length Determination
    5.4.2 Simulation Results on DSP
  5.5 Fast QMF Bank Algorithms
    5.5.1 Problem Definition
    5.5.2 Analysis Quadrature Mirror (AQMF) Bank
    5.5.3 N/2-point FFT Algorithm for DCT-IV and DST-IV
    5.5.4 Using DSP Library
    5.5.5 QMF Bank Acceleration Results
  5.6 Fast Hybrid Filterbank Algorithms
    5.6.1 Fast Analysis Hybrid Filterbank
    5.6.2 Analysis Hybrid Filterbank Implementation Results
  5.7 LFE Channel Acceleration
  5.8 Implementation of MPEG Surround-HEAAC Encoder
    5.8.1 HEAAC Encoder
    5.8.2 Complexity Analysis
    5.8.3 Simplified MPEG Surround-HEAAC Structure
  5.9 Experiments and Acceleration Results
    5.9.1 MPEG Surround Encoder (without HEAAC)
    5.9.2 MPEG Surround-HEAAC Encoder
Chapter 6 Conclusions and Future Work
  6.1 Conclusions
  6.2 Future Work
References
Appendix A N/2-point FFT Algorithm for DCT-IV and DST-IV
  A.1 Relationship to O2DFT
  A.2 Fast Algorithm for Symmetric O2DFT
Autobiography

List of Tables

Table 3.1: Overview of low frequency split
Table 4.1: Options that control optimization [22]
Table 4.2: Processing time on the C64x DSP for different data types
Table 4.3: Comparison between rolling and unrolling
Table 5.1: Tree structure of the MPEG Surround reference software encoder [10]
Table 5.2: Bitrates of different argument settings
Table 5.3: Cycles on different simulators
Table 5.4: The effect of using L2 cache memory
Table 5.5: Performance of fixed-point codes
Table 5.6: SNR due to fixed-point conversion
Table 5.7: Comparison of C-compiled FFT and DSPLIB FFT
Table 5.8: The acceleration result of the AQMF bank
Table 5.9: The acceleration result of the SQMF bank
Table 5.10: The acceleration result of the analysis hybrid filterbank
Table 5.11: Acceleration result of the LFE channel
Table 5.12: Profiling of the MPEG Surround-HEAAC encoder
Table 5.13: Reduction ratio of using the downsampled SQMF bank
Table 5.14: The final acceleration result of the MPEG Surround encoder
Table 5.15: Profile of the proposed MPEG Surround encoder on the C6416 simulator
Table 5.16: Acceleration result of the MPEG Surround-HEAAC encoder
Table 5.17: Profiling of the proposed MPEG Surround-HEAAC encoder on the C6416 simulator

List of Figures

Figure 2.1: HRTFs for left and right ears
Figure 2.2: Generating the location of auditory event with specific ITD and ILD [1]
Figure 2.3: Width of auditory event [2]
Figure 2.4: ICTD and ICLD between a pair of coherent source signals [1]
Figure 2.5: ICC with width of the auditory event [1]
Figure 2.6: Standard 5.1 surround audio system
Figure 3.1: Generic scheme for binaural cue coding (BCC)
Figure 3.2: BCC parameters estimation
Figure 3.3: BCC synthesis
Figure 3.4: MPEG Surround encoder overview [15]
Figure 3.5: MPEG Surround decoder overview [15]
Figure 3.6: Hybrid QMF analysis filter bank providing 71 output bands. The input is fed into a 64-band analysis QMF bank (dashed box). The three lower QMF subbands are further split to increase low frequency resolution (see shadowed box).
Figure 3.7: Hybrid QMF synthesis filter bank using 71 input bands. The low frequency coefficients are simply added (see shadowed box) prior to the synthesis with the QMF.
Figure 3.8: QMF analysis windowing [17]. Indices 0 to 31 represent 32 windows.
Figure 3.9: Coefficients of the QMF bank window
Figure 3.10: Magnitude responses for the first 4 bands of the QMF analysis filter bank (the magnitude for k=0 is highlighted)
Figure 3.11: Magnitude response of the 8-band sub-filter bank (subband q=0 is highlighted)
Figure 3.12: (a) 5.1-to-2 MPEG Surround encoder (b) 5.1-to-1 MPEG Surround encoder
Figure 3.13: Interconnection of MPEG Surround and HEAAC encoder
Figure 4.1: The block diagram of the Sundance DSP Baseboard system [18]
Figure 4.2: Block diagram of TMS320C6x DSP [23]
Figure 4.3: C64x DSP chip architecture [23]
Figure 4.4: Code development flow [20]
Figure 4.5: Intrinsic functions of the TI C6000 series DSP (partial list) [20]
Figure 5.1: Profiling of the MPEG Surround encoder on the C64xx simulator
Figure 5.2: Profiling of the MPEG Surround encoder on the C6416 simulator
Figure 5.3: Data word-lengths of the fixed-point encoder
Figure 5.4: Profiling of the MPEG Surround encoder on the C6416 simulator (cache enabled)
Figure 5.5: Signal flow graph of fast analysis QMF (real part), where the dotted lines denote subtract operations
Figure 5.6: Signal flow graph of fast analysis QMF (imaginary part), where the dotted lines denote subtract operations
Figure 5.7: Block diagram of the fast DCT-IV and DST-IV [28]
Figure 5.8: TI complex FFT library API [30]
Figure 5.9: Signal flow graph of the fast analysis hybrid filterbank
Figure 5.10: The spectrum of the front-left channel signal
Figure 5.11: The spectrum of the LFE channel signal
Figure 5.12: Simplified analysis filterbanks for the LFE channel
Figure 5.13: Interconnection of the MPEG Surround-HEAAC encoder
Figure 5.14: The structure of the simplified MPEG Surround-HEAAC encoder

Chapter 1 Introduction

1.1 Introduction and Motivation

Owing to recent advances in digital audio coding technology, audio devices such as handsets and MP3 players play an important role in our daily life. Today, the vast majority of audio playback equipment uses the traditional two-channel presentation (stereo). Stereo has been a mainstream consumer format for more than 40 years, so it is not surprising that there is a search for new technologies that further enhance the listener experience. The move towards more reproduction channels ("multi-channel audio" or "surround sound") is clearly visible in the marketplace. This trend is driven by movie sound and multi-channel DVDs. However, the complexity and bitrate of traditional audio coding schemes scale with the number of audio channels. What is needed, therefore, is a high-quality, low-bitrate audio coding mechanism that can serve both users of conventional stereo equipment and users of next-generation multi-channel equipment.

MPEG stands for the "Moving Picture Experts Group". It is a group working under the directives of the International Organization for Standardization (ISO) and the International Electrotechnical Commission (IEC). The group concentrates on defining standards for coding moving pictures, audio, and related data. MPEG has produced many low-bitrate audio coding standards, such as MPEG-1 Layer III (MP3), MPEG-4 Advanced Audio Coding (AAC), and MPEG-4 High-Efficiency AAC (HEAAC). The most recent effort within MPEG is to define a new state-of-the-art standard, "MPEG Surround", which provides an extremely efficient method for coding multi-channel sound via the transmission of a compressed stereo (or even mono) audio program plus a low-rate side-information channel. In this way, backward compatibility with pervasive stereo playback systems is retained, while the side information permits next-generation players to present a high-quality multi-channel surround experience.

In this thesis, we implement the MPEG Surround encoder on a TI DSP platform. The complexity of MPEG Surround is very high because of the new technologies it employs, so we adopt several algorithms to reduce its complexity significantly.

1.2 Overview of the thesis

This thesis focuses on developing fast methods for improving the encoding speed of the MPEG Surround codec on the DSP platform. It is organized as follows. Chapter 2 introduces the spatial hearing phenomena perceived by humans. Chapter 3 discusses the algorithm of the MPEG Surround encoder. Chapter 4 describes the DSP development environment and the acceleration methods for the TI C6416T DSP system. Chapter 5 discusses our proposed algorithms for accelerating the MPEG Surround encoder on the DSP platform. Finally, Chapter 6 gives the conclusions and future work.

Chapter 2 Spatial Hearing

The sense of hearing creates an auditory image of an external environment. Spatial hearing is the process by which the auditory image is perceived at a particular place. The physical source is called the "sound event", and the perceived sound is the "auditory event" or "auditory object". Since our ability to distinguish and segregate sources is mainly based on the localization of the sound sources in space, this chapter deals with sound source localization in the horizontal plane, which provides the basis for Spatial Audio Coding. We introduce the parameters that are relevant to sound perception and discuss how these phenomena relate to commonly used audio playback systems. The details of spatial hearing can be found in [1].

2.1 Spatial Hearing with One Sound Source

In order to understand how the auditory system distinguishes the direction of a source, the properties of the signals at the ear entrances have to be considered. Generally, the ear input signals can be viewed as filtered versions of the source signal. The head-related transfer function (HRTF) describes the path from a given sound source to the ear entrances. Figure 2.1 schematically shows a human head and a distant sound source x(n) at angle θ. The entrance signals of the left and right ears, xL(n) and xR(n), are the source signal filtered by the HRTFs hL(n) and hR(n), respectively. The distances from the source to the left and right ears are dL and dR.

[Figure 2.1: HRTFs for left and right ears, showing the source x(n) at angle θ, the path lengths dL and dR, the HRTFs hL(n) and hR(n), and the ear entrance signals xL(n) and xR(n).]

2.1.1 Interaural Time Difference and Interaural Level Difference

As a result of the different path lengths to the ear entrances, dL − dR, there is a difference in arrival time between the two ear entrances, denoted as the interaural time difference (ITD). In addition, the wave fronts from the source impinge upon the left ear directly, while the sound received by the right ear is diffracted around the head. This diffraction causes an intensity difference between the left and right ear entrance signals, denoted as the interaural level difference (ILD). In 1907, Lord Rayleigh proposed the duplex theory, which explains the ability to localize sounds in the horizontal plane based on the notions of ITD and ILD (when considering only the frontal directions, −90° ≤ θ ≤ 90°).

Interaural Time Difference (ITD)

The ITD is related to the phase of the HRTF ratio and is generally thought to be one of the most important localization cues. It is used to localize low frequency sounds (below 1.5 kHz). In this frequency range, the wavelength of the sound is greater than the distance between the ears, so the phase difference between the sound waves at the two ears provides an acoustic localization cue. In contrast, at high frequencies the wavelength of the sound is shorter than the distance between the ears, and a localization error may occur. The ITD can be estimated from the normalized cross-correlation function as follows:

$$\tau(n) = \arg\max_{d} \frac{p_{LR}(d,n)}{\sqrt{p_L(n-d_1)\, p_R(n-d_2)}}, \qquad d_1 = \max\{-d, 0\}, \quad d_2 = \max\{d, 0\},$$

where $p_{LR}(d,n)$ is a short-time estimate of the mean of $x_L(n-d_1)\,x_R(n-d_2)$, and $p_L(n)$ and $p_R(n)$ are the corresponding short-time power estimates of $x_L(n)$ and $x_R(n)$.

Interaural Level Difference (ILD)

The ILD is derived from the amplitude of the HRTF ratio:

$$\Delta L(n) = 10 \log_{10}\!\left(\frac{p_L(n)}{p_R(n)}\right).$$

It is a complicated function of frequency because, for any given source position, the peaks and valleys in the HRTF may appear at different frequencies at the two ears. Moreover, the ILD is small at low frequencies, regardless of the source position, because the dimensions of the head and pinna are small compared with the wavelengths of sound at frequencies below about 1.5 kHz. For these reasons, the ILDs in individual frequency bands are more likely to be useful localization cues than the overall ILD.

When considering only the frontal direction, a specific ITD-ILD pair can be associated with the perceived source direction. Figure 2.2 shows an experimental setup for generating the left and right ear entrance signals, xL(n) and xR(n), from a single source signal x(n). Different ITD-ILD pairs determine the different locations of the auditory events, which appear in the frontal section of the upper head. The ITD is determined by the delay difference tL − tR, and the ILD is equal to 20log10(aL/aR) dB. When the left and right signals have the same level and no delay (i.e., ILD=0, ITD=0), an auditory event occurs at a location central to the listener (Region 1). By increasing the intensity of the right side, the auditory event moves from Region 1 to Region 2. When only the left signal is active, the auditory event appears at the left ear, as shown in Region 3. The ITD can be used to control the perceived position of the auditory event in a similar way.
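The ITD and ILD estimators of this section can be tried out on signals generated as in the setup of Figure 2.2. The following Python fragment is only a minimal sketch under stated assumptions, not the procedure used in the thesis: the short-time estimates are replaced by plain averages over the whole signal, the source is uniform white noise, and the helper names (`delayed`, `normalized_xcorr`, `estimate_itd`, `estimate_ild_db`) are hypothetical.

```python
import math
import random


def delayed(signal, delay, gain):
    """Return gain * signal delayed by `delay` samples (zero-padded, same length)."""
    return [0.0] * delay + [gain * v for v in signal[:len(signal) - delay]]


def mean(values):
    """Plain average, standing in for the short-time mean estimate (assumption)."""
    return sum(values) / len(values)


def normalized_xcorr(x_left, x_right, d):
    """Normalized cross-correlation at lag d, with d1 = max(-d, 0), d2 = max(d, 0)."""
    d1, d2 = max(-d, 0), max(d, 0)
    indices = range(max(d1, d2), min(len(x_left), len(x_right)))
    p_lr = mean([x_left[n - d1] * x_right[n - d2] for n in indices])
    p_l = mean([x_left[n - d1] ** 2 for n in indices])
    p_r = mean([x_right[n - d2] ** 2 for n in indices])
    if p_l == 0.0 or p_r == 0.0:
        return 0.0
    return p_lr / math.sqrt(p_l * p_r)


def estimate_itd(x_left, x_right, max_lag):
    """Lag d (in samples) that maximizes the normalized cross-correlation."""
    lags = range(-max_lag, max_lag + 1)
    return max(lags, key=lambda d: normalized_xcorr(x_left, x_right, d))


def estimate_ild_db(x_left, x_right):
    """Level difference 10*log10(p_L / p_R) in dB, using whole-signal powers."""
    p_l = mean([v * v for v in x_left])
    p_r = mean([v * v for v in x_right])
    return 10.0 * math.log10(p_l / p_r)


# Setup in the spirit of Figure 2.2: one source, per-ear gain and delay (assumed values).
rng = random.Random(0)
source = [rng.uniform(-1.0, 1.0) for _ in range(2000)]
x_left = delayed(source, delay=5, gain=1.0)    # tL = 5 samples, aL = 1.0
x_right = delayed(source, delay=2, gain=0.5)   # tR = 2 samples, aR = 0.5

print("ITD estimate (samples):", estimate_itd(x_left, x_right, max_lag=10))
print("ILD estimate (dB):", round(estimate_ild_db(x_left, x_right), 2))
```

With this sign convention the estimated lag corresponds to tL − tR, and the estimated level difference should be close to 20log10(aL/aR) dB; it is not exact because the two delayed signals drop slightly different tail samples.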

Figure 2.2: Generating the location of auditory event with specific ITD and ILD [1].

2.1.2 Interaural Coherence (IC)

In a reverberant environment, additional effects such as reflection, diffraction, and resonance may cause the signals at the left and right ears to be incoherent. In order to measure the degree of similarity between the left and right ear entrance signals, another spatial parameter, the interaural coherence (IC), is considered. This coherence is derived from the maximum absolute value of the normalized cross-correlation function:

$$IC = \max_{d} \frac{\left|\sum_{n=-\infty}^{\infty} x_R(n)\, x_L(n+d)\right|}{\sqrt{\sum_{n=-\infty}^{\infty} x_R^2(n) \sum_{n=-\infty}^{\infty} x_L^2(n+d)}},$$

where d corresponds to a small delay.

Figure 2.3 shows that the width (or spatial diffuseness) of the perceived auditory spatial image mostly depends on the IC cue. When the ear entrance signals are identical (IC=1), a compact auditory event is perceived, as illustrated in Region 1. The width of the auditory event increases as the IC decreases (Regions 2 and 3). Finally, when the ear entrance signals are independent (IC=0), two distinct auditory events are perceived at the sides (Region 4).
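The IC definition above can be illustrated numerically. The sketch below is a hedged example rather than the thesis implementation: the infinite sums are truncated to the finite signal length, only lags within ±`max_lag` are searched, and the function name `interaural_coherence` is hypothetical.

```python
import math
import random


def interaural_coherence(x_left, x_right, max_lag):
    """Maximum over lags d of |sum x_R(n) x_L(n+d)| / sqrt(sum x_R^2(n) * sum x_L^2(n+d)).

    The sums run over the samples where both x_R(n) and x_L(n+d) exist
    (finite-length stand-in for the infinite sums, an assumption of this sketch).
    """
    best = 0.0
    for d in range(-max_lag, max_lag + 1):
        pairs = [(x_right[n], x_left[n + d])
                 for n in range(len(x_right))
                 if 0 <= n + d < len(x_left)]
        numerator = abs(sum(r * l for r, l in pairs))
        denominator = math.sqrt(sum(r * r for r, _ in pairs) *
                                sum(l * l for _, l in pairs))
        if denominator > 0.0:
            best = max(best, numerator / denominator)
    return best


# Identical ear signals versus independent noise at the two ears (assumed test signals).
noise_a = [random.Random(1).uniform(-1.0, 1.0) for _ in range(1)]  # placeholder, replaced below
rng_a, rng_b = random.Random(1), random.Random(2)
noise_a = [rng_a.uniform(-1.0, 1.0) for _ in range(2000)]
noise_b = [rng_b.uniform(-1.0, 1.0) for _ in range(2000)]

print("IC, identical signals:  ", interaural_coherence(noise_a, noise_a, max_lag=10))
print("IC, independent signals:", interaural_coherence(noise_a, noise_b, max_lag=10))
```

Identical signals give a value of essentially 1 (the compact event of Region 1), while independent noise gives a value near 0 (Region 4). Because of the absolute value, a polarity-inverted copy also yields a coherence of 1.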

Figure 2.3: Width of auditory event [2].

2.2 Spatial Hearing with Two Sound Sources

The most commonly used consumer playback system for spatial audio is the stereo loudspeaker setup. Thus, it is interesting to investigate spatial hearing with two sound sources. The previous section shows the perceptual effects of the ITD, ILD, and IC cues. Similar to these interaural cues, there are three properties of the signals between two loudspeakers:

- Inter-channel time difference (ICTD)
- Inter-channel level difference (ICLD)
- Inter-channel coherence (ICC)

In the following paragraphs, a few phenomena related to ICTD, ICLD, and ICC are reviewed for two sources located in front of a listener.

Figure 2.4: ICTD and ICLD between a pair of coherent source signals [1].

Figure 2.4 illustrates different locations of the perceived auditory events controlled by the ICLD-ICTD pair. When the signals emitted from the two loudspeakers are identical (ICTD=0 and ICLD=0), an auditory event appears in the center between the two sources, as indicated by Region 1. By increasing the intensity of the right channel, the auditory event moves from Region 1 to Region 2. In the extreme case in which only the signal on the left is active, the auditory event appears at the left source position (Region 3). As with the IC, the ICC between two loudspeakers is also related to the width of the auditory event, as shown in Figure 2.5. When the ICC between the two loudspeakers decreases, the width of the auditory event increases.

Figure 2.5: ICC with width of the auditory event [1].

2.3 Multi-channel Environment

Multi-channel audio is the name for a variety of techniques used to expand and enrich the sound of audio playback by recording additional sound channels that can be reproduced through additional speakers. Such systems are known as "home theater systems" for movies. Figure 2.6 illustrates the standard loudspeaker setup for a 5.1 surround audio system. In the front, three loudspeakers are located at angles −30°, 0°, and +30°. The two surround loudspeakers at the rear, offset by ±110°, are intended to provide the important lateral signal components related to spatial impression. Additionally, one low-frequency effects (LFE) channel is used to carry extremely low sub-bass cinematic effects.

Figure 2.6: Standard 5.1 surround audio system.

2.3.1 Generating Sound for 5.1 Systems

Sound source locations can often be reproduced successfully using multi-channel recordings. The techniques used for recording or mixing two-channel stereo can be applied to a specific channel pair among the five main loudspeakers of a 5.1 setup. For example, to obtain an auditory event from a specific direction, the loudspeaker pair enclosing the desired direction is selected, and the corresponding signals are recorded or generated in the same way as in the stereo case (resulting in auditory events between the two selected loudspeakers).

2.4 Conclusions

The main point emphasized here is that the perceived direction of a sound source is determined by ITD, ILD, ICTD, and ICLD. The other parameters, IC and ICC, measure the width (spatial diffuseness) of the perceived auditory spatial image. These parameters also play an important role in capturing and generating sound in spatial audio systems, such as stereo or multi-channel audio playback.

Chapter 3 MPEG Surround

In this chapter, we briefly review several stereo and multi-channel algorithms related to MPEG Surround coding, including Intensity Stereo Coding (ISC), Parametric Stereo coding (PS), and Binaural Cue Coding (BCC). We then introduce the basic concepts and major modules of the MPEG Surround coder, which provides an extremely efficient method for compressing multi-channel audio signals at very low bit rates. Finally, the interconnection of MPEG Surround and the MPEG-4 HEAAC structure is described.

3.1 Related Techniques

3.1.1 Intensity Stereo Coding [3]

Intensity stereo coding (ISC) is a joint-channel coding technique that is part of the ISO/IEC MPEG family of standards [4]. By removing perceptually irrelevant information between audio channel pairs, it reduces the bit rate needed for encoding stereo or multi-channel signals, and it is more efficient than coding each channel separately. ISC exploits the fact that the human hearing system is sensitive to both the amplitude and the phase of low-frequency signals, whereas at high frequencies it is sensitive to amplitude but much less sensitive to phase. Thus, at high frequencies, the original left and right subband signals are replaced by a sum signal and a direction angle (azimuth), which controls the intensity stereo position of the auditory event created at the decoder.

3.1.2 Parametric Stereo Coding [5] [6]

Since ISC is prone to aliasing artifacts and is typically applied only to the higher frequency bands, Parametric Stereo (PS) technology was proposed to overcome these limitations. PS is standardized in MPEG-4 and is the next major step in enhancing the efficiency of audio compression for low bit-rate stereo signals. It is used in conjunction with the MPEG-4 HE-AAC (aacPlus) codec; the combination is known as HE-AAC v2 or Enhanced aacPlus. PS employs a dedicated (complex-valued, over-sampled) filter bank to avoid the aliasing artifacts that would otherwise result from the spectral modification applied when generating the output channels. In addition, it synthesizes not only the intensities but also the phase differences and the coherence between the output channels. Owing to these improvements, the PS tool can operate on the full audio bandwidth. In such a system, the stereo signal (a pair of signals) is reconstructed from the transmitted mono signal with the help of the stereo parameters.

3.1.3 Binaural Cue Coding [7] [8]

The Binaural Cue Coding (BCC) approach can be viewed as a generalization of the parametric stereo idea, delivering multi-channel output (with an arbitrary number of channels) from a single audio channel plus some side information. Figure 3.1 shows the generalized block diagram of the BCC encoder and decoder.

Figure 3.1: Generic scheme for binaural cue coding (BCC).

In the encoder, the multi-channel input signals are combined into a single sum signal by a downmix process. At the same time, the multi-channel sound image is extracted and parameterized as BCC side information. The decoder reproduces the multi-channel output signals from these data. Because BCC requires only a low bit rate (2 kb/s) to encode the side information, the total bit rate is only slightly higher than what is required to represent a mono audio signal. Another advantage of this scheme is its backward compatibility with non-multi-channel audio systems: a receiver that does not support multi-channel audio simply ignores the side information and decodes the sum signal.

3.1.3.1 Estimation of BCC parameters

As shown in Figure 3.2(a), the BCC parameters, namely the inter-channel level difference (ICLD), the inter-channel time difference (ICTD), and the inter-channel coherence (ICC), are estimated in the subband domain. The estimation process is applied independently to each subband.

Figure 3.2: BCC parameters estimation, (a) and (b).

Figure 3.2(b) shows an example for a 5-channel environment. The ICTD and ICLD are estimated between a reference channel (e.g. the left channel) and each of the other channels. A single ICC is estimated between the channel pair with the largest power, to describe the overall coherence among all audio channels.
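To make the per-subband estimation more concrete, the following Python sketch estimates an ICTD and an ICC for one channel pair. It is a minimal illustration under stated assumptions, not the estimator of [7] or [8]: the function names `normalized_xcorr` and `estimate_ictd_icc` are hypothetical, the inputs are taken to be real-valued subband signals of equal length, the ICTD is assumed to be the lag that maximizes the normalized cross-correlation, and the ICC is assumed to be the magnitude of that maximum.

```python
import math

def normalized_xcorr(x1, x2, lag):
    """Normalized cross-correlation of two real sequences at one lag.

    A positive lag compares x1[i] with x2[i + lag], i.e. x2 is the delayed one.
    Only the overlapping samples are used (an assumption of this sketch).
    """
    n = len(x1)
    num = p1 = p2 = 0.0
    for i in range(n):
        j = i + lag
        if 0 <= j < n:
            num += x1[i] * x2[j]
            p1 += x1[i] ** 2
            p2 += x2[j] ** 2
    den = math.sqrt(p1 * p2)
    return num / den if den > 0.0 else 0.0

def estimate_ictd_icc(x1, x2, max_lag):
    """Return (ICTD in samples, ICC) for one subband of a channel pair."""
    best_lag, best_val = 0, 0.0
    for lag in range(-max_lag, max_lag + 1):
        val = normalized_xcorr(x1, x2, lag)
        if abs(val) > abs(best_val):
            best_lag, best_val = lag, val
    return best_lag, abs(best_val)

# Usage: the second channel is a copy of the first, delayed by 3 samples.
reference = [math.sin(0.37 * i * i) for i in range(64)]
delayed = [0.0, 0.0, 0.0] + reference[:-3]
ictd, icc = estimate_ictd_icc(reference, delayed, max_lag=8)
```

In a complete encoder this search would run once per subband and per frame; the exhaustive lag search is chosen here only for readability.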

3.1.3.2 Synthesis of BCC parameters

The BCC synthesis scheme is shown in Figure 3.3. First, the downmixed sum signal is converted into the frequency domain by a filter bank. For each output channel, individual time delays and scale factors are imposed on the spectral coefficients to re-synthesize the ICTD and ICLD, respectively. A subsequent coherence synthesis process re-synthesizes the ICC. Finally, all output channels are converted back into time-domain signals.

Figure 3.3: BCC synthesis.

3.2 MPEG Surround

MPEG Surround can be viewed as an enhancement of the techniques described above: a multi-channel extension of Parametric Stereo coding, or a generalized version of BCC. In the following sections, we describe the standardization process of MPEG Surround and its structure.

3.2.1 MPEG Surround Standardization Process

Motivated by the demonstrated potential of what was then called the Spatial Audio Coding approach, ISO/MPEG started a new work item on the parametric coding of multi-channel audio signals by issuing a CfP (Call for Proposals) on Spatial Audio Coding in

March 2004 [9]. Four responses were received and evaluated with a number of performance measures, including the subjective quality of the decoded multi-channel audio signal, the subjective quality of the downmix signals, the bit rate of the spatial cue side information, and other parameters such as additional functionality and computational complexity. As a result of these extensive evaluations, the MPEG committee decided that the starting point of the standardization process, called Reference Model 0 (RM0), would be a combination of the submissions from two proponents: Fraunhofer IIS/Agere Systems and Coding Technologies/Philips. These systems not only outperformed the other submissions but also showed complementary performance in terms of per-item quality, bit rate, and complexity. Consequently, the merged RM0 (now called MPEG Surround) was designed to combine the best features of both individual systems and was found to fully meet the performance expectations. RM0 provides sound quality substantially surpassing that of existing matrixed surround solutions, even for the transmission of a mono downmix signal or for spatial cue bit rates as low as 6 kbit/s. It serves as the basis for further technical development within MPEG-4 audio. An extended description of the technology can be found in [10] and [11].

3.2.2 MPEG Surround Reference Model 0 Scheme

Rather than performing a discrete coding of the individual audio input channels, Spatial Audio Coding efficiently codes a multi-channel audio signal as a stereo (or even monaural) signal plus a small amount of side information describing the multi-channel spatial image. Figures 3.4 and 3.5 show the block diagrams of the MPEG Surround RM0 encoder and decoder, respectively. The input signals are processed by the analysis filter banks, which decompose them into separate frequency bands. The frequency selectivity of these filter banks is tailored specifically to mimic the frequency resolution of the human auditory system. The MPEG Surround encoder then captures the spatial image of the multi-channel audio signal and condenses it into a compact set of parameters. These

parameters typically include level/intensity differences and measures of correlation/coherence between the audio channels. In parallel, a stereo (or monaural) downmix signal of the sound material is created. The downmix signal is transformed back to the time domain by the synthesis filter banks and is transmitted to the decoder together with the spatial information. On the decoder side, the transmitted downmix signal is expanded into high-quality multi-channel outputs based on the known spatial parameters.

Figure 3.4: MPEG Surround Encoder Overview [15].

Figure 3.5: MPEG Surround Decoder Overview [15].

Moreover, to achieve a higher compression rate, MPEG Surround coding can be combined with a conventional state-of-the-art coder (Audio Encoder and Audio Decoder in Figures 3.4 and 3.5). The downmix signal is encoded with a core coder such as MPEG-1

Layer III (mp3), MPEG-2/4 AAC, or MPEG-4 High Efficiency AAC, or it could even be PCM. In this way, the MPEG Surround coder acts as a pre-processor to the audio encoder and as a post-processor to the core decoder. Thus, MPEG Surround coding provides complete backward compatibility with non-multi-channel audio systems through the downmix signal: a receiver without an MPEG Surround decoder simply decodes and presents the downmix signal.

3.2.3 Time to Frequency Transform

In the human auditory system, the processing of binaural cues is performed on a non-uniform frequency scale. Since the spatial parameters are estimated (at the encoder side) and applied (at the decoder side) as a function of time and frequency, both the encoder and the decoder require a transform or filter bank that resembles this non-uniform scale. Furthermore, the transform or filter bank should be over-sampled, because time- and frequency-dependent modifications are applied to the signals, and these would lead to audible aliasing distortion in a critically sampled system.

MPEG Surround employs a two-stage filter bank to satisfy these requirements. Figures 3.6 and 3.7 show the structure of the hybrid QMF analysis and synthesis filter banks, respectively. The first stage is a complex-modulated Quadrature Mirror Filter (QMF) bank that yields a uniform, over-sampled frequency representation of the audio signal. The signals of the lowest QMF subbands are subsequently fed through a second complex-modulated filter bank to provide a higher resolution at low frequencies.

Figure 3.6: Hybrid QMF analysis filter bank providing 71 output bands. The input is fed into a 64-band analysis QMF bank (dashed box). The three lower QMF subbands are further split to increase low frequency resolution (see shadowed box).

Figure 3.7: Hybrid QMF synthesis filter bank using 71 input bands. The low frequency coefficients are simply added (see shadowed box) prior to the synthesis with the QMF.

3.2.4 Analysis Quadrature Mirror Filter (AQMF) Bank

The first-stage filter bank is compatible with the filter bank used in the SBR algorithm [17]. Its subband signals are obtained by convolving the input signal with a set of analysis filter impulse responses $h_k[n]$ given by

$$h_k[n] = p[n]\,\exp\!\left\{\frac{j\pi}{M}\left(k+\frac{1}{2}\right)\left(n-\frac{1}{2}\right)\right\},$$

where $p[n]$ is the impulse response of the low-pass prototype filter (length 640), $M$ is the number of frequency bands ($M = 64$), and $k$ is the subband index ($k = 0, \ldots, M-1$). The filtered outputs are subsequently downsampled by a factor of $M$, giving the downsampled QMF outputs $X_k[n] = (x * h_k)[Mn]$.

The equation above is purely analytical. In practice, the computational complexity can be reduced by the polyphase decomposition method described in the following steps, in which an array x holding 640 time-domain input samples is assumed. Higher indices in the array correspond to older samples. Figure 3.8 shows the QMF analysis window.

Figure 3.8: QMF analysis windowing [17]. Indices 0 to 31 represent the 32 windows.

The QMF process is as follows.

1. Shift the samples in the array x by 64 positions. The oldest 64 samples are discarded and the 64 new samples are stored in positions 0 to 63.
2. Multiply the samples of array x by the window c to obtain array Z (Z[n] = x[n] × c[n], for n = 0 to

639). The 640 window coefficients c are shown in Figure 3.9.
3. Sum the samples according to the formula $u[n] = \sum_{j=0}^{4} Z[n + 128j]$, for n = 0 to 127, to create the 128-element array u.
4. Calculate 64 new subband samples by the matrix operation X = Mu, where

$$M(k,n) = \exp\!\left(\frac{i\pi\,(k+0.5)(2n-1)}{128}\right), \qquad 0 \le k < 64,\; 0 \le n < 128.$$

X(k, j) corresponds to the jth subband sample of the kth QMF subband. In the equation, exp() denotes the complex exponential function and i is the imaginary unit.

Figure 3.9: Coefficients of the QMF bank window (upper panel: the 640 window coefficients, magnitude versus index; lower panel: the squared window coefficients, power versus index).

Every pass through this loop produces 64 complex-valued subband samples, one for each subband. For every frame, the filter bank produces 32 subband samples per subband, corresponding to a time-domain signal of 2048 samples. The magnitude responses of the first 4 frequency bands (k = 0, …, 3) of the QMF analysis bank are illustrated in Figure 3.10.
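The four-step procedure of Section 3.2.4 can be sketched in Python as follows. This is an illustrative sketch rather than the reference implementation: the names `placeholder_window` and `qmf_analysis_slot` are hypothetical, and the window returned by `placeholder_window` is a stand-in Hann-windowed sinc, not the 640 coefficients tabulated in [17]. Storing the newest sample at index 0 is an assumption that follows from the statement that higher indices hold older samples.

```python
import cmath
import math

M_BANDS = 64    # number of QMF subbands (M)
WIN_LEN = 640   # length of the prototype window c

def placeholder_window():
    """Stand-in low-pass prototype; NOT the coefficient table of the standard."""
    c = []
    for n in range(WIN_LEN):
        t = (n - (WIN_LEN - 1) / 2.0) / (2.0 * M_BANDS)
        sinc = 1.0 if t == 0.0 else math.sin(math.pi * t) / (math.pi * t)
        hann = 0.5 - 0.5 * math.cos(2.0 * math.pi * (n + 0.5) / WIN_LEN)
        c.append(sinc * hann / (2.0 * M_BANDS))
    return c

# Modulation matrix M(k, n) = exp(i*pi*(k + 0.5)*(2n - 1)/128), computed once.
MOD = [[cmath.exp(1j * math.pi * (k + 0.5) * (2 * n - 1) / 128.0)
        for n in range(2 * M_BANDS)]
       for k in range(M_BANDS)]

def qmf_analysis_slot(state, new_samples, window):
    """Consume 64 new time samples and return 64 complex subband samples.

    `state` is the 640-element array x; it is updated in place.
    """
    assert len(state) == WIN_LEN and len(new_samples) == M_BANDS
    # Step 1: shift by 64 positions; the oldest 64 samples fall off the end.
    state[M_BANDS:] = state[:WIN_LEN - M_BANDS]
    state[:M_BANDS] = new_samples[::-1]   # newest sample at index 0
    # Step 2: apply the window, Z[n] = x[n] * c[n].
    z = [s * w for s, w in zip(state, window)]
    # Step 3: fold the 640 windowed samples into the 128-element array u.
    u = [sum(z[n + 128 * j] for j in range(5)) for n in range(128)]
    # Step 4: X = M u, one complex output per subband.
    return [sum(MOD[k][n] * u[n] for n in range(128)) for k in range(M_BANDS)]

# Usage: one frame of 2048 samples yields 32 slots of 64 subband samples each.
window = placeholder_window()
state = [0.0] * WIN_LEN
frame = [math.sin(2.0 * math.pi * 0.01 * n) for n in range(2048)]
slots = [qmf_analysis_slot(state, frame[s * 64:(s + 1) * 64], window)
         for s in range(32)]
```

The direct matrix product in step 4 costs 64 × 128 complex multiplications per slot, which is the kind of load that motivates replacing it with a fast transform.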

Figure 3.10: Magnitude responses of the first 4 bands of the QMF analysis filter bank. (The magnitude for k = 0 is highlighted.)

3.2.5 Hybrid Filterbank for improved frequency resolution

At a sampling rate of 44.1 kHz, the 64-band analysis filter bank results in an effective bandwidth of approximately 344 Hz per band, which is considerably wider than the required spectral resolution at low frequencies. To further improve the frequency resolution, the lower QMF subbands are split with an additional filter bank based on oddly-modulated Qth-band filter banks. Depending on the QMF subband, two types of filter are defined.

Table 3.1: Overview of low frequency split

QMF subband p    Number of bands Q^p    Filter
0                8                      Type A
1                2                      Type B
2                2                      Type B

$$\text{Type A:}\quad G_q^p[n] = g^p[n]\,\exp\!\left(j\,\frac{2\pi}{Q^p}\left(q+\frac{1}{2}\right)(n-6)\right),$$

$$\text{Type B:}\quad G_q^p[n] = g^p[n]\,\cos\!\left(\frac{2\pi}{Q^p}\,q\,(n-6)\right),$$

where $g^p$ represents the 12th-order prototype filter in QMF subband p, $Q^p$ the total number of sub-subbands in subband p, q the sub-subband index in QMF channel p, and n the time index. According to Table 3.1, the first three QMF bands perform sub-subband filtering with $Q^0 = 8$, $Q^1 = 2$, and $Q^2 = 2$. The remaining QMF subbands, which are not filtered, are delay compensated. This delay amounts to 6 QMF subband samples (i.e. $G_0^k(z) = z^{-6}$ for k = 3, …, 63). In addition, to further reduce the complexity of this configuration, some of the filter bank outputs are summed, resulting in a total of 71 output bands. As an example, the magnitude response of the 8-band sub-filter bank is given in Figure 3.11.

Figure 3.11: Magnitude response of the 8-band sub-filter bank. (Subband q = 0 is highlighted.)

3.2.6 Subband Partition

To reduce the complexity and the bit rate of the spatial parameters, the 71 subband signals are grouped into fewer parameter bands. The spatial parameters vary over time and frequency, and psychoacoustic data indicate that a Bark-like or Equivalent Rectangular Bandwidth (ERB)-like frequency scale is appropriate for them. The ERB scale is given by

$$\mathrm{ERB}(f) = 24.7\,(0.00437 f + 1),$$

where f is the center frequency in Hz. Hence, the subband coefficients are non-uniformly

divided into 20 individual parameter bands according to such a perceptual frequency scale. In each parameter band, one set of spatial parameters is separately estimated and transmitted to the decoder.

3.2.7 Spatial Audio Parameters

The Spatial Audio Coding system employs two conceptual modules with which it is possible to describe any arbitrary mapping from M to N channels and back, with N < M. The structure of the system divides the input channels into pairs, which are coded by modules that take two input channels and produce one output channel (Reverse One-To-Two modules, R-OTT). Modules that take three input channels and produce two output channels are also available (Reverse Two-To-Three modules, R-TTT).

3.2.7.1 R-OTT module

The purpose of the R-OTT module is to create a mono downmix from a stereo input and to extract the relevant spatial parameters. For each frequency band, two parameters are computed (assuming input signals $x_1$ and $x_2$).

- Channel Level Differences (CLD): they represent the power ratio of corresponding time/frequency tiles of the input signals, given by

$$\mathrm{CLD} = 10\log_{10}\!\left(\frac{\sum_n\sum_m x_1^{n,m}\,x_1^{n,m*}}{\sum_n\sum_m x_2^{n,m}\,x_2^{n,m*}}\right).$$

- Inter-channel coherence/cross-correlation (ICC): they represent the similarity measure of the corresponding time/frequency tiles of the input signals, given by

$$\mathrm{ICC} = \mathrm{Re}\!\left\{\frac{\sum_n\sum_m x_1^{n,m}\,x_2^{n,m*}}{\sqrt{\sum_n\sum_m x_1^{n,m}\,x_1^{n,m*}\;\sum_n\sum_m x_2^{n,m}\,x_2^{n,m*}}}\right\}.$$
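As a worked illustration of the two R-OTT formulas, the Python sketch below computes CLD and ICC for one parameter band. It is a minimal sketch under stated assumptions: the function name `rott_parameters` is hypothetical, each input is a list of time slots n holding the complex subband samples m of one parameter band, and the small constant `eps`, which guards against division by zero and the logarithm of zero, is an addition not present in the formulas.

```python
import math

def rott_parameters(x1, x2, eps=1e-12):
    """Return (CLD in dB, ICC) for the time/frequency tiles of x1 and x2.

    x1, x2: lists of rows (time slot n), each row a list of complex
    subband samples (band m) belonging to the same parameter band.
    """
    e1 = e2 = 0.0   # energies: sum over n, m of x x*
    cross = 0j      # cross term: sum over n, m of x1 x2*
    for row1, row2 in zip(x1, x2):
        for a, b in zip(row1, row2):
            e1 += abs(a) ** 2
            e2 += abs(b) ** 2
            cross += a * complex(b).conjugate()
    cld_db = 10.0 * math.log10((e1 + eps) / (e2 + eps))
    icc = (cross / math.sqrt((e1 + eps) * (e2 + eps))).real
    return cld_db, icc

# Usage: the second channel is the first one, inverted and at half amplitude.
left = [[1 + 0j, 1j], [2 + 0j, 0j]]
right = [[-0.5 * v for v in row] for row in left]
cld, icc = rott_parameters(left, right)
```

For this input the power ratio is 4, so the CLD is about 6 dB, and because the second channel is a scaled, inverted copy of the first, the ICC is close to -1.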

3.2.7.2 R-TTT module

The R-TTT module performs a downmix from three channels (l, c, r) to two downmix channels (l0, r0), combined with the generation of the associated spatial parameters. It is appropriate for modeling the symmetrically downmixed center from a stereo downmix pair. The TTT module has two modes of operation.

TTT Energy Mode

The first method comprises an energy-based parameterization, using channel level difference (CLD) parameters:

$$\mathrm{CLD}_1 = 10\log_{10}\!\left(\frac{\sum_n\sum_m l^{n,m}\,l^{n,m*} + \sum_n\sum_m r^{n,m}\,r^{n,m*}}{\sum_n\sum_m c^{n,m}\,c^{n,m*}}\right),$$

$$\mathrm{CLD}_2 = 10\log_{10}\!\left(\frac{\sum_n\sum_m l^{n,m}\,l^{n,m*}}{\sum_n\sum_m r^{n,m}\,r^{n,m*}}\right).$$

This method aims at providing the desired output energy ratios of the signals l, r, and c, represented by the energy ratios CLD1 and CLD2.

TTT Prediction Mode

A second mode of operation for the R-TTT module is based on transmitting the up-mix matrix directly. It makes use of the following parameters.

- Channel Prediction Coefficients (CPC): the general purpose of the TTT module at the decoder side is to generate three signals from the transmitted stereo downmix signal pair. If the up-mix process from the stereo pair (l0, r0) to the signal triplet (l', r', c') is written in matrix notation, the up-mix matrix $M_{CPC}$ is given by

$$\begin{bmatrix} l' \\ r' \\ c' \end{bmatrix} = M_{CPC} \begin{bmatrix} l_0 \\ r_0 \end{bmatrix},$$

$$M_{CPC} = \frac{1}{2}\begin{bmatrix} c_1 + 2 & c_2 - 1 \\ c_1 - 1 & c_2 + 2 \\ 1 - c_1 & 1 - c_2 \end{bmatrix}.$$

Here, $c_1$ and $c_2$ represent the transmitted CPC parameters. The CPC parameters aim at an optimum reconstruction of the spatial parameters of the signals l', r', and c' (compared to the spatial parameters of the corresponding signals at the encoder side). In other words, given the up-mix matrix $M_{CPC}$, the parameters $c_1$ and $c_2$ can be optimized according to several optimization criteria. However, the recovered signals will in general be only partially correlated with the originals, so there will be a prediction loss.

- Inter-channel coherence/cross-correlation (ICC): unlike in the R-OTT module, the ICC parameter here describes the prediction loss caused by the CPC parameters. It is the power ratio between the R-TTT module input signals and the reconstructed signals:

$$\mathrm{ICC} = \frac{\langle l\cdot l^{*}\rangle + \langle r\cdot r^{*}\rangle + \langle c\cdot c^{*}\rangle}{\langle l'\cdot l'^{*}\rangle + \langle r'\cdot r'^{*}\rangle + \langle c'\cdot c'^{*}\rangle}.$$

Here, $\langle\cdot\rangle$ denotes the expected value operator, and * denotes complex conjugation.

3.2.7.3 Hierarchical Configuration

Encoding 5.1 surround sound into two-channel stereo is particularly attractive in view of its backward compatibility with existing stereo consumer devices. Figure 3.12(a) shows a block diagram of a 5.1-to-2 encoder for such a typical system, consisting of three R-OTT modules and one R-TTT module. The signals L, Ls, C, LFE, R, and Rs denote the left front, left back, center, LFE, right front, and right back channels, respectively. Another example is illustrated in Figure 3.12(b), which shows how the R-OTT modules can be connected in a tree structure to form a 5.1-to-1 encoder.

Figure 3.12: (a) 5.1-to-2 MPEG Surround encoder (b) 5.1-to-1 MPEG Surround encoder.

3.3 Combination of MPEG Surround and HE-AAC

As mentioned previously, the MPEG Surround coder can be connected to a state-of-the-art core coder. The MPEG Surround coder interfaces to the downmix channels by means of a QMF-domain representation that is identical to the one standardized in the SBR tool of MPEG-4 HEAAC. When the spatial coder is combined with HEAAC, this QMF representation is therefore directly available as an intermediate signal in the HEAAC coder. Figure 3.13 shows the block diagram of the MPEG Surround-HEAAC encoder. The output signals of the hybrid synthesis filterbank are fed directly to the SBR tool.

Figure 3.13: Interconnection of MPEG Surround and HEAAC encoder.

Chapter 4 DSP Implementation Environment

We select the TI DSP platform to implement the MPEG Surround encoder. The DSP baseboard (SMT395) is made by Sundance and houses a Texas Instruments TMS320C6416T DSP chip and a Xilinx Virtex-II Pro FPGA. Because our implementation is mainly in software, the discussion in this chapter focuses on the DSP system environment, the DSP chip, and its features. The software development tool, Code Composer Studio (CCS), is then introduced. Finally, some important acceleration techniques and features that reduce stalls or hazards on the DSP system are described.

4.1 DSP Baseboard (SMT395)

The block diagram of the Sundance DSP baseboard system (SMT395) is shown in Figure 4.1 [18]. The SMT395 is designed to provide high processing flexibility and high performance. Its important features are listed as follows.

- 1 GHz TMS320C6416T fixed-point DSP processor with L1 and L2 cache and SDRAM.
- 8000 MIPS peak performance.
- Xilinx Virtex-II Pro FPGA, XC2V920-6 in FF896 package.
- Two Sundance High Speed Bus ports (100 MHz, 200 MHz), each 32 bits wide.
- Eight 2 Gbit/s Rocket Serial Links (RSL) for inter-module communication.
- 8 MB flash ROM for configuration and booting.
- Six comm-ports of up to 20 MB per second for inter-DSP communication.
- JTAG diagnostics port.

Figure 4.1: The block diagram of the Sundance DSP Baseboard system [18].

4.2 DSP chip

The TMS320C6416T DSP uses the VelociTI.2 architecture [19]. VelociTI.2 is a high-performance, advanced very long instruction word (VLIW) architecture, which makes it an excellent choice for multi-channel, multi-functional, and performance-driven applications. A VLIW architecture achieves high performance through increased instruction-level parallelism, executing multiple instructions during a single cycle. This parallelism takes the DSP well beyond the performance capabilities of traditional superscalar systems and is the key to its high performance. The TMS320C6416T is the highest-performance fixed-point DSP generation in the TMS320C64x series. According to [19], the TMS320C64x series is a member of the TMS320C6000 (C6x) family. The block diagram of the C6000 family is shown in Figure 4.2. The detailed features of the C6x family devices include:

- Advanced VLIW DSP core.
- Eight independent functional units, including two multipliers and six arithmetic logic units (ALUs).
- 64 32-bit general-purpose registers.
- Instruction packing to reduce code size, program fetches, and power consumption.
- Conditional execution of all instructions.
- Non-aligned load and store architecture.
- Byte-addressable (8/16/32/64-bit data), providing efficient memory support for a variety of applications.
- 8-bit overflow protection.

Figure 4.2: Block diagram of TMS320C6x DSP [23].

Peripherals such as the enhanced direct memory access (EDMA) controller, power-down logic, and two external memory interfaces (EMIFs) usually come with the CPU, while peripherals such as serial ports and host ports are available only on certain devices. In the following

sections, the three major parts of the C64x DSP chip are introduced in turn: the central processing unit (CPU), the memory, and the peripherals.

4.2.1 Central Processing Unit (CPU)

Besides the eight independent functional units and sixty-four general-purpose registers, the C64x CPU consists of the program fetch unit, the instruction dispatch unit, the instruction decode unit, two data paths, interrupt logic, several control registers, and two register files. The DSP chip architecture is illustrated in Figure 4.3. The instruction dispatch and decode units decode up to eight instructions per cycle and dispatch them to the eight functional units. The eight functional units of the C64x architecture are divided into two data paths, A and B. Each path consists of four functional units: one for multiplication operations (.M), one for logical and arithmetic operations (.L), one for branch, bit manipulation, and arithmetic operations (.S), and one for loading/storing, address calculation, and arithmetic operations (.D).

Figure 4.3: C64x DSP chip architecture [23].

Two cross-paths (1x and 2x) allow the functional units of one data path to access a 32-bit operand from the register file on the other side. There can be a maximum of two cross-path source reads per cycle. Each data path has 32 general-purpose registers, some of which are reserved for specific addressing modes or used for conditional instructions. Each functional unit has its own 32-bit bus for writing into a general-purpose register file.

4.2.2 Memory and Peripherals

The C64x uses a two-level cache-based memory architecture [21] and has a powerful set of peripherals. The level 1 cache is split into a level 1 program (L1P) cache and a level 1 data (L1D) cache, each 16 kB in size. The level 2 memory/cache (L2) consists of a 1 MB memory space that can optionally be split into cache (up to 256 KB) and L2 SRAM (addressable on-chip memory). In addition, the system has one external memory, a 256 MB SDRAM operating at 133 MHz.

C64x DSP chips also contain peripherals that support off-chip memory options, co-processors, host processors, and serial devices. These peripherals include the enhanced direct memory access (EDMA) controller, the Host-Port Interface (HPI), three 32-bit general-purpose timers, and an IEEE-1149.1 JTAG interface. The EDMA controller transfers data between regions of the memory map without intervention by the CPU. It can move data from internal memory to external memory, or from internal peripherals to external devices. The HPI is used for communication between the host PC and the target DSP. It is a 16/32-bit wide parallel port through which a host processor can directly access the CPU's memory space, including the memory-mapped peripherals. The three 32-bit general-purpose timers are used to time events, count events, generate pulses, interrupt the CPU, and send synchronization events to the DMA controller. A timer can be clocked by an internal or an external source.

4.3 TI DSP Code Development Environment

In this section, we give a brief introduction to the code development environment used in this project. Code Composer Studio (CCS) and the code development flow are described.

4.3.1 Code Composer Studio (CCS)

Code Composer Studio (CCS) is an integrated development environment (IDE) for building and debugging programs. We briefly list the features related to our implementation below. The details can be found in [20].

- Real-time analysis.
- Compiles code and generates a Common Object File Format (COFF) output file.
- Provides debug options such as step over, step in, step out, run free, and so on.
- Watches any memory section when the DSP halts.
- Provides chip support libraries (CSL) to simplify device configuration.
- Supports optimized DSP functions such as FFT, filtering, and convolution.
- Counts the instruction cycles between successive profile points.
- Arranges code and data in different memory spaces through the linker command file.
- Probes a PC file stream into or from a target memory location.

We mainly use the CCS tool for debugging, refining, optimizing, and implementing our C code on the DSP. The profiling function helps us evaluate whether a change to the code improves its performance.

4.3.2 Code Development Flow

The DSP code development can be divided into three steps.

- Step 1: Develop the C code without regard to the particular structure of the C64x. Then use the debugger to profile the code and identify any inefficient sections. If the performance is not satisfactory, go to step 2.
- Step 2: Improve the C code using DSP intrinsics, shell options, and coding techniques for code generation. The refinements include compiler options, intrinsics, statements, data type modifiers, and code transformations. If the code efficiency is still not sufficient, proceed to step 3.
- Step 3: Extract the most time-critical areas and rewrite them in linear assembly. Then use the assembly optimizer to optimize the code.

In general, we do not go to step 3, because linear assembly is too detailed and takes much more time than step 2. Figure 4.4 shows the steps of the software development flow [20].

Figure 4.4: Code development flow [20] (Phase 1: develop C code; Phase 2: refine C code; Phase 3: write linear assembly).

(46) 4.4 DSP Code Acceleration Methods

In this section, we describe several methods that accelerate our code and reduce its execution time on the C64x DSP. Some of these methods rely on features of the C64x DSP system.

4.4.1 Compiler Option Setting

The CCS compiler translates the source program into assembly code, and it supports several options for optimizing that code. The C/C++ compiler can perform various optimizations to reduce code size and improve execution performance. The easiest way to invoke optimization is to use the cl6x shell program and specify the -on option on the cl6x command line, where n denotes the level of optimization (0, 1, 2, 3) and controls the type and degree of optimization:

Table 4.1: Options that control optimization [22]

Optimization level: Description

-o0 (Register):
- Performs control-flow-graph simplification
- Allocates variables to registers
- Performs loop rotation
- Eliminates unused code
- Simplifies expressions and statements
- Expands calls to functions declared inline

-o1 (Local): Performs all -o0 optimizations, and:
- Performs local copy/constant propagation
- Removes unused assignments
- Eliminates local common expressions

-o2 (Function): Performs all -o1 optimizations, and:
- Performs software pipelining
- Performs loop optimizations
- Eliminates global common sub-expressions
- Eliminates global unused assignments
- Converts array references in loops to incremented pointer form

(47)
- Performs loop unrolling

-o3 (File): Performs all -o2 optimizations, and:
- Removes all functions that are never called
- Simplifies functions with return values that are never used
- Inlines calls to small functions
- Reorders function declarations so that the attributes of called functions are known when the caller is optimized
- Propagates arguments into function bodies when all calls pass the same value in the same argument position
- Identifies file-level variable characteristics

4.4.2 Fixed-point Coding

The C64x DSP is a fixed-point processor, so it natively supports fixed-point operations only. Although the C64x DSP can emulate floating-point processing in software, it takes many extra clock cycles to perform the same operation. Table 4.2 lists the measured processing time on the C64x DSP of the "add" and "mul" instructions for different data types. The "char", "short", "int", and "long" types are fixed-point data types, and "float" and "double" are floating-point data types. The floating-point operations clearly cost far more than the fixed-point ones: in most cases they need more than 10 times as many execution cycles. To accelerate our program on the C64x DSP, it is therefore necessary to convert the data types from floating-point to fixed-point. However, this modification may cause some loss of accuracy in the data.

Table 4.2: Processing time (execution cycles) on the C64x DSP for different data types

Assembly       char     short     int       long      float     double
instruction    8-bit    16-bit    32-bit    40-bit    32-bit    64-bit
add            1        1         1         2         77        146
mul            2        2         6         8         54        69
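The floating-point to fixed-point conversion described above can be illustrated with a minimal sketch. It assumes the common Q15 format, in which a 16-bit integer x represents the value x/32768. The helper names (float_to_q15, q15_to_float, q15_mul) and the rounding and saturation choices are assumptions made for this illustration only; they are not taken from the encoder implementation.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed Q15 format: a 16-bit signed integer x represents x / 32768. */
#define Q15_ONE 32768

/* Convert a real value in [-1, 1] to Q15 (hypothetical helper).
 * Rounds to nearest and saturates at the 16-bit limits. */
int16_t float_to_q15(double x)
{
    double scaled;
    long v;
    assert(x >= -1.0 && x <= 1.0);
    scaled = x * Q15_ONE;
    v = (long)(scaled >= 0.0 ? scaled + 0.5 : scaled - 0.5);
    if (v > 32767) v = 32767;
    if (v < -32768) v = -32768;
    return (int16_t)v;
}

/* Convert Q15 back to a real value, e.g. to measure the accuracy loss. */
double q15_to_float(int16_t x)
{
    return (double)x / Q15_ONE;
}

/* Multiply two Q15 values. The 16x16-bit product is in Q30 format, so it
 * is rounded and shifted right by 15 bits to return to Q15. The right
 * shift of a negative value is assumed to be arithmetic, as it is on
 * common compilers. Only (-1) * (-1) can overflow, hence the saturation. */
int16_t q15_mul(int16_t a, int16_t b)
{
    int32_t p = (int32_t)a * (int32_t)b;  /* Q30 product */
    p = (p + (1 << 14)) >> 15;            /* round, back to Q15 */
    if (p > 32767) p = 32767;
    return (int16_t)p;
}
```

For instance, q15_mul(float_to_q15(0.5), float_to_q15(0.25)) yields 4096, which represents 0.125. The multiplication uses only 16-bit operands, which is the cheapest "mul" case in Table 4.2.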

(48) 4.4.3 Loop Unrolling

Loop unrolling expands a loop so that all of its iterations appear explicitly in the code. It often increases the number of instructions available to execute in parallel. If the code contains conditional instructions, the compiler cannot always determine whether a branch will be taken, and extra clock cycles are spent waiting for the branch decision. We can therefore unroll the loop to avoid some of the branching overhead. Example 4.1 shows a loop before and after unrolling, and Table 4.3 compares the execution cycles and code size.

(a) /* Before unrolling */

    int i, a = 0, b = 0;
    for (i = 0; i < 8; i++) {
        a += i;
        b += i;
    }

(b) /* After unrolling */

    int i = 0, a = 0, b = 0;
    a += i; b += i; i++;
    a += i; b += i; i++;
    a += i; b += i; i++;
    a += i; b += i; i++;
    a += i; b += i; i++;
    a += i; b += i; i++;
    a += i; b += i; i++;
    a += i; b += i; i++;

Example 4.1: Loop unrolling.

Table 4.3: Comparison between rolling and unrolling

                    (a) Before unrolling    (b) After unrolling
Execution cycles    436                     206
Code size           116                     479

4.4.4 Intrinsics and Packed Data Processing

The TI C6000 compiler provides many special functions that map C code directly to inlined C64x instructions in order to improve the efficiency of C code. These special functions are called intrinsics. Figure 4.5 shows some examples of the intrinsic functions for the C6000 DSP. The entire list of intrinsics for the C6000 DSP can be found in [20].
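To make the idea of an intrinsic concrete, the sketch below emulates a packed 16-bit addition in portable C. On the C64x, the compiler intrinsic _add2 adds the upper and lower 16-bit halves of two 32-bit words independently in a single instruction. The functions here (pack2, add2_emulated, unpack_hi, unpack_lo, add_arrays_packed) are hypothetical names introduced only to show the behavior on any C compiler; they are not part of the TI library or of our encoder.

```c
#include <assert.h>
#include <stdint.h>

/* Pack two signed 16-bit values into one 32-bit word
 * (hi in bits 31..16, lo in bits 15..0). Hypothetical helper. */
uint32_t pack2(int16_t hi, int16_t lo)
{
    return ((uint32_t)(uint16_t)hi << 16) | (uint16_t)lo;
}

/* Portable emulation of a packed 16-bit add: the two halves are added
 * independently, and a carry out of the lower half never reaches the
 * upper half. Each half wraps around on overflow. */
uint32_t add2_emulated(uint32_t a, uint32_t b)
{
    uint32_t lo = (a + b) & 0x0000FFFFu;
    uint32_t hi = ((a & 0xFFFF0000u) + (b & 0xFFFF0000u)) & 0xFFFF0000u;
    return hi | lo;
}

/* Extract the halves. The conversion to int16_t is assumed to wrap in
 * two's complement, as it does on common compilers. */
int16_t unpack_hi(uint32_t w) { return (int16_t)(uint16_t)(w >> 16); }
int16_t unpack_lo(uint32_t w) { return (int16_t)(uint16_t)(w & 0xFFFFu); }

/* Add two arrays of 16-bit samples, two samples per packed operation. */
void add_arrays_packed(const int16_t *a, const int16_t *b,
                       int16_t *out, int n)
{
    int i;
    assert(n % 2 == 0);  /* this sketch handles even lengths only */
    for (i = 0; i < n; i += 2) {
        uint32_t s = add2_emulated(pack2(a[i + 1], a[i]),
                                   pack2(b[i + 1], b[i]));
        out[i] = unpack_lo(s);
        out[i + 1] = unpack_hi(s);
    }
}
```

In this emulation the packing and unpacking cost more than they save. The benefit appears on the DSP, where the samples already sit packed in memory and the packed add is a single instruction, so each loop iteration processes two samples.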

(49) Figure 4.5: Intrinsic functions of the TI C6000 series DSP (partial list) [20].

Packed data processing uses a single load or store instruction to access multiple data items that are located consecutively in memory, which maximizes data throughput. This is the so-called single instruction multiple data (SIMD) method. For example, if we place four 8-bit values or two 16-bit values in one 32-bit word, we can perform four or two operations in one clock cycle. Using the SIMD method can therefore improve the code efficiency substantially. Some intrinsic functions enhance the efficiency in a similar way.

4.4.5 Register and Memory Arrangement

When the accessed data are located in external memory, more clock cycles may be needed to transfer them to the CPU. Keeping data in registers reduces this transfer time. In DSP code, variables, pointers, buffers allocated with malloc, and the C code itself all occupy memory. Their placement is controlled by the linker command file (link.cmd), which is the memory allocation file. To accelerate the program, we place different types of data in different memory spaces. The compiler also provides the "CODE_SECTION" and "DATA_SECTION" pragmas, which place selected parts of the C code or data in internal memory in order to speed up execution.
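The use of the DATA_SECTION and CODE_SECTION pragmas can be sketched as follows. The pragma syntax is that of the TI C compiler. The section names ".fastdata" and ".fastcode", the buffer work_buf, and the function scale_frame are hypothetical and chosen only for illustration. The guard macro __TI_COMPILER_VERSION__ is assumed to be predefined by the TI compiler, so that other compilers simply skip the pragmas.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FRAME_LEN 64  /* hypothetical buffer length */

/* Ask the TI compiler to put the buffer in its own named section.
 * The linker command file can then map that section to internal memory.
 * ".fastdata" is a made-up section name for this sketch. */
#ifdef __TI_COMPILER_VERSION__
#pragma DATA_SECTION(work_buf, ".fastdata")
#endif
int16_t work_buf[FRAME_LEN];

/* Likewise, place a time-critical function in a named code section.
 * ".fastcode" is also a made-up name. */
#ifdef __TI_COMPILER_VERSION__
#pragma CODE_SECTION(scale_frame, ".fastcode")
#endif
void scale_frame(int16_t *buf, int n, int shift)
{
    int i;
    assert(buf != NULL && n >= 0 && shift >= 0 && shift < 16);
    for (i = 0; i < n; i++)
        buf[i] = (int16_t)(buf[i] >> shift);  /* scale down by 2^shift */
}
```

The pragmas only name the sections. The actual placement is decided in the SECTIONS part of the linker command file, where each named section is assigned to a memory range defined for the target board.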

(50) 4.4.6 Using Macros

Because a software-pipelined loop cannot contain function calls, a loop that calls a function takes more clock cycles to complete. Hence, under some conditions we can replace functions with "#define" macros. In addition, replacing a function with a macro removes the code for the function definition and reduces the number of branches. However, a macro is expanded at every place where it is used, so if the function contains many instructions, the code size grows and memory usage becomes inefficient.

4.4.7 Linear Assembly

When we are not satisfied with the efficiency of the assembly code generated by the TI CCS compiler, we can rewrite parts of the code in linear assembly and optimize the assembly directly. In practice, however, this process is generally too detailed and very time-consuming. Hence, we use it only as the last step, when the performance constraints are very strict and no alternative algorithm is available.
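The macro replacement of Section 4.4.6 can be sketched with a multiply-accumulate operation inside a dot-product loop. The names mac_func, MAC, dot_product_func, and dot_product_macro are hypothetical and are not taken from the reference software.

```c
#include <assert.h>
#include <stdint.h>

/* Function version: the call inside the loop may prevent the compiler
 * from software-pipelining the loop unless the call is inlined. */
int32_t mac_func(int32_t acc, int16_t a, int16_t b)
{
    return acc + (int32_t)a * (int32_t)b;
}

/* Macro version: expanded in place, so the loop body has no call.
 * Every argument is parenthesized; arguments with side effects
 * (such as x[i++]) should still be avoided. */
#define MAC(acc, a, b) ((acc) + (int32_t)(a) * (int32_t)(b))

int32_t dot_product_func(const int16_t *x, const int16_t *y, int n)
{
    int32_t acc = 0;
    int i;
    assert(n >= 0);
    for (i = 0; i < n; i++)
        acc = mac_func(acc, x[i], y[i]);
    return acc;
}

int32_t dot_product_macro(const int16_t *x, const int16_t *y, int n)
{
    int32_t acc = 0;
    int i;
    assert(n >= 0);
    for (i = 0; i < n; i++)
        acc = MAC(acc, x[i], y[i]);
    return acc;
}
```

Both versions compute the same result. The difference lies only in the generated code: the macro body is duplicated wherever MAC is used, which is the memory cost mentioned in Section 4.4.6.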

(51) Chapter 5 Acceleration of MPEG Surround Encoder on DSP Platform

In this chapter, we present several acceleration methods for the MPEG Surround encoder. We first introduce the MPEG Surround reference software. By analyzing the complexity of the reference encoder, we determine which parts need to be accelerated. Then, we propose several methods to reduce the computing time. We also implement an MPEG Surround-HEAAC encoder on the DSP platform. Finally, we show the acceleration results.

5.1 Software Environment

5.1.1 MPEG Surround RM0 Reference Software

The MPEG Surround group of the MPEG standard committee provides a piece of reference software [25]. It is written in the C programming language for the codec specified by ISO/IEC 23003-1. It was originally developed by Agere Systems, Coding Technologies, Fraunhofer IIS, and Philips. We will introduce the MPEG Surround reference encoder in the next section.

5.1.2 Command Line Switches

In the reference software, there are two command line arguments that control the encoder configuration. The first argument is Tree-Config, which defines the tree structure configuration according to Table 5.1. The 5151 and 5152 configurations make use of five

(52) R-OTT modules to produce a mono downmixed output. The encoder with the 525 configuration consists of three R-OTT modules and an R-TTT module, and produces a stereo output.

Table 5.1: Tree structure of the MPEG Surround reference software encoder [10]

[Table 5.1 contains three diagrams: the 5151 tree structure, the 5152 tree structure, and the 525 tree structure.]