MPEG-4先進音訊編解碼器之增速及其在DSP平台上的實現

全文

(1)國立交通大學電子工程學系碩. 士. 電子研究所碩士班論. 文. MPEG-4 先進音訊編解碼器之增速及其在 DSP 平台上的實現. MPEG-4 AAC Codec Acceleration and DSP Implementation. 研究生: 王盈閔指導教授: 杭學鳴博士. 中華民國九十四年六月.

(2) MPEG-4 先進音訊編解碼器之增速及其在 DSP 平台上的實現 MPEG-4 AAC Codec Acceleration and DSP Implementation 研究生: 王盈閔. Student: Yin-Ming Wang. 指導教授: 杭學鳴博士. Advisor: Dr. Hsueh-Ming Hang. 國立交通大學電子工程學系. 電子研究所碩士班. 碩士論文 A Thesis Submitted to Institute of Electronics College of Electrical Engineering and Computer Science National Chiao Tung University in Partial Fulfillment of Requirements for the Degree of Master of Science in Electronics Engineering June 2005 Hsinchu, Taiwan, Republic of China. 中華民國九十四年六月.

(3) MPEG-4 先進音訊編解碼器之增速及其在 DSP 平台上的實現學生：王盈閔. 指導教授：杭學鳴博士國立交通大學. 電子工程學系電子研究所碩士班. 摘要 MPEG-4 先進音訊編碼(AAC)是非常有效率的音訊壓縮編碼技術。它是由 ISO/IEC MPEG 所制定的一套標準。在本篇論文當中，我們首先分析 MPEG-4 先進音訊編碼器在 DSP 上的執行計算複雜度。發現心理聲學模式(psychoacoustic model)和量化及位元編碼(bit allocation)所花的執行時脈週期為最多，因此針對它們，我們在 DSP 上的實現利用比較快速的演算法主要加速之。在 DSP 實現方面，為了加速先進音訊編碼器，我們針對 DSP 的架構使用了一些程式技巧，包括定點式資料型態、TI DSP 的特殊指定群等等。除此之外，我們也參考了一些快速運行的演算法，並套用在原來的音訊編碼器之心理聲學模式及量化位元編碼上。經由這些的程式修改，最後的編碼器版本在 DSP 上的執行速度比原來的有了 77.89%的改善幅度。並且我們也成功的把先進音訊編碼器及解碼器兩者實現在 II(Innovative Integration) 所提供的 Quixote DSP 平台上。而在主端及客端的傳輸介面，我們採用了緩衝之區塊傳輸模式，此模式讓我們容易實現整個架構。最後經由我們的加速及系統實現，此先進音訊編碼及解碼器各自都可達到即時編解碼的效果。. i.

(4) MPEG-4 AAC Codec Acceleration and DSP Implementation Student: Yin-Ming Wang. Advisor:Dr. Hsueh-Ming Hang. Department of Electronics Engineering Institute of Electronics National Chiao Tung University. Abstract MPEG-4 AAC (Advanced Audio Coding) is an efficient audio coding standard. It is defined by the ISO/IEC (International Standard Organization) MPEG (Moving Pictures Experts Groups) committee. In this thesis, we first analyze the computational complexity of the MPEG-4 AAC encoder program. We find that the PAM (psychoacoustic model) and the quantization and bit allocation module require the most execution cycles on DSP. Hence, we mainly propose methods to accelerate them on DSP. In order to speed up the AAC encoder on DSP, we use several DSP codes acceleration techniques including fixed-point data types, TI (Texas Instruments) DSP intrinsic functions and others. In addition, we accelerate the PAM and the quantization and bit allocation modules by fast algorithms for DSP implementation. Through these modifications, the final AAC encoder version has about 77.89 percent improvement. Furthermore, we also successfully implement both the AAC decoder and encoder on the II’s (Innovative Integration) Quixote DSP board. We adopt the burst block transmission mechanism for communication between the host and the target side. Finally, the speed of the AAC encoder and decoder on DSP implementation can achieve real-time operation.. ii.

(5) 誌謝能夠完成這篇論文，最感謝的是我的指導教授－杭學鳴老師，在這兩年的研究生涯，老師不但給予我專業的知識與指導，並且培養了我認真踏實的研究態度，讓我獲益良多。實驗室完善的設備及認真的環境，也幫助我能夠順利的完成這篇論文，我要感謝實驗室的學長、同學及學弟，特別是楊政瀚、陳繼大、吳俊榮學長給予我很多研究的建議，我也要感謝曾建統學長，他在音訊編碼方面做了很多的研究，讓我更加容易的實現於硬體平台上，還有實驗室一同奮鬥的志楹、景中、昱昇、朝雄、漢光等同學們，讓我在研究過程中遇到困難時，能夠互相討論和砥礪。另外我也要感謝我的朋友給予我生活上的支持及勉勵，使我充滿信心。最後，我要感謝我的父母及家人，沒有你們的栽培與鼓勵，我無法有今天的成就。要感謝的人很多，沒辦法一一詳列，在此，謹以這篇論文，獻給在研究生涯中所有關心以及幫助我的人，謝謝你們。. 王盈閔民國九十四年六月. iii.

(6) List of Figures. Fig. 2.1 Block diagram for MPEG-2 AAC encoder.......................................................5 Fig. 2.2 Block diagram of psychoacoustic model..........................................................7 Fig. 2.3 Block diagram of gain control tool...................................................................8 Fig. 2.4 Window shape adaptation process....................................................................9 Fig. 2.5 Block switching during transient signal conditions........................................10 Fig. 2.6 Prediction tool for one scalefactor band ......................................................... 11 Fig. 2.7 Block diagram of MPEG-4 GA encoder.........................................................14 Fig. 2.8 LTP in the MPEG-4 General Audio encoder ..................................................15 Fig. 2.9 Principle of Perceptual Noise Substitution.....................................................16 Fig. 2.10 TwinVQ quantization scheme ......................................................................17 Fig. 3.1 Innovative Integration's Quixote DSP Baseboard Cark .................................22 Fig. 3.2 Block Diagram of Quixote .............................................................................22 Fig. 3.3 Block diagram of TMS320C6x DSP ..............................................................23 Fig. 3.4 The TMS320C64x DSP Chip Architecture and Comparison with Ancient TMS320C62x/C67x Chip ............................................................................................24 Fig. 3.5 TMS320C64x CPU Data Path........................................................................26 Fig. 3.6 Functional Units and Operations Performed ..................................................27 Fig. 3.7 Functional Units and Operations Performed (Cont.)......................................28 Fig. 4.1 Code development flow of C6000..................................................................35 Fig. 4.2 Intrinsic functions of the TI C6000 series DSP (partial list) ..........................40 Fig. 4.3 Block diagram of original PAM .....................................................................43 Fig. 4.4 Block diagram of proposed PAM ...................................................................44 Fig. 4.5 Requantization operation with three ranges ...................................................48 Fig. 4.6 Block diagram of bit allocation ......................................................................50 Fig. 4.7 Flow chart of the bit allocation algorithm ......................................................52 Fig. 5.1 Structure of AAC decoder implementation on DSP .......................................58 Fig. 5.2 Structure of AAC encoder implementation on DSP .......................................61. iv.

(7) List of Tables. Table 4.1 Profile of AAC encoder on C64x DSP.........................................................36 Table 4.2 Compiler Options for Performance Enhancement .......................................37 Table 4.3 Compiler Options to Avoid on Performance Enhancement .........................37 Table 4.4 Processing time on the C64x DSP for different data types ..........................38 Table 4.5 Comparison between with unrolling and without unrolling ........................39 Table 4.6 The acceleration result of the PAM in the AAC encoder .............................46 Table 4.7 The ODG of test sequence “guitar” .............................................................46 Table 4.8 The ODG of test sequence “organ”..............................................................46 Table 4.9 The ODG of test sequence “eddie_rabbitt”..................................................46 Table 4.10 The acceleration result of the Requantization in the AAC encoder ...........48 Table 4.11 The acceleration result of the bit allocation in the AAC encoder ..............53 Table 4.12 The ODG of test sequence “guitar” ...........................................................53 Table 4.13 The ODG of test sequence “organ”............................................................53 Table 4.14 The ODG of test sequence “eddie_rabbitt”................................................54 Table 4.15 The final acceleration result of the AAC encoder......................................55 Table 4.16 Profile of final modified AAC encoder on C64x DSP...............................55 Table 4.17 Improvement of each part in AAC encoder on C64x DSP ........................55 Table 4.18 The ODG of test sequence “guitar” ...........................................................55 Table 4.19 The ODG of test sequence “organ”............................................................56 Table 4.20 The ODG of test sequence “eddie_rabbitt”................................................56 Table 4.21 The ODG of test sequence “TS_01” ..........................................................56 Table 4.22 The ODG of test sequence “TS_02” ..........................................................56 Table 4.23 The ODG of test sequence “TS_03” ..........................................................56 Table 5.1 Implementation result of AAC decoder on DSP ..........................................59 Table 5.2 Implementation result of AAC encoder on DSP ..........................................62. v.

(8) Contents. 中文摘要.........................................................................................................................i Abstract ..........................................................................................................................ii 致謝.............................................................................................................................. iii List of Figures ...............................................................................................................iv List of Tables..................................................................................................................v Chapter 1 Introduction ...................................................................................................1 Chapter 2 MPEG-2/4 Advanced Audio Coding.............................................................3 2.1 MPEG-2 AAC..................................................................................................3 2.1.1 Psychoacoustic Model ..........................................................................5 2.1.2 Gain Control..........................................................................................7 2.1.3 Filterbank ..............................................................................................8 2.1.4 Prediction ............................................................................................10 2.1.5 Temporal Noise Shaping (TNS).......................................................... 11 2.1.6 Joint Stereo Coding.............................................................................11 2.1.7 Quantization........................................................................................12 2.1.8 Noiseless Coding ................................................................................13 2.2 MPEG-4 AAC Version 1................................................................................13 2.2.1 Long Term Prediction (LTP)...............................................................15 2.2.2 Perceptual Noise Substitution (PNS) ..................................................16 2.2.3 TwinVQ...............................................................................................16 2.3 MPEG-4 AAC Version 2................................................................................17 2.3.1 Error Robustness.................................................................................17 2.3.2 Low-Delay Audio Coding...................................................................18 2.3.3 Fine Grain Scalability .........................................................................19 2.3.4 Parametric Audio Coding....................................................................19 2.3.5 CELP Silence Compression ................................................................19 2.3.6 Extended HVXC .................................................................................20 Chapter 3 DSP Implementation Environment .............................................................21 3.1 DSP Baseboard ..............................................................................................21 3.2 DSP Chip........................................................................................................23 3.2.1 Central Processing Unit (CPU)...........................................................24 vi.

(9) 3.2.2 Data Path.............................................................................................26 3.2.3 Memory...............................................................................................28 3.2.3.1 Internal Memory ......................................................................28 3.2.3.2 External Memory and Peripheral Options ...............................28 3.3 Data Transmission Mechanism ......................................................................29 3.3.1 DSP Streaming Interface.....................................................................30 3.3.2 Burst Block Transmission ...................................................................30 3.3.3 Message Exchange..............................................................................31 Chapter 4 MPEG-4 AAC Encoder Acceleration on DSP ............................................33 4.1 TI's Code Development Environment............................................................33 4.1.1 The Code Composer Studio ................................................................33 4.1.2 Code Development Flow ....................................................................34 4.2 Profile of AAC on DSP..................................................................................36 4.3 DSP Code Acceleration Methods...................................................................36 4.3.1 Setting of Compiler Options ...............................................................36 4.3.2 Fixed-point Coding .............................................................................38 4.3.3 Loop Unrolling....................................................................................38 4.3.4 Using Intrinsics ...................................................................................40 4.3.5 Packet Data Processing .......................................................................40 4.3.6 Register and Memory..........................................................................41 4.3.7 Using Macros ......................................................................................41 4.3.8 Linear Assembly .................................................................................41 4.4 Psychoacoustic Model ...................................................................................42 4.4.1 Optimization of PAM..........................................................................42 4.4.2 Simulation Results on DSP.................................................................45 4.5 Quantization and Bit Allocation.....................................................................46 4.5.1 A High Quality Requantization Method .............................................47 4.5.2 Simulation Results on DSP.................................................................48 4.5.3 Single Loop Distortion Control Algorithm .........................................49 4.5.4 Simulation Results on DSP.................................................................52 4.6 The Final Simulation and Acceleration Results on TI C64x DSP .................54 Chapter 5 MPEG-4 AAC Codec Implementation on DSP ..........................................57 5.1 AAC Decoder Implementation on DSP .........................................................57 5.1.1 Structure of AAC Decoder Implementation........................................58 5.1.2 Implementation Results of AAC Decoder ..........................................59 5.2 AAC Encoder Implementation on DSP .........................................................60 5.2.1 Structure of AAC Encoder Implementation........................................60 5.2.2 Implementation Results of AAC Encoder...........................................61 vii.

(10) Chapter 6 Conclusions and Future Work .....................................................................63 6.1 Conclusions....................................................................................................63 6.2 Future Work ...................................................................................................64 Bibliography ................................................................................................................65. viii.

(11) Chapter 1 Introduction MPEG stands for ISO “Moving Pictures Experts Groups.” It is a group work under the directives of the International Standard Organization (ISO) and the International Electro-technical Commission (IEC). This group work concentrates on defining the standards for coding moving pictures, audio and related data. The MPEG-4 Advanced Audio Coding (AAC) is an efficient audio algorithm standardized by ISO/IEC MPEG committee. The AAC can achieve indistinguishable quality at 128 kbits/s for stereo signals, and at 320 kbits/s for 5.1 multichannel audio. Hence, it can compress audio data at high quality with high compression efficiency. The MPEG-4 AAC mainly inherits MPEG-2 AAC (13818-7) and adds several tools to enhance the coding performance, such as temporal noise shaping (TNS), perceptual noise substitution (PNS), long time prediction (LTP), spectral band replication (SBR) and others. In this thesis, our aim is to implement the MPEG-4 AAC encoder and decoder on the DSP processor. Hence, we adopt the DSP board made by Innovative Integration's Quixote to implement our program. The board houses a Texas Instruments' TMS320C6416 DSP and a Xilinx Virtex-II FPGA. The TI TMS320C6416 fixed-point processor has a rather good performance. Its instruction cycle frequency is 600MHz. It adopts the advanced VelociTI very long instruction word (VLIW) architecture that can execute eight instructions in parallel. In addition, we accelerate the MPEG-4 AAC encoder by some DSP coding techniques and several efficient algorithms. Our contributions are the acceleration of the AAC encoder and the implementation of the AAC encoder and decoder. Through some DSP codes acceleration techniques and the fast algorithms of the PAM (psychoacoustic model) and the quantization and bit allocation modules in AAC encoder, the final AAC encoder version has about 77.89 1.

(12) percent improvement. Furthermore, the speed of the AAC encoder and decoder on DSP implementation can achieve real-time operation. This thesis is organized as follows. In chapter 2, we describe operations of MPEG-2 AAC and MPEG-4 AAC. In chapter 3, we describe the DSP development environment and the communication interface provided by the DSP platform. In chapter 4, we speed up the AAC encoder program on DSP. In chapter 5, we successfully implement the AAC encoder and decoder on DSP platform. Finally, we give a conclusion and future work of our system.. 2.

(13) Chapter 2 MPEG-2/4 Advanced Audio Coding In this chapter, we will briefly introduce several basic concepts and major modules of the MPEG-2/4 AAC (Advanced Audio Coding) system. Details can be found in [1] and [2] respectively.. 2.1 MPEG-2 AAC In 1994, the MPEG-2 audio standardization committee defined a high quality multi-channel standard. It was the first-step of the development of “MPEG-2 AAC”. In 1997 April, the MPEG-2 AAC (ISO/IEC 13818-7) was standardized by the MPEG (Moving Pictures Expert Group). The aim of MPEG-2 AAC was to reach “indistinguishable” audio quality at the data rate of 384 kbps or lower for five full-bandwidth channel audio signals as specified by the ITU-R (International Telecommunication Union, Radio-communication Bureau). Testing results showed that MPEG-2 AAC needs 320 kbps to achieve the ITU-R quality requirements. This result showed that MPEG-2 AAC satisfied the ITU-R specifications. The MPEG-2 AAC provides the transparent audio quality at the cost of discarding MPEG-1 backward-compatibility. The MPEG-2 AAC algorithm combines the coding efficiency of a high-resolution filter bank, prediction techniques, Huffman coding and other tools to achieve the audio quality at low data rates. And like most audio coding schemes, the MPEG-2 AAC algorithm compresses signals by removing the redundancy 3.

(14) between samples and the irrelevant audio signals. We can use time-frequency analysis for removing the redundancy between samples, and use the masking properties of human hearing system to remove irrelevant audio signals. Besides, the MPEG-2 AAC system offers three profiles to fulfill the demand of different tradeoffs between audio quality, memory requirement and system complexity. For this purpose, the three profiles are defined as main profile, low-complexity (LC) profile and scalable sampling rate (SSR) profile. The main profile is intended for use when the processing power, and especially the memory, is not better. The LC profile is intended to use when the computing cycles and memory use are constrained, and the SSR profile is in use when a scalable decoder is required. Next, we will briefly introduce each tool in this section. Fig 2.1 gives an overview of the MPEG-2 AAC encoder block diagram.. 4.

(15) In p u t tim e sig n a l. P e rc e p tu a l M odel. G a in C o n tro l. Legend F ilte r B ank. D a ta C o n tro l. TNS. In te n s ity/ C o u p lin g. Q u a n tiz e d S p e c tru m of P re v io u s F ra m e. P re d ic tio n. B itstre a m M u ltip le x. 1 3 8 1 8 -7 C o d e d A u d io S tre a m. M /S. Ite ra tio n L o o p s S c a le F a c to rs. R a te /D is to rtio n C o n tro l P ro c e ss. Q u a n tiz e r. N o ise le ss C o d in g. Fig 2.1 Block diagram for MPEG-2 AAC encoder [1]. 2.1.1 Psychoacoustic Model The psychoacoustic model is an essential component of the AAC encoder that enables its high performance. The job of the psychoacoustic model is to analyze the input audio signal and determine where the spectrum quantization noise can be allowed and to what extent. Then, the encoder uses this information to decide how to represent the input audio signal in the most way with the given limited number of code bits. In this process, the psychoacoustic model calculates the maximum distortion energy value which can be. 5.

(16) masked by the signal. And this energy is called threshold. The threshold generation process has three inputs. They are: 1. The shift length for the threshold calculation process is called iblen. This iblen must remain constant over any particular application of the threshold calculation process. For long FFT iblen = 1024, for short FFT iblen = 128. 2. For each FFT type, the newest iblen samples of the signal, with the samples delayed (either in the filterbank or psychoacoustic calculation) such that the window of the psychoacoustic calculation is centered in the time-window of the codec time/frequency transform. 3. The sampling rate. There are sets of tables that will be used in the calculation process, and the tables are provided for the standard sampling rates. Sampling rate must necessarily remain constant over one implementation of the threshold calculation process.. The outputs of the psychoacoustic model are: 1. a set of Signal-to-Mask Ratios and thresholds, which are to be used by the encoder. 2. the delayed time domain data (PCM samples), which are to be used by MDCT. 3. the block type for the MDCT. 4. an estimate of the amount of. bits should be used for encoding in addition to the. average available bits.. Fig 2.2 [2] shows the block diagram for the psychoacoustic model in the MPEG-2 AAC encoder. Unlike the psychoacoustic model 1, this model does not make a dichotomous distinction between tonal and non-tonal components. Instead the spectral data is transformed to a “partition” domain and the fractions of the tonal and non-tonal components are estimated in each partition. This fraction ultimately determines the amount of masking. For more detailed procedures for calculation, please see [2].. 6.

(17) Fig 2.2 Block diagram of psychoacoustic model [2]. 2.1.2 Gain Control The gain control tool receives the input time-domain signals, and then ouputs gain_control_data and a gain controlled signal whose length is equal to the length of the MDCT window. The tool consists of a PQF (Polyphase Quadrature Filter), gain detectors and gain modifiers. The PQF divides the input signals into four equal width frequency bands. The gain detectors produce gain control data, which satisfies the MPEG bitstream syntax. They consist of the number of gain changes, the index of gain change positions and the index of gain change level. The gain modifier for each PQF band controls the 7.

(18) gain of each signal band. And the gain control tool can be applied to each of four bands independently. The block diagram for the gain control tool is shown in Fig 2.3.. Fig 2.3 Block diagram of gain control tool [2]. 2.1.3 Filterbank The filterbank maps the signal samples into a spectral representation using a modified discrete cosine transformation (MDCT) with critical subsampling and overlapping subsequent windows. The MDCT employs TDAC (time-domain aliasing cancellation) technique. In the encoder, the filterbank takes in the appropriate block of time samples, modulates them by an appropriate window function, and performs the MDCT. Each block of input samples is overlapped by 50% with the immediately preceding block and the following block in order to reduce the boundary effect. The mathematical expression of the MDCT is. 8.

(19) N −1 ⎡ 2π 1 ⎞⎤ ⎛ X i ,k = 2∑ xi ,n cos ⎢ (n + n0 )⎜ k + ⎟⎥, 2 ⎠⎦ ⎝ n =0 ⎣N. k = 0,1,...,. N −1 2. (2.1). where n. = sample index. N. = transform block length. i. = block index. k. = coefficient index. n0. = (N/2+1)/2. Since the window function has a significant effect on the filterbank frequency response, the filterbank has been designed to allow a change in window length and shape to match to the input signal characteristics. There are two resolutions in AAC, one with 1024 spectral coefficients (one long window) and one with eight sets of 128 coefficients (eight short windows) and the switching between them is supported through the use of transition windows. The encoder also selects the optimal shape for each of these windows between the Kaiser-Bessel-derived window (KBD) with improved far-off rejection and the sine window with a wider main lobe.. Fig 2.4 Window shape adaptation process [2] 9.

(20) Fig 2.5 Block switching during transient signal conditions [2]. 2.1.4 Prediction Prediction is used for an improved redundancy reduction and is very effective in the stationary parts of a signal. The current spectral coefficient is estimated by the predictor based on the corresponding spectral coefficients of the preceding two frames and only the prediction errors need to be transmitted. For each channel prediction is applied to the spectral components resulting from the filterbank. For each spectral component, there is one corresponding predictor resulting in a bank of predictors. Each predictor exploits the auto-correlation between the spectral component values of consecutive frames. The predictor coefficients are calculated from preceding. quantized. spectral. components. in. the. encoder.. A second. order. backward-adaptive lattice structure predictor is working on the spectral component values of the preceding frames. The predictor parameters are adapted to the current signal statistics on a frame-by-frame base, using an LMS-based adaptation algorithm. If the prediction is activated, the quantizer is fed with the prediction error. Fig 2.6 shows the block diagram of prediction unit for one scalefactor band.. 10.

(21) Fig 2.6 Prediction tool for one scalefactor band [2]. 2.1.5 Temporal Noise Shaping (TNS) The Temporal Noise Shaping tool is used to control the temporal shape of the quantization noise within each window of the transform, which is needed for transient and pitched signals. This is done by applying a filtering process to parts of the spectral data of each channel. The tool can provide considerable enhancement to the audio quality for the speech and transient signals.. 2.1.6 Joint Stereo Coding AAC joint stereo coding reduces the needed bitrate for stereo or multichannel signals more efficiently than separate coding of several channels. There are two different joint stereo methods that can be selected for coding of different frequency bands to optimize the resulting bitrate: M/S stereo coding and intensity stereo coding. 1. M/S stereo coding: The decision to code left and right coefficients as either left/right (L/R) or mid/side 11.

(22) (M/S) is made on a noiseless coding band by noiseless coding band basis for all spectral coefficients in the current block. M/S stereo coding is very efficient for near monophonic signals, because it use a sum (M) and a difference (S) channel instead of left and right channels and the difference signals is very small in this case. If the high correlated left and right signals could be summed, the require bits to code this signals will be less. Therefore, when the left and right signals’ correlation is higher than a threshold, the M/S stereo coding tool will operate on transforming the L/R signals to M/S signals. 2. Intensity stereo coding: The intensity stereo coding tool is used to exploit irrelevance between high frequency signals of each pair of channels. It adds high frequency signals from left and right channel and multiplies to a factor to rescale the result. The intensity signals are used to replace the corresponding left channel high frequency signals, and corresponding signals of the right channel are set to zero. In this AAC system, the intensity stereo coding mechanism is implemented in the LC profile.. 2.1.7 Quantization AAC uses the nonuniform power-law quantization, where smaller values are quantized finer, so that quantization noise is stronger at larger values and is easier masked. Scalefactors are used to scale the spectral coefficients before the quantization to be able to control the power of the introduced quantization noise.. ⎡ ⎤ xr (i) ix(i) = sgn(xr (i)) ⋅ NINT ⎢ )0.75 −0.0946⎥ _stepsize 4 quantizer ⎣ 2 ⎦. (2.2). The AAC quantization module consists of three levels. The top level calls a subroutine named “outer iteration loop”, which calls the subroutine “inner iteration loop”. The outer iteration loop (distortion control loop) controls the quantization noise which is produced by the quantization of the frequency domain lines within the inner iteration loop to maintain perceptual performance. The inner interation loop (rate control loop) 12.

(23) calculates the actual quantization of the frequency domain data to maintain bit rate.. 2.1.8 Noiseless Coding In the AAC encoder the input to the noiseless coding module is the set of 1024 quantized spectral coefficients. Since the noiseless coding is done inside the quantizer inner loop, it is part of an iterative process that converges when the total bit count is within some interval surrounding the allocated bit count. The noiseless coding stage in AAC uses sectioning and Huffman coding (entropy coding) and exploits statistical redundancy to efficiently encode the 1024 coefficients without further loss of information. The noiseless coding segments the set of 1024 quantized spectral coefficients, such that a single Huffman codebook is used to code each section. The Huffman coding is used to represent n-tuples of quantized coefficients, with 12 codebooks can be used. The spectral coefficients within n-tuples are ordered and the n-tuple size is two or four coefficients. Each codebook specifies the maximum absolute value that it can represent and the n-tuple size.. 2.2 MPEG-4 AAC Version1 MPEG-4 is formal as its ISO/IEC designation “ISO/IEC 14496”, and it includes the major parts: Systems, Audio, Video and DMIF. Specially, compared to previous MPEG standard, MPEG-4 has the following concepts: universality, scalability, object-based representation, content-based interactivity and natural and synthetic representations. MPEG-4 AAC Version 1 was finalized in October 1998 and became an International Standard in the first months of 1999. It is fully backward compatible with MPEG-2 AAC, and includes some additional tools such as the long term predictor (LTP) tool, perceptual noise substitution (PNS) tool and transform-domain weighted interlaced vector quantization (TwinVQ) tool. The PNS tool and the LTP tool are available to enhance the 13.

(24) coding performance for the noise-like and very tonal signals, respectively. The TwinVQ tool is provided to cover very low bitrates. This new scheme which combined AAC with TwinVQ is officially called "General Audio (GA)." Next, we will briefly introduce these new tools.. Fig 2.7 Block diagram of MPEG-4 GA encoder [2]. 14.

(25) 2.2.1 Long Term Prediction (LTP) The LTP tool is well-known from speech coding and is used to exploit redundancy in the speech signal which is related to the signal periodicity as expressed by the speech pitch. The LTP tool has been integrated into the audio coder where quantization and coding is performed on the input signal. Fig 2.8 shows the combined LTP and coding system.. Fig 2.8 LTP in the MPEG-4 General Audio encoder [2]. The LTP is used to predict the input signal based on the quantized values of the preceding frames which were transformed back to a time domain representation by the inverse filterbank and the associated inverse TNS operation. Comparing this decoded signal to the input signal, the optimum pitch lag and gain factor is determined. Then, the difference between the predicted signal and the original signal is calculated and compared with the original signal. One of them is selected to be coded on a scalefactor band basis depending on which alternative is more favorable. This is achieved by means of the “frequency selective switch” (FSS). The LTP tool provides considerable coding gain for stationary harmonic signals and some gain for non-harmonic tonal signals. Besides, the computational complexity of the 15.

(26) LTP tool is much less than original prediction tool.. 2.2.2 Perceptual Noise Substitution (PNS) The PNS tool allows for a very compact representation of noise-like signal components because only the signaling and the energy information is transmitted once for a scalefactor band instead of the set of quantized and coded spectral coefficients. Therefore, it increases compression efficiency for certain types of input signals. Fig 2.9 shows the PNS concept.. Fig 2.9 Principle of Perceptual Noise Substitution [2]. 2.2.3 TwinVQ The TwinVQ tool is an alternative VQ-based coding kernel. It can provide good coding performance at very low bitrates (at or below 16kbps). When it performs the quantization of the spectral coefficients, the spectral coefficients will first be normalized to a specified target range and then be quantized by using the weighted vector quantization (VQ) process. The Fig 2.10 shows the TwinVQ tool module.. 16.

(27) Fig 2.10 TwinVQ quantization scheme [2]. 2.3 MPEG-4 AAC Version2 MPEG-4 AAC Version 2 was finalized in 1999. Compared to MPEG-4 AAC version 1, it adds some new tools without replacing any existing tools of version 1. So, it is fully backward compatible to version 1. The version 2 provides the following new functionalities: Error Robustness, Low-Delay Audio Coding, Fine grain scalability and so on. Next, we will briefly introduce these new tools in this section.. 2.3.1 Error Rubustness The Error Robustness tools provide improved performance on error-prone transmission channels. The two classes of tools are the Error Resilience (ER) tool and Error Protection (EP) tool. The ER tool reduces the perceived distortion of the decoded audio signal that is 17.

(28) caused by corrupted bits in the bitstream. The following tools are provided to improve the error robustness for several parts of an AAC bitstream frame: Virtual CodeBook (VCB), Reversible Variable Length Coding (RVLC), and Huffman Codeword Reordering (HCR). These tools allow the application of advanced channel coding techniques, which are adapted to the special needs of the different coding tools. The EP tool provides Unequal Error Protection (UEP) for MPEG-4 Audio. UEP is an efficient method to improve the error robustness of source coding schemes. It is used by various speech and audio coding systems operating over error-prone channels such as mobile telephone networks or Digital Audio Broadcasting (DAB). The bits of the coded signal representation are first grouped into different classes according to their error sensitivity. Then error protection is individually applied to the different classes, giving better protection to more sensitive bits.. 2.3.2 Low-Delay Audio Coding The MPEG-4 General Audio Coder provides very efficient coding of general audio signals at low bitrates. However it has an algorithmic delay of up to several 100ms and is thus not well suited for applications requiring low coding delay, such as real-time bi-directional communication. To enable coding of general audio signals with an algorithmic delay not exceeding 20 ms, MPEG-4 Version 2 specifies a Low-Delay Audio Coder which is derived from MPEG-2/4 Advanced Audio Coding (AAC). It operates at up to 48 kHz sampling rate and uses a frame length of 512 or 480 samples, compared to the 1024 or 960 samples used in standard MPEG-2/4 AAC. Also the size of the window used in the analysis and synthesis filterbank is reduced by a factor of 2. No block switching is used to avoid the “look-ahead” delay due to the block switching decision. To reduce pre-echo phenomenon in case of transient signals, window shape switching is provided instead. For non-transient parts of the signal a sine window is used, while a so-called low overlap window is used in case of transient signals. Use of the bit reservoir is minimized in the encoder in order to reach the desired target delay. As one extreme case, no bit reservoir is used at all.. 18.

(29) 2.3.3 Fine Grain Scalability Bitrate scalability, also known as embedded coding, is a very desirable functionality. In order to provide efficient small step scalability for the AAC, the Bit-Sliced Arithmetic Coding (BSAC) tool is available in version 2. This tool is used in combination with the AAC coding tools and replaces the noiseless coding of the quantized spectral data and the scalfactors. BSAC provides scalability in steps of 1kbps per audio channel, which means 2kbps steps for a stereo signal. One base layer bitstream and many small enhancement layer bitstreams are used.. 2.3.4 Parametric Audio Coding The Parametric Audio Coding tools combine very low bitrate coding of general audio signals with the possibility of modifying the playback speed or pitch during decoding without the need for an effects processing unit. In combination with the speech and audio coding tools of version 1, improved overall coding efficiency is expected for applications of object based coding allowing selection and switching between different coding techniques.. 2.3.5 CELP Silence Compression The silence compression tool reduces the average bitrate because of a lower-bitrate compression for silence. In the encoder, a voice activity detector is used to distinguish between regions with normal speech activity and those with silence or background noise. During normal speech activity, the CELP coding as in version 1 is used. Otherwise a Silence Insertion Descriptor (SID) is transmitted at a lower bitrate. This SID enables a Comfort Noise Generator (CNG) in the decoder. The amplitude and spectral shape of this comfort noise is specified by energy and LPC parameters similar as in a normal CELP frame. These parameters are an optional part of the SID and thus can be updated as. 19.

(30) required.. 2.3.6 Extended HVXC The variable bitrate mode of 4.0 kbps maximum is additionally supported in version 2 HVXC. In the version 1 HVXC, variable bitrate mode of 2.0 kbps maximum is supported as well as 2.0 and 4.0 kbps fixed bitrate mode. In version 2, the operation of the variable bitrate mode is extended to work with 4.0 kbps mode. In the variable bit-rate mode, non-speech part is detected from unvoiced signals, and smaller number of bits are used for non-speech part to reduce the average bitrate. When the variable bit-rate mode of 4.0 kbps maximum is used, the average bit rate goes down to approximately 3.0 kbps with typical speech items.. 20.

(31) Chapter 3 DSP Implementation Environment In our project, we choose digital signal processor (DSP) platform to implement MPEG-4 AAC encoder and decoder. The DSP baseboard we use is made by Innovative Integration's (II’s) Quixote, which houses Texas Instruments' TMS320C6416 DSP chip and Xilinx Virtex-II FPGA. In this chapter, we will introduce the DSP baseboard and DSP chip. At the end, the data transmission process from the host PC to the target DSP and vice versa is also introduced.. 3.1 DSP Baseboard The Quixote DSP Baseboard card is shown in Fig. 3.1 and the architecture is shown in Fig. 3.2 [5]. Quixote combines one TMS320C6416 600MHz 32-bit fixed-point DSP with one two- or six-million-gate Virtex-II FPGA on the DSP baseboard, utilizing the signal processing technology to provide extreme processing flexibility and efficiency and deliver high performance. Quixote has 32MB SDRAM for use by DSP and 4 or 8Mbytes zero bus turnaround (ZBT) SBSRAM for use by FPGA. The SDRAM provides a large, fast external memory pool for DSP data and code. The SBSRAM is configured as independent banks for fast data processing storage, directly attached to the FPGA. Developers can build complex signal processing systems by integrating these reusable logic designs with their specific application logic.. 21.

(32) Fig. 3.1 Innovative Integration’s Quixote DSP Baseboard Card [5]. Fig. 3.2 Block Diagram of Quixote [5]. 22.

(33) 3.2 DSP Chip The TMS320C64x fixed-point DSP is using the VelociTI architecture. The VelociTI is a high-performance, advanced VLIW (very long instruction word) architecture, making it an excellent choice for multichannel, multifunctional, and performance-driven applications. VLIW can achieve high performance through increased instruction-level parallelism, performing multiple instructions during a single cycle. Because parallelism takes the DSP well beyond the performance capabilities of traditional superscalar systems, it is the key to high performance. VelociTI is a highly deterministic architecture, having few restrictions on how or when instructions are fetched, executed, or stored. It is this architectural flexibility that is the key to the breakthrough efficiency levels of the TMS320C6000 Optimizing C compiler. VelociTI advanced features include: □ Instruction packing: reduced code size □ All instructions can operate conditionally: flexibility of code □ Variable-width instructions: flexibility of data types □ Fully pipelined branches: zero-overhead branching. Fig 3.3 Block diagram of TMS320C6x DSP [6]. 23.

(34) TMS320C6416 has internal memory includes a two-level cache architecture with 16 KB of L1 data cache, 16 KB of L1 program cache, and 1 MB L2 cache for data/program allocation. Peripherals such as a direct memory access (DMA) controller, power-down logic, and external memory interface (EMIF) usually come with the CPU, while peripherals such as serial ports and host ports are on only certain devices. In the following subsections, we will introduce several important parts of the TMS320C64x DSP Chip.. 3.2.1 Central Processing Unit (CPU) The TMS320C6416 CPU contains of eight independent functional units, sixty-four general purpose registers and control registers. Besides above, it also has the program fetch unit, instruction dispatch unit (attached with advanced instruction packing), instruction decode unit, two data path (A and B, each with four functional units), test unit, emulation unit, interrupt logic and two register files (A and B with respect to the two data paths). The architecture is illustrated in Fig. 3.3 and Fig. 3.4.. Fig. 3.4 The TMS320C64x DSP Chip Architecture and Comparison with Ancient TMS320C62x/C67x Chip.. 24.

(35) The program fetch, instruction dispatch, and instruction decode units can deliver up to eight 32-bit instructions to the functional units during one CPU clock cycle. The processing of instructions occurs in each of the two data paths (A and B), each of which contains four functional units (.L, .S, .M, and .D) and 32 32-bit general-purpose registers. The program pipelining is also the important feature to get parallel instructions working properly. There are three stages of pipelining: program fetch, decode, and execute. In the fetch stage, the program address is generated in the CPU, and then the program address is sent to memory. After a memory read occurs, the fetch packet is received at the CPU. In the decode stage, the instructions in execute packet are assigned to the appropriate functional units. And then, the source registers, destimation registers, and associated paths are decoded for the execution of the instructions in the functional units. The execute stage is composed of five phases, and instructions are executed in the stage. Different types of instructions require different numbers of phases to complete the execution.. 25.

(36) 3.2.2 Data Path. Fig 3.5 TMS320C64x CPU Data Path [6]. There are two general-purpose register files (A and B) in the C6000 data paths as shown in Fig 3.5. The C64x DSP register is double the number of general-purpose registers that are in the C62x/C67x cores, with 32 32-bit registers. The C64x architecture has eight functional units that could be further divided into two data paths A and B. Each path has one unit for multiplication operations (.M), one for logical and arithmetic operations (.L), one for branch, bit manipulation, and arithmetic operations (.S), and one for loading/storing, address calculation and arithmetic operations 26.

(37) (.D). Two cross-paths (1x and 2x) allow functional units from one data path to access a 32-bit operand from the register file on the opposite side. There can be a maximum of two cross-path source reads per cycle. Fig 3.6 and Fig 3.7 show the functional units and its operations.. Fig. 3.6 Functional Units and Operations Performed [7]. 27.

(38) Fig. 3.7 Functional Units and Operations Performed (Cont.) [7]. 3.2.3 Memory 3.2.3.1 Internal Memory The C64x has a 32-bit, byte-addressable address space. Internal (on-chip) memory is organized in separate data and program spaces. When off-chip memory is used, these spaces are unified on most devices to a single memory space via the external memory interface (EMIF). The C64x has two 64-bit internal ports to access internal data memory and a single internal port to access internal program memory, with an instruction-fetch width of 256 bits.. 3.2.3.2 External Memory and Peripheral Options The external memory and peripheral options of C6416 contain □ Large on-chip RAM, up to 7M bits □ Program cache 28.

(39) □ 2-level caches □ 32-bit external memory interface supports SDRAM, SBSRAM, SRAM, and other asynchronous memories for a broad range of external memory requirements and maximum system performance. □ DMA Controller transfers data between address ranges in the memory map without intervention by the CPU. The DMA controller has four programmable channels and a fifth auxiliary channel. □ EDMA Controller performs the same functions as the DMA controller. The EDMA has 16 programmable channels, as well as a RAM space to hold multiple configurations for future transfers. □ HPI is a parallel port through which a host processor can directly access the CPU’s memory space. The host device has ease of access because it is the master of the interface. The host and the CPU can exchange information via internal or external memory. □ McBSP (multichannel buffered serial port) is based on the standard serial port interface found on the TMS320C2000 and C5000 platform devices. Besides, the port can buffer serial samples in memory automatically with the aid of the DMA/EDMA controller. It also has multichannel capability compatible with the T1, E1, SCSA, and MVIP networking standards.. 3.3 Data Transmission Mechanism Many applications of the Quixote baseboard involve communication with the host CPU in some manner. All applications at a minimum must be reset and downloaded from the host, even if they are isolated from the host after that. The simplest method supported is a mapping of Standard C++ I/O to the Uniterminal applet that allows console-type I/O on the host. This allows simple data input and control and the sending of text strings to the user. The next level of support is given by the Packetized Message Interface. This allows more complicated medium rate transfer of commands and information between the host and target. It requires more software support on the host than the Standard I/O 29.

(40) does. For full rate data transfers Quixote supports the creation of data streaming to the host, for the maximum ability to move data between the target and host. On Quixote baseboards, a second type of busmaster communication between target and host is available for use, the CPU Busmaster interface.. 3.3.1 DSP Streaming Interface The DSP Streaming interface is bi-directional. Two streams can run simultaneously, one running from the analog peripherals through the DSP into the application. This is called the “Incoming Stream”. The other stream runs out to the analog peripherals. This is the “Outgoing Stream”. In both cases, the DSP needs to act as a mediator, since there is no direct access to analog peripherals from the host. This arrangement allows the DSP to process the streams as they move from the application to the hardware. DSP Streaming is initiated and started on the Host, using the Caliente component. On the target, the DSP interface uses pair of DSP/BIOS Device Drivers, PciIn (on the Outgoing Stream) and PciOut (on the Incoming Stream), provided in the Pismo peripheral libraries for the DSP. They are capable of copying blocks of data between target SDRAM and host bus-master memory via the PCI interface.. 3.3.2 Burst Block Transmission The interface is based on a streaming model where logically data is an infinite stream between the source and destination. This model is more efficient because the signaling between the two parties in the transfer can be kept to a minimum and transfers can be buffered for maximum throughput. On the other hand, the streaming model has relatively high latency for a particular piece of data. This is because a data item may remain in internal buffering until subsequence data accumulates to allow for an efficient transfer. The interface uses a different model: it transfers discrete blocks between the source and destination. Each data buffer is transferred completely to the destination in a single operation. The data buffers transferred can be of different sizes. At the destination, the destination buffer is re-sized to allow the incoming data to fit. In this simple blocking interface, there are sending and receiving functions can be 30.

(41) used. The sending function will not return until the transfer has completed and the buffer is ready for reuse. Similarly, the receiving function waits until data has arrived from the data source and transferred into the data buffer before returning.. 3.3.3 Message Exchange Besides the above interfaces, the DSP and host have a lower bandwidth communications link for sending commands or out-of-band information between target and host. Software is provided to build a packet-based message system between the target and the host. These packets can provide a simple yet powerful means of sending commands and information across the link. A set of sixteen mailboxes in each direction to and from host are shared with the DSP to allow for an efficient message mechanism that complements the streaming interface. The maximum data rate is 56 kbps, and the higher data rate requirements should use the streaming interface.. 31.

(42) 32.

(43) Chapter 4 MPEG-4 AAC Encoder Acceleration on DSP In this chapter, we will describe the MPEG-4 AAC code acceleration on DSP. We will first introduce TI’s code development environment, describe how to optimize the C/C++ code for DSP architecture, and then discuss how to optimize the functions for DSP execution.. 4.1 TI’s Code Development Environment In this section, we will briefly introduce the CCS (Code Composer Studio) tool for DSP, and describe how to develop C/C++ code for the given DSP architecture.. 4.1.1 The Code Composer Studio The Code Composer Studio (CCS) is a helpful tool for developing the DSP codes. We briefly describe some of its features related to our implementation below. The details can be found in [6].. □ Compiles code and generates Common Object File Format (COFF) output file. □ Provides debug options such as step over, step in, step out, run free, and so on. □ Watches any memory sections when the DSP halts. □ Probes a PC file stream into or from the target memory location. 33.

(44) □ Counts the instruction cycles between successive profile-points.. We mainly use the CCS tool for debugging, refining, optimizing, and implementing our C codes on DSP. The profile-points help us to evaluate if our changes to the codes are better or not. Besides, we must write the host code and target code with the burst block transmission in order to implement our system on the DSP platform.. 4.1.2 Code Development Flow The DSP code development can be divided into three steps. Step1：Develop the C code like standard ANSI C code without any regard to the particular structure of the C64x. Then, use the debugger to profile the code to identify any inefficient areas in the code. If the performance is not satisfactory, go to step2. Step2：Use DSP intrinsics and optimization techniques for code generation to improve the C codes. Refine the C code procedures such as compiler options, intrinsics, statement, data type modifiers, and code transformations. If the code efficiency is still not sufficient, proceed to step3. Step3：Extract the most time-critical areas and replace the C code with linear assembly, then use assembly optimizer to optimize the code, such as resource allocation. Generally, we do not go to step3 because the linear assembly is too detail. Doing assembly programming is difficult and assembly codes are hard to maintain. The recommended code development flow involves utilizing the C6000 code generation tools to aid optimization rather than forcing the programmer to code by hand in assembly. These advantages allow the compiler to do the instruction selection, parallelizing, pipelining, and register allocation. Figure 4.1 shows the steps of the software development flow [6].. 34.

(45) Fig. 4.1 Code development flow of C6000. 35.

(46) 4.2 Profile of AAC on DSP We do the essential modifications on the MPEG-4 AAC source C code, and then implement the modified C code on DSP. In order to identify the computational intensive parts of the MPEG-4 AAC encoder, we first use TI CCS profiler to analyze it. Our test sequence is “guitar”, and the data length is about 0.1 second. Table 4.1 shows the profile results at 64k bit rate. We can see clearly that the psycho-acoustic model and the quantization and bit-allocation module are two major computational parts of the AAC encoder. Therefore, we should accelerate these two parts. Function. Execution cycles. Percent (%). Total. 2,126,810,017. 100. Psycho-acoustics. 980,246,737. 46.09. Filterbank. 137,817,289. 6.48. Quantization and Bit-allocation. 1,000,238,751. 47.03. Others. 8,507,240. 0.4. Table 4.1 Profile of AAC encoder on C64x DSP. 4.3 DSP Code Acceleration Methods Improving the execution cycles of the AAC encoder is the main task of our system implementation on DSP. In this section, we will describe several methods that can accelerate our code and reduce the execution time on the C64x DSP. Some of these methods are supported by the features of C64x.. 4.3.1 Setting of Compiler Options The Code Composer Studio (CCS) is a useful GUI tool that helps programmers in 36.

(47) developing DSP codes. Its compiler offers a complicated optimization process that includes several advanced techniques and it takes advantages of the features of the C6000 architecture. Hence, we can configure some setting of the compiler options to optimize our DSP codes efficiently. Table 4.2 shows the compiler options for performance enhancement and Table 4.3 shows those to avoid.. Table 4.2 Compiler Options for Performance Enhancement [6]. Table 4.3 Compiler Options to Avoid on Performance Enhancement [6]. 37.

(48) 4.3.2 Fixed-point Coding The C6000 compiler defines a size for each data type: □ char. 8bits. □ short. 16bits. □ int. 32bits. □ long. 40bits. □ float. 32bits. □ double 64bits. The C64x DSP is a fixed-point processor, so it can only perform fixed-point operations. Although the C64x DSP can simulate floating-point processing, it takes a lot of extra clock cycles to do the same job. The “char”, “short”, “int” and “long” are the fixed-point data types, and the “float” and “double” are the floating-point data types. We test C64x DSP processing time of the assembly instructions “add” and “mul” for different data types. Table 4.4 shows the results. We can clearly see that the floating-point data types need more computation time than the fixed-point data types. Hence, we can accelerate our DSP codes in computation time efficiently by converting the data types from floating-point to fixed-point. Assembly Instruction. Char. short. int. long. float. double. 8-bit. 16-bit. 32-bit. 40-bit. 32-bit. 64-bit. add. 1. 1. 1. 2. 77. 146. mul. 2. 2. 6. 8. 54. 69. Table 4.4 Processing time on the C64x DSP for different data types. 4.3.3 Loop Unrolling Loop unrolling expands the loops so that all iterations of the loop appear in the code.. 38.

(49) It often increases the number of instructions available to execute in parallel. When our codes have conditional instructions, sometimes the compiler may not be sure that the branch will occur or not. It needs more waiting time for the decision of branch operation. If we do loop unrolling, some of the overhead for branching instruction will be reduced. Example 4.1 is the loop unrolling and table 4.5 shows the result.. (a) /*Before unrolling*/. (b) /*After unrolling*/. int i,a=0,b=0; for (i=0;i<10;i++) { a+=i; b+=i; }. int i=0,a=0,b=0; a+=i; b+=i; i++; a+=i; b+=i; i++; a+=i; b+=i; i++; a+=i; b+=i; i++; a+=i; b+=i; i++; a+=i; b+=i; i++; a+=i; b+=i; i++; a+=i; b+=i; i++; a+=i; b+=i; i++; a+=i; b+=i; i++;. Example 4.1 loop unrolling. (a). (b). Execution cycles. 436. 206. Code size. 116. 476. Table 4.5 Comparison between with unrolling and without unrolling We can see clearly that the clock cycle decreases after loop unrolling, but the code size is larger than the original. So generally speaking, if one iteration can execute many instructions, the code size is larger, but it runs faster.. 39.

(50) 4.3.4 Using Intrinsics The TI C6000 compiler provides many special functions that map C codes directly to inlined C64x instructions, to increase C code efficiently. These special functions are called intrinsics. So if the instructions have equivalent intrinsic functions, we can replace them by intrinsic functions directly. The execution time will be decreased because of using intrinsics. Fig 4.2 shows some examples of the intrinsic functions for the C6000 DSP. The entire list of intrinsics for the C6000 DSP can be found in [6].. Fig 4.2 Intrinsic functions of the TI C6000 series DSP (partial list) [6]. 4.3.5 Packet Data Processing We often use a single load or store instruction to access multiple data consecutively located in memory to maximize data throughput. For example, if we can place four 8-bit data or two 16-bit data in a 32-bit space, we can do four or two operations in one clock cycle. This method can improve the code efficiency substantially. In addition, some of the intrinsic functions enhance the efficiency in a similar way.. 40.

(51) 4.3.6 Register and Memory When the accessed data are located in the external memory, we need more clock cycles in transfering data time. So we can use registers to store data in order to reduce transfer time in operation. In DSP code, the pointer, malloc function and so on will locate data in memory. Therefore, sometimes we can adequately modify code to avoid frequently accessing data from/to memory so that the execution time will be decreased.. 4.3.7 Using Macros Because the software-pipelined loop can not contain function calls, it takes more clock cycles to complete the function call. Hence, we can change the functions to the “define” macros under some conditions. In addition, replacing the function with the macro can cut down the code for initial function definition and reduce the number of branches. However, macros are expanded each time they are called if the function has a number of instructions, it is not efficient in memory usage.. 4.3.8 Linear Assembly When we are not satisfied with the efficiency of assembly codes which are generated by the TI CCS compiler, we can convert parts of the C codes into linear assembly and then optimize the assembly directly. But this process generally is too detail and very time consumption in practice. Hence, we will do this process at last if we have strict constrains in processor performance and we have no other algorithms selection.. 41.

(52) 4.4 Psychoacoustic Model From AAC encoding profile in table 4.1, we can see clearly that the psychoacoustic model plays an important role in execution time. To improve the performance, we replace the psychoacoustic model with a new algorithm that was proposed by [9]. Next, we will briefly describe the new algorithm and show the simulation results after improving its performance.. 4.4.1 Optimization of PAM Firstly, we briefly describe the original psychoacoustic model of AAC encoder. Fig 4.3 [9] is this block diagram of PAM (psychoacoustic model). In steps 1-2, the auditory spectrum is computed using the FFT. Then, the real-part spectrums lead to the calculation of partitioned energy, and the imaginary-part spectrums result in the calculation of the unpredictability measure c(w). The unpredictablility measure is first weighted with the energy in each partition, deriving a partitioned unpredictability measure. Then in step 5, both partitioned energy and unpredictability are convolved with the spreading function in order to estimate the effects across the partitioned bands. For each partition, the ratio of the convolved partitioned unpredictability over the convolved partitioned energy spectrum is determined. Then, the tonality index is derived from the logarithm of this ratio in step 6 to indicate if a signal is tonal-like. SNR (Signal-to-Noise Ratio) is computed from the tonality tb(b) in step 7 and then the masking partitioned energy threshold nb(b) is estimated in steps 8-10 and thus the masking curve is estimated. PE (Perceptual Entropy) is calculated for each block type from the ratio e(b)/nb(b) in steps 11-12 to determine the block type. Finally, SMR (Signal-to-Mask Ratio) is computed in step 13 as the output. These SMRs are then sent to the bit allocation routine to determine the number of bits allocated to each subband.. 42.

(53) Fig 4.3 Block diagram of original PAM [9] The step 2 and step 4 have high computational complexity because they include sophisticated mathematical functions. The step 5 includes spreading functions calculations and convolutions, so it also has high computational complexity. Next, the algorithm proposed by [9] that can reduce computational complexity of above mentioned steps will be described. It consists of two points: Reduce calculations of spreading functions as fixed-coefficients The calculations of spreading functions in step 5 are a series of complex functions such as comparisons, square roots, power of tens, squares, and divisions. They are calculated at the square number of partitioned bands and repeatedly estimated every frame. Besides, the spreading functions are only determined by sampling rate and block types. Therefore, we can reduce the calculations by replacing them with fixed-coefficients.. MDCT-based PAM We know that there is one main filterbank MDCT outside the PAM and there is. 43.

(54) another filterbank FFT inside the PAM transforming input samples into spectrums in similar ways. Therefore, we replace FFT by MDCT so that the FFT could be omitted in order to decrease computational complexity. Steps 2-4 are thus calculated on the MDCT, but step 5 requires some modification that only the partitioned energies are convolved with the spreading functions mentioned above because of the lack of phase information. Step 6 is also modified where Spectral Flatness Measure (SFM) [11] is used to generate the tonality index from the MDCT coefficients. The SFM is defined as the ratio of the geometric mean (Gm) of the power spectrum to the arithmetic mean (Am) of the power spectrum. Then, the SFM is converted to decibels. And the tonality index tb can be computed by this formula: tb = min (SFMdB / 60 , 1).. Finally, Fig 4.4 [9] is the block diagram of proposed PAM described in it. The steps 2-6 are using MDCT and SFM, and the spreading functions in step 5 are computed with fixed coefficients.. Fig 4.4 Block diagram of proposed PAM [9] 44.

(55) 4.4.2 Simulation Results on DSP In this section, we have simplified the psychoacoustic model by the algorithm in section 4.4.1 and implemented it on the TI C64x DSP. The fast algorithm result is shown in Table 4.6. Our test sequence is “guitar”, and the length is about 0.1 second. In the original AAC encoder program, the PsyInit function calculates the spreading function. The PsyBufferUpdate function contains the FFT calculation. And the PsyCalculate function does the masking threshold calculation. We can clearly see that the acceleration of the PAM is effective by this new algorithm. Also, we have done the sound quality test. Using the ITU-R BS.1387 PEAQ (perceptual evaluation of audio quality) defined ODG (objective difference grade), we examine some sequences using the fast algorithm. The ODG, which is a measure of quality, is calculated as the difference between the quality rating of the reference and the test signal. The quality ratings are measured with a range of [-4;0], where -4 stands for very annoying difference and 0 stands for imperceptible difference between the reference and the test signal. This parameter represents the audio quality well for good quality codecs. The first test sequence is “guitar” and it has sound variations and is quite complex. The second test sequence is “organ” and it is another instrument music. But its sound is consecutive and delicate. The third test sequence is “eddie_rabbitt” and it is pop music with human voice. The test results are shown in Tables 4.7, 4.8, and 4.9. From these results, the quality test seems acceptable and the acceleration is good. The overall speed-up is around 80 percent, and the ODG drop is less than 0.3 or so. Original. Code size. Execution cycles. PsyInit. 7824. 81,831,864. PsyBufferUpd ate. 2312. 21,822,065. PsyCalculate. 548. 63,774,150. 45. Improvement (%).

(56) Fast algorithm. Code size. Execution cycles. PsyInit. 5336. 63,587,461. 22.3. PsyBufferUpd ate. 432. 24,004. 99.89. PsyCalculate. 408. 10,630,861. 83.33. Table 4.6 The acceleration result of the PAM in the AAC encoder. ODG. 16 kbps. 32 kbps. 64 kbps. 96 kbps. 128 kbps. 196 kbps. 256 kbps. Original. -3.53. -3.37. -0.99. -0.38. -0.26. -0.01. -0.01. Modified. -3.68. -3.62. -1.22. -0.69. -0.57. -0.36. -0.28. Table 4.7 The ODG of test sequence “guitar”. ODG. 16. 32. 64. 96. 128. 196. 256. kbps. kbps. kbps. kbps. kbps. kbps. kbps. Original. -3.89. -3.83. -2.76. -0.41. -0.03. -0.01. -0.01. Modified. -3.76. -3.67. -2.79. -0.29. -0.03. -0.00. -0.01. Table 4.8 The ODG of test sequence “organ”. ODG. 16 kbps. 32 kbps. 64 kbps. 96 kbps. 128 kbps. 196 kbps. 256 kbps. Original. -3.78. -3.40. -0.87. -0.27. -0.11. -0.00. -0.00. Modified. -3.78 -3.79 -1.23 -0.62 -0.46 -0.21 Table 4.9 The ODG of test sequence “eddie_rabbitt”. -0.00. 4.5 Quantization and Bit Allocation The quantization and bit allocation module is essential in AAC. Its operation relies on the information from the psychoacoustic model that provides the best possible listening quality. And from Table 4.1, we can know that they have high computation load. Next,. 46.

(57) we use efficient algorithms [12] [13] to replace the original models.. 4.5.1 A High Quality Requantization Method The requantization is used to calculate quantization error in the outer loop of the bit allocation module. The requantization is described by the following formula:. x _ invquant = Sign ( x _ quant ) ⋅ x _ quant. 4. 3. The calucation of X4/3 have high computation load, so the table lookup method is adopted. We adopt the high quality requantization algorithm in [12] to improve the speed of our system. The algorithm uses linear interpolation for each range of requantization and reduces the approximated error quite efficiently. With this approach, the codec maintains high quality result. The X has a wide range, so using the whole range lookup table is not suitable. To reduce memory usage, this approach uses a 256-entry lookup table instead of the whole range table and its basic operation is shown in the following equation:. X. 4. 3. ⎛X ⎞ = ⎜ × 8⎟ ⎝ 8 ⎠. 4. 3. ⎛X ⎞ =⎜ ⎟ ⎝ 8 ⎠. 4. 3. × 16. This means that a 256-entry lookup table, which stores the values of X4/3 from X=1 to 256 respectively, can be used. And it uses a directly linear interpolation for the other two ranges of requantization. The request quantized values are 8191, and there are three ranges with its dedicated operation for the whole range as shown in Fig 4.5 [12]:. 47.

(58) Fig 4.5 Requantization operation with three ranges [12]. 4.5.2 Simulation Results on DSP Using the above mentioned algorithm, the acceleration result of the requantization on the TI C64x DSP is shown in Table 4.10. Our test sequence is the same as above section. In program, the AACQuantizeInit function consists of the table calculation of X4/3. From this simulation result, we can clearly see that this algorithm is quite efficient because the acceleration rate achieves 40.63 percent. Besides, the data precision is high enough so that it does not affect the accuracy loss. AACQuantizeInit. Code size. Execution cycles. Original. 912. 246,364,788. Fast algorithm. 1572. 146,274,298. Improvement (%). 40.627. Table 4.10 The acceleration result of the Requantization in the AAC encoder. 48.

(59) 4.5.3 Single Loop Distortion Control Algorithm Firstly, we briefly the bit allocation algorithm adopted at AAC specification. Fig 4.6 [13] shows the bit allocation processes in MPEG-4 AAC encoder. The outer iteration loop controls the quantization noise, which comes from the quantization of the spectral signals within the inner iteration loop. The noise spectrum is computed by multiplying the values within the scalefactor bands with the actual scalefactors before quantization. After quantization, the calculation of the quantization noise is processed band-by-band iteratively. If the noise exceeds the specified threshold, the spectral values of the scalefactor bands are amplified by increasing the scalefactor by one. In the inner loop, the encoder iteratively employs the nonuniform quantization, which has three major steps including the quantization of the spectral values, the calculation of actual number of bits using Huffman tables, and the computation of the resulting noise. By applying these three steps iteratively for each frame, the bit allocation algorithm provides the actual number of bits needed to encode the source. The quantizer uses the scalefactor and global gain to fit the requirement for the bit allocation, so we can estimate the quantization noise from the scalefactor and global gain. And it adopts three extra conditions to stop the iterative process as follows: 1. None of the scalefactor bands has more than the allowed distortion. 2. The next iteration would cause the amplification for any of the bands to exceed the maximum allowed distortion value. 3. The next iteration would require all the scale factor bands to be amplified.. 49.

(60) Fig 4.6 Block diagram of bit allocation [13]. Based on the proposed scheme in [13], we replace the iterative process but retain roughly the same listening quality. In the noise shaping scheme, the estimated noise is used to decide the value of scalefactors that can parallelize the quantization noise spectrum with the masking threshold. Hence the algorithm reduces the times of outer loop to one step and thus it provides a significant reduction of execution time. The quantization of the spectral signals xri(j) and its approximation can be derived from 0.75 ⎛ ⎛ xri ( j ) ⋅ scalei ⎞ 0.75 ⎞ ⎛ scalei ⎞ 0.75 ⎜ ⎟ ixi ( j ) = n int ⎜ ⎟ − 0.0946 ⎟ = ( xri ( j )) + ei ( j ) ⋅ ⎜ gain ⎟ gain ⎜⎝ 2 ⎠ ⎝ 2 ⎠ ⎝ ⎠. (. 50. ).