Linear Assembly - DSP Code Acceleration Methods

Chapter 4 DSP Implementation Environment

4.4 DSP Code Acceleration Methods

4.4.7 Linear Assembly

When we are not satisfied with the efficiency of the assembly code generated by the TI CCS compiler, we can replace parts of the code with assembly and then optimize the assembly directly. In practice, however, this process is generally too detailed and very time-consuming. Hence, we take this step last, and only if the processing performance constraints are very strict and no alternative algorithm is available.

Chapter 5 Acceleration of MPEG Surround Encoder on DSP Platform

In this chapter, we present several acceleration methods for the MPEG Surround encoder.

We first introduce the MPEG Surround reference software. By analyzing the complexity of the reference encoder, we determine which parts need to be sped up. Then, we propose several methods to reduce the computing time. After proposing the algorithms, we also implement an MPEG Surround-HEAAC encoder on the DSP platform. Finally, we show the acceleration results.

5.1 Software Environment

5.1.1 MPEG Surround RM0 Reference Software

The MPEG Surround group of the MPEG standard committee provides a piece of reference software [25]. It is written in the C programming language for the codec specified by ISO/IEC 23003-1. It was originally developed by Agere Systems, Coding Technologies, Fraunhofer IIS, and Philips. We will introduce the MPEG Surround reference encoder in the next section.

5.1.2 Command Line Switches

In the reference software, there are two command line arguments that control the encoder configuration. The first argument is Tree-Config, which defines the tree structure configuration according to Table 5.1. The 5151 and 5152 configurations make use of five R-OTT modules to produce a mono downmixed output. The encoder with the 525 configuration consists of three R-OTT modules and an R-TTT module, and produces a stereo output.

Table 5.1: Tree structure of the MPEG Surround reference software encoder [10]

5151 Tree Structure

5152 Tree Structure

525 Tree Structure

The reference encoder provides two time resolutions for extracting the spatial parameters. A spatial frame contains 32 time slots. The second argument, ParamSlot, defines the time slot to which each parameter set applies, and it can be set to 16 or 32. In other words, either two parameter sets (ParamSlot = 16) or one parameter set (ParamSlot = 32) are extracted per spatial frame.

Table 5.2 shows the average spatial parameter bitrate for the different argument settings.

Table 5.2: Bitrates for different argument settings

| | Tree structure 5151 | Tree structure 5152 | Tree structure 525 |
|---|---|---|---|
| ParamSlot = 16 | 20.6 kbps | 20.7 kbps | 16.4 kbps |
| ParamSlot = 32 | 9.8 kbps | 9.8 kbps | 7.9 kbps |

5.2 MPEG Surround Encoder Complexity Analysis

We profile the MPEG Surround encoder to find which parts take the most computation time. We take the profiles in two ways: one uses the C64xx simulator and the other uses the C6416 simulator. The optimization level is set to -o3 (file level). The profiling results of the two methods are shown in Figures 5.1 and 5.2, respectively. The test audio sequence is “choir.wav”, a 5.1-channel sequence sampled at 44.1 kHz. The Tree-Config setting is 5151 and ParamSlot is set to 32.

The C64xx simulator simulates the execution cycles of the C64xx core processor with a flat memory system. That is, it does not simulate the cycles needed to access the peripherals and cache system. In contrast, the C6416 simulator supports L1D, L1P, L2 cache, EDMA, Timer, SDRAM and Generic sync RAM Memory models. Thus, the profiling result of the C6416 simulator is closer to the actual cycles on the DSP hardware.

Table 5.3: Cycles on different simulators

| | C64xx simulator (cycles) | C64xx simulator (%) | C6416 simulator (cycles) | C6416 simulator (%) |
|---|---|---|---|---|
| Total cycles | 6,652,171,801 | 100% | 109,392,624,928 | 100% |
| Initialization | 546,109,415 | 8.2% | 10,892,715,255 | 10.0% |
| QMF Analysis | 4,044,561,270 | 60.8% | 59,290,420,712 | 54.2% |
| Hybrid Analysis | 1,105,261,974 | 16.6% | 20,850,080,358 | 19.1% |
| Ottbox | 65,800,045 | 1.0% | 516,609,513 | 0.5% |
| Hybrid Synthesis | 323,692 | 0.004% | 4,749,476 | 0.004% |
| QMF Synthesis | 871,205,627 | 13.1% | 17,506,903,336 | 16.0% |
| others | 18,909,778 | 0.3% | 331,146,278 | 0.3% |

Figure 5.1: Profiling of the MPEG Surround encoder on the C64xx simulator (chart segments: QMF Analysis, Hybrid Analysis, QMF Synthesis, Initialization, Ottbox, Hybrid Synthesis, others)

Figure 5.2: Profiling of the MPEG Surround encoder on the C6416 simulator (chart segments: QMF Analysis, Hybrid Analysis, QMF Synthesis, Initialization, Ottbox, Hybrid Synthesis, others)

5.3 Memory System

From the profiling results, the total cycle count on the C6416 simulator is approximately sixteen times that on the C64xx simulator, because the C64xx simulator ignores the memory accesses for instructions and data. In other words, about 94% of the total cycles are spent on memory stalls, which means that the system spends much of its time transferring data. If we use the memory system more efficiently, the stall cycles can be decreased.
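The ratio and the stall share quoted above follow directly from the total cycle counts in Table 5.3. The short check below is only an illustration; it assumes that the C64xx total approximates the stall-free core cycles of the C6416 run, and the variable names are ours.

```python
# Total cycle counts taken from Table 5.3.
C64XX_TOTAL = 6_652_171_801     # flat-memory simulator (no stalls modeled)
C6416_TOTAL = 109_392_624_928   # device simulator with caches and external memory

# How many times slower the run becomes once the memory system is modeled.
slowdown = C6416_TOTAL / C64XX_TOTAL

# Share of the C6416 cycles that is not explained by core execution,
# assuming the C64xx total approximates the stall-free core cycles.
stall_share = 1.0 - C64XX_TOTAL / C6416_TOTAL

print(f"slowdown    = {slowdown:.1f}x")
print(f"stall share = {stall_share:.1%}")
```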

We can collect the cache statistics generated by the C6416 simulator; the result is shown in Table 5.4. The core cycles are only 6% of the total cycles, and the data cache hit rate is less than 1%. In the memory hierarchy of our DSP platform, the L1D cache is too small, so data cache misses occur frequently. When a data cache miss occurs, additional stall cycles are required to access the data in the external memory. However, as mentioned in Section 4.2.2, the architecture of the TI C6000 family provides a two-level cache memory. Enabling the L2 cache brings a great improvement, as shown in Table 5.4. The data cache hit rate increases to more than 99%, so the stall cycles decrease significantly, and the share of core cycles rises to 94%. In this project, the two-level cache configuration gives the largest benefit.

Table 5.4: The effect of using L2 cache memory (C6416 simulator)

| Event | Original (cycles) | Original (%) | L2 cache (cycles) | L2 cache (%) |
|---|---|---|---|---|
| Total cycles | 109,392,624,928 | N/A | 7,046,624,226 | N/A |
| Core cycles (excl. stalls) | 6,650,964,730 | 6.06 | 6,650,933,220 | 94.1 |
| NOP cycles | 1,466,138,665 | 22.04 | 1,466,131,719 | 22.04 |
| Stall cycles | 102,761,643,954 | 93.94 | 415,647,210 | 5.9 |
| Cross path stalls | 20,708,884 | 0.02 | 20,708,819 | 0.29 |
| L1P stall cycles | 66,534,794 | 0.06 | 107,803,638 | 1.53 |
| L1D stall cycles | 102,675,148,392 | 93.86 | 286,779,377 | 4.07 |
| Instruction cache hits | 1,784,119,231 | 98.31 | 1,774,843,175 | 97.8 |
| Instruction cache misses | 30,710,512 | 1.69 | 39,996,118 | 2.2 |
| Data cache references | 1,365,575,405 | N/A | 1,365,580,304 | N/A |
| Data cache reads | 675,741,828 | 49.48 | 675,744,284 | 49.48 |
| Data cache writes | 689,833,562 | 50.52 | 689,836,072 | 50.52 |
| Data cache hits | 6,448,798 | 0.47 | 1,359,374,412 | 99.55 |
| Data cache read hits | 6,238,724 | 0.92 | 671,482,721 | 99.37 |
| Data cache write hits | 210,074 | 0.03 | 687,891,691 | 99.72 |
| Data cache misses | 1,359,126,607 | 99.53 | 6,205,892 | 0.45 |
| Data cache read misses | 669,503,116 | 99.08 | 4,261,532 | 0.63 |
| Data cache write misses | 689,623,491 | 99.97 | 1,944,360 | 0.28 |

5.4 Floating-point to Fixed-point Conversion

The MPEG Surround reference software is written in C with floating-point data types. However, running floating-point code on a fixed-point DSP such as the C64x is inefficient. Hence, to reduce the execution cycles, it is necessary to convert the floating-point code to fixed-point code.

5.4.1 Word-length Determination

During the conversion, we collect dynamic range information while the functions execute. Based on the collected information, we choose an appropriate word-length for each variable. The word-lengths must be chosen carefully to avoid overflow. Figure 5.3 shows the converted fixed-point data word-lengths between the modules in the MPEG Surround encoder.
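As an illustration of how a measured dynamic range can be turned into a fixed-point format, the sketch below picks the number of fractional bits for a signed word so that the largest observed magnitude does not overflow. It is a minimal example under our own assumptions (two's-complement words, rounding plus saturation); the function names are hypothetical and are not taken from the reference software.

```python
import math

def fractional_bits(max_abs: float, word_length: int = 16) -> int:
    """Fractional bits that let max_abs fit in a signed word without overflow."""
    if max_abs <= 0.0:
        return word_length - 1               # every bit except the sign bit
    # Integer bits needed so that max_abs < 2**int_bits.
    int_bits = max(0, math.floor(math.log2(max_abs)) + 1)
    frac_bits = word_length - 1 - int_bits
    if frac_bits < 0:
        raise ValueError("dynamic range does not fit the word length")
    return frac_bits

def to_fixed(value: float, frac_bits: int, word_length: int = 16) -> int:
    """Quantize with rounding and saturate to the signed range of the word."""
    lo = -(1 << (word_length - 1))
    hi = (1 << (word_length - 1)) - 1
    q = int(round(value * (1 << frac_bits)))
    return max(lo, min(hi, q))

# Hypothetical profiling result: the largest magnitude seen for a variable.
observed_max = 3.7
frac = fractional_bits(observed_max, 16)
print(frac, to_fixed(observed_max, frac))
```

A wider word (for example 32 bits, as used after the hybrid filterbank) simply leaves more fractional bits for the same dynamic range.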

Figure 5.3: Data word-lengths of the fixed-point encoder

The input and output waveforms are stored in 16-bit PCM files. After passing through the analysis QMF filters, the input data are transformed to the sub-band domain with a 16-bit representation. The low frequency QMF coefficients are fed into a second-stage filterbank. The output coefficients of the analysis hybrid filterbank are stored in a 32-bit data type to maintain their accuracy. Then, the R-Ottboxes perform the downmix, and the word-length of the downmixed signal (in the sub-band domain) is 32 bits.

5.4.2 Simulation Results on DSP

Table 5.5 shows the acceleration result of the fixed-point code. The result is based on the C6416 simulator with the L2 cache enabled. We notice that the execution cycles are significantly reduced.

Table 5.5: Performance of the fixed-point code

| | Floating-point (original, cycles) | Fixed-point (cycles) | Reduction ratio (%) |
|---|---|---|---|
| Total cycles | 7,046,624,226 | 47,155,001 | 99.33 |
| Initialization | 553,634,278 | 256,399 | 99.95 |
| QMF Analysis | 4,315,174,856 | 20,966,993 | 99.51 |
| Hybrid Analysis | 1,137,277,617 | 14,947,777 | 98.69 |
| Ottbox | 66,845,629 | 1,352,341 | 97.98 |
| Hybrid Synthesis | 525,237 | 294,197 | 43.99 |
| QMF Synthesis | 947,689,855 | 6,136,115 | 99.35 |
| others | 25,476,754 | 3,201,179 | 87.43 |
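The reduction ratios in this table and in the later acceleration tables are consistent with the usual definition, one minus the ratio of new to original cycles. A small check on two rows of Table 5.5 (the helper name is ours):

```python
def reduction_ratio(original_cycles: int, accelerated_cycles: int) -> float:
    """Percentage of cycles saved relative to the original implementation."""
    return 100.0 * (1.0 - accelerated_cycles / original_cycles)

# (floating-point cycles, fixed-point cycles) copied from Table 5.5.
rows = {
    "Total cycles": (7_046_624_226, 47_155_001),
    "Hybrid Synthesis": (525_237, 294_197),
}
for name, (orig, fixed) in rows.items():
    print(f"{name}: {reduction_ratio(orig, fixed):.2f}%")
```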

The fixed-point conversion procedure introduces distortion due to quantization error, so we compare the converted codes with the original floating-point ones and measure the data precision loss in signal-to-noise ratio (SNR).

Table 5.6: SNR due to fixed-point conversion

| | SNR |
|---|---|
| Analysis Filterbank | 65.25 dB |
| Synthesis Filterbank | 65.04 dB |
| Overall | 50.99 dB |
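For reference, the SNR between a floating-point signal and its fixed-point counterpart can be measured as the ratio of the signal energy to the energy of the difference. The sketch below uses this common definition; the exact measurement procedure behind Table 5.6 is not given in the text, so the function and the test signal are illustrative assumptions only.

```python
import math

def snr_db(reference, test):
    """SNR in dB of `test` against `reference` (sequences of equal length)."""
    signal = sum(r * r for r in reference)
    noise = sum((r - t) ** 2 for r, t in zip(reference, test))
    if noise == 0.0:
        return math.inf                      # identical signals
    return 10.0 * math.log10(signal / noise)

# Illustrative signal: a sine wave quantized to 15 fractional bits.
ref = [0.5 * math.sin(2.0 * math.pi * k / 64.0) for k in range(1024)]
quantized = [round(v * 32768) / 32768 for v in ref]
print(f"{snr_db(ref, quantized):.1f} dB")
```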

We now profile the fixed-point version of the MPEG Surround reference encoder again. The result is shown in Figure 5.4. We notice that the QMF banks and the analysis hybrid filterbank are the dominant parts of the MPEG Surround encoder. We propose several complexity reduction techniques in the following sections. Our goal is to reduce the codec complexity without degrading the output audio quality.

Figure 5.4: Profiling of the MPEG Surround encoder on the C6416 simulator (cache enabled) (chart segments: QMF Analysis, Hybrid Analysis, QMF Synthesis, Initialization, Ottbox, Hybrid Synthesis, others)

5.5 Fast QMF Bank Algorithms

The QMF analysis and synthesis procedures account for most of the DSP processing time (about 73%) in the MPEG Surround encoder, so we speed up this part to improve the overall system performance.

5.5.1 Problem Definition

As mentioned in Section 3.2.4, the matrix operation in the analysis QMF is defined as:

On the other hand, the matrix operation in the synthesis QMF is:

128 performing a 64-channel synthesis QMF. The computational complexity is huge, so we need fast algorithms for QMF to reduce the complexity.

5.5.2 Analysis Quadrature Mirror (AQMF) Bank

First, we review the type-IV Discrete Cosine Transform (DCT) and Discrete Sine Transform (DST):

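DCT-IV: $C_k^{IV} = \sum_{r=0}^{N-1} x_r \cos\left(\frac{\pi(2k+1)(2r+1)}{4N}\right), \quad k = 0, 1, \ldots, N-1$ (5.3)

DST-IV: $S_k^{IV} = \sum_{r=0}^{N-1} x_r \sin\left(\frac{\pi(2k+1)(2r+1)}{4N}\right), \quad k = 0, 1, \ldots, N-1$ (5.4)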

Second, we separate (5.1) into its real and imaginary parts, as shown in (5.5).

Consider the real part only:

In (5.8), there exist symmetrical relationships between the cosine terms:

Applying (5.9) and (5.10) to (5.8) yields:

Comparing (5.11) with (5.3), and denoting

(5.11) can be rewritten as:

Thus, the real part of the QMF coefficient can be computed by applying a 64-point DCT-IV to $u_r(n)$.

Similarly, by denoting

the imaginary part of the QMF coefficient can be computed by applying the DST-IV:

The signal flow graphs of the decomposition are shown in Figures 5.5 and 5.6, respectively.

Figure 5.5: Signal flow graph of the fast analysis QMF (real part), where the dotted lines denote subtract operations.

Figure 5.6: Signal flow graph of the fast analysis QMF (imaginary part), where the dotted lines denote subtract operations.

5.5.3 N/2-point FFT Algorithm for DCT-IV and DST-IV

Many algorithms have been published for calculating the DCT-IV and DST-IV with a reduced number of multiplications or of total arithmetic operations [26][27]. However, these proposals may not map well onto a regular but restricted architecture such as a DSP. In [28], R. Gluth proposes a realization of these transforms based on a standard FFT. Since optimized FFT programs for the TI C6x DSP are available, we adopt this efficient FFT-based algorithm to improve the performance of the DCT-IV and DST-IV.

We will describe the forward operation steps of DCT-IV and DST-IV here, and the derivation of this algorithm can be found in Appendix A. The computational steps for DCT-IV are as follows.

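1. Compute the complex sequence $u_r = x_{2r} + j\,x_{N-1-2r}$.

2. Multiply $u_r$ by the pre-twiddle factor.

3. Perform the N/2-point complex FFT.

4. Multiply the FFT output by the post-twiddle factor to obtain $U_k$.

5. Finally, the coefficients of the DCT-IV output are derived from $U_k$.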

Similarly, the sequence u_r for calculation of the DST-IV is )

We summarize the FFT-based DCT-IV and DST-IV algorithms by the flow diagram shown in Figure 5.7.

Figure 5.7: Block diagram of the fast DCT-IV and DST-IV [28] (signal labels: $x_r$, $u_r$, $U_k$, $C_k^{IV}$, $S_k^{IV}$)
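The steps of the FFT-based DCT-IV can be sketched as follows. This is a minimal sketch, not the thesis implementation: it assumes the unscaled DCT-IV with kernel cos(pi(2k+1)(2r+1)/(4N)), it splits the twiddle symmetrically as exp(-j*pi*(8r+1)/(8N)) before and exp(-j*pi*(8k+1)/(8N)) after the FFT (one of several equivalent conventions), and it uses a plain recursive radix-2 FFT as a stand-in for an optimized library routine. The result is compared against a direct O(N^2) evaluation.

```python
import cmath
import math

def dct4_direct(x):
    """Reference O(N^2) DCT-IV: C_k = sum_r x_r cos(pi (2k+1)(2r+1) / (4N))."""
    n = len(x)
    return [sum(x[r] * math.cos(math.pi * (2 * k + 1) * (2 * r + 1) / (4 * n))
                for r in range(n)) for k in range(n)]

def fft(a):
    """Recursive radix-2 FFT; the length must be a power of two."""
    n = len(a)
    if n == 1:
        return list(a)
    even, odd = fft(a[0::2]), fft(a[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def dct4_fft(x):
    """DCT-IV of length N computed with one N/2-point complex FFT."""
    n = len(x)
    half = n // 2
    # Step 1: pair even-indexed samples with reversed odd-indexed samples.
    u = [complex(x[2 * r], x[n - 1 - 2 * r]) for r in range(half)]
    # Step 2: pre-twiddle (assumed symmetric split of the phase term).
    u = [u[r] * cmath.exp(-1j * math.pi * (8 * r + 1) / (8 * n)) for r in range(half)]
    # Step 3: N/2-point complex FFT.
    big_u = fft(u)
    # Step 4: post-twiddle.
    big_u = [big_u[k] * cmath.exp(-1j * math.pi * (8 * k + 1) / (8 * n))
             for k in range(half)]
    # Step 5: even outputs from the real part, reversed odd outputs from -imag.
    c = [0.0] * n
    for k in range(half):
        c[2 * k] = big_u[k].real
        c[n - 1 - 2 * k] = -big_u[k].imag
    return c

x = [math.sin(0.3 * i) + 0.01 * i for i in range(64)]
err = max(abs(a - b) for a, b in zip(dct4_direct(x), dct4_fft(x)))
print(f"max abs difference: {err:.2e}")
```

For the 64-point transforms of the analysis QMF, this replaces a 64x64 matrix product with a 32-point complex FFT plus two sets of 32 complex multiplications.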

5.5.4 Using DSP Library

The TI C64x Digital Signal Processor Library (DSPLIB) is an optimized DSP function library for C programmers using TMS320C64x devices [30]. It includes many C-callable, assembly-optimized, general-purpose signal-processing routines. With these routines, execution is considerably faster than with equivalent code written in C.

There are several FFT/IFFT functions available in the DSPLIB. They can be used to replace the C-coded FFT functions in the fast DCT/DST algorithm. The steps for using the DSPLIB under Code Composer Studio (CCS) are as follows.

1. Link the DSP Library to the application. Ex: \CCStudio_v3.1\c6400\dsplib\lib\dsp64x.lib

2. Include the appropriate header file. Ex: dsp_fft16x16r.h

3. Follow the DSP Library API to use the kernel.

Figure 5.8: TI Complex FFT library API [30]

We compare the 32-point complex FFT in the DSPLIB with the C-compiled FFT function. Table 5.7 shows the results. The DSPLIB FFT is about 3.8 times faster than the C-compiled FFT.

Table 5.7: Comparison of the C-compiled FFT and the DSPLIB FFT

| | Clock cycles | Code size (bytes) |
|---|---|---|
| C-compiled FFT | 740 | 1804 |
| DSPLIB FFT | 196 | 856 |

5.5.5 QMF Bank Acceleration Results

We have modified and accelerated the AQMF bank in the fixed-point version of the MPEG Surround encoder. In addition, the synthesis QMF (SQMF) bank can be accelerated in a similar way. Tables 5.8 and 5.9 show the final results. We notice that the reduction ratios of execution cycles are 78% and 73% in the AQMF bank and the SQMF bank, respectively.

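Table 5.8: The acceleration result of the AQMF bank

| Test Sequence | Original AQMF bank (cycles) | Accelerated AQMF bank (cycles) | Reduction ratio (%) |
|---|---|---|---|
| Choir | 20,966,993 | 4,343,490 | 79.28 |
| ts30 | 20,952,076 | 4,421,292 | 78.90 |
| approaching_tunnel | 20,972,479 | 4,442,427 | 78.82 |
| carneval | 21,002,679 | 4,472,220 | 78.71 |
| fountain_music | 20,989,067 | 4,438,179 | 78.85 |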

Table 5.9: The acceleration result of the SQMF bank

| Test Sequence | Original SQMF bank (cycles) | Accelerated SQMF bank (cycles) | Reduction ratio (%) |
|---|---|---|---|
| Choir | 6,136,115 | 1,665,333 | 72.86 |
| ts30 | 6,134,446 | 1,628,339 | 73.46 |
| approaching_tunnel | 6,134,567 | 1,628,460 | 73.45 |
| carneval | 6,131,875 | 1,625,768 | 73.49 |
| fountain_music | 6,130,129 | 1,624,012 | 73.51 |

5.6 Fast Hybrid Filterbank Algorithms

The lowest three QMF subbands are fed through a second complex-modulated filterbank to further enhance the low frequency resolution. The sub-filterbank has a filter length of L =12, and the analysis filter for QMF sub-band p is given by:

]

where g^p[m] is the prototype filter, Q^p, the number of frequency bands, and

The sub-filtering is time-consuming and complex because the sub-filterbank outputs are not down-sampled. However, it can also be implemented using DFT algorithms.

5.6.1 Fast Analysis Hybrid Filterbank

The sub-filterbank splits the 0th QMF band coefficients into 8 sub-subbands, and it splits the 1st and 2nd QMF band coefficients into 2 sub-subbands each. We derive the fast algorithm for band 0 only, because the others are considerably simpler. For simplicity, the QMF band index p is omitted, and (5.16) yields:



Change the index expression:

With (5.17), this yields:

Equation (5.20) can be viewed as a polyphase decomposition of g[n]. Expanding the exponential term in (5.19):

Comparing (5.21) with the 8-point complex FFT, the above equations show that the hybrid analysis filterbank can be calculated by the following steps:

1. Perform the polyphase filtering to compute $a_{m_2}$ by equation (5.20).

2. Circularly shift the indices of $a_{m_2}$ by 6: $a'_{m_2} = a_{(m_2+6)\,\%\,8}$

3. Multiply $a'_{m_2}$ by the twiddle factor to obtain $b$.

4. Perform the 8-point complex FFT: $S_q = \mathrm{FFT}\{b\}$

The flow diagram of the analysis hybrid filterbank is shown in Figure 5.9. Again, the clock cycles of FFT can be further reduced by using the DSPLIB.

Figure 5.9: Signal flow graph of the fast analysis hybrid filterbank
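The four steps can be checked numerically with the sketch below. Because the filter definition is not recoverable here, the sketch assumes the modulated sub-filter g[m]*exp(j*2*pi*(q+0.5)*(m-6)/8), the form used by the parametric-stereo hybrid filterbank, and a placeholder 12-tap prototype; both are assumptions, as are the function names. Under this assumption the final transform has a positive exponent; with the opposite sign convention in the filter definition it becomes a forward FFT.

```python
import cmath
import math

Q = 8  # number of sub-subbands for QMF band 0

# Placeholder prototype filter (hypothetical values, not the standard's table).
G_PROTO = [0.01, 0.02, 0.04, 0.07, 0.10, 0.12, 0.125, 0.12, 0.10, 0.07, 0.04, 0.02]

def hybrid_direct(taps, g):
    """Direct evaluation; taps[m] holds the QMF sample x[n - m]."""
    return [sum(g[m] * taps[m] * cmath.exp(2j * math.pi * (q + 0.5) * (m - 6) / Q)
                for m in range(len(g))) for q in range(Q)]

def hybrid_fast(taps, g):
    """Polyphase filtering, circular shift, twiddle, then an 8-point DFT."""
    # Step 1: polyphase filtering with m = 8*m1 + m2; the modulation
    # contributes a factor (-1)**m1 for each block of 8 taps.
    a = [0j] * Q
    for m in range(len(g)):
        m1, m2 = divmod(m, Q)
        a[m2] += (-1) ** m1 * g[m] * taps[m]
    # Step 2: circular shift of the indices by 6.
    a_shift = [a[(r + 6) % Q] for r in range(Q)]
    # Step 3: twiddle for the half-band offset (includes the wrap-around sign).
    b = [a_shift[r] * cmath.exp(1j * math.pi * (((r + 6) % Q) - 6) / Q)
         for r in range(Q)]
    # Step 4: 8-point DFT (written out directly; an FFT routine would be used).
    return [sum(b[r] * cmath.exp(2j * math.pi * q * r / Q) for r in range(Q))
            for q in range(Q)]

taps = [complex(math.cos(0.7 * m), math.sin(1.3 * m)) for m in range(len(G_PROTO))]
diff = max(abs(d - f) for d, f in zip(hybrid_direct(taps, G_PROTO),
                                      hybrid_fast(taps, G_PROTO)))
print(f"max abs difference: {diff:.2e}")
```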

5.6.2 Analysis Hybrid Filterbank Implementation Results

Table 5.10 shows the implementation results of the Analysis Hybrid filterbank. By using fast algorithms, the reduction ratio of the execution cycles is about 77% in the analysis hybrid filterbank.

Table 5.10: The acceleration result of the analysis hybrid filterbank

| Test Sequence | Original Analysis Hybrid FB (cycles) | Accelerated Analysis Hybrid FB (cycles) | Reduction ratio (%) |
|---|---|---|---|
| Choir | 14,947,777 | 3,404,386 | 77.22 |
| ts30 | 14,940,437 | 3,405,809 | 77.20 |
| approaching_tunnel | 14,946,461 | 3,402,930 | 77.23 |
| carneval | 14,940,967 | 3,397,464 | 77.26 |
| fountain_music | 14,948,398 | 3,405,428 | 77.22 |

5.7 LFE Channel Acceleration

In a standard 5.1-channel audio playback system, the ‘.1’ channel refers to a band-limited Low Frequency Effects (LFE) channel. In contrast to the main channels, the LFE channel delivers bass-only information and has no direct effect on the perceived directionality of the reproduced soundtrack. Figures 5.10 and 5.11 show the spectra of the front-left channel and the LFE channel, respectively. We notice that the front-left channel signal covers almost the full bandwidth, while the LFE channel signal only occupies frequencies below 500 Hz. Hence, to accelerate the encoder, we simplify the calculations associated with the LFE channel.

Figure 5.10: The spectrum of the front-left channel signal

Figure 5.11: The spectrum of the LFE channel signal

We analyze the output signal of the analysis QMF bank and find that the 0th QMF subband contains most of the energy of the LFE channel. We therefore adopt a simplified QMF bank for the LFE channel, in which only the output signals in the lowest band are computed. The outputs in the 0th band are calculated by simply applying an FIR filter with the QMF coefficients. The output signals of the other subbands (1st to 63rd) are set to zero.

In the analysis hybrid filterbank, the frequency resolution of the lowest three QMF bands is increased. For the LFE channel, only the 0th band contains information, so the sub-filters of the 1st and 2nd bands can be removed. Figure 5.12 shows the structure of the simplified analysis filterbanks for the LFE channel. Compared to the main channels, it only calculates the output coefficients corresponding to QMF band 0.

Figure 5.12: Simplified analysis filterbanks for the LFE channel
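The simplified LFE analysis described above can be sketched as a single FIR filter for band 0, with all other bands set to zero. The band-0 coefficients `h0` stand for the prototype filter modulated for k = 0; their values here are hypothetical, as are the function and parameter names.

```python
def lfe_qmf_slot(window, h0, num_bands=64):
    """One time slot of the simplified LFE analysis QMF.

    window[m] is the input sample x[n - m]; h0[m] is the (complex) band-0
    filter coefficient. Only band 0 is computed; bands 1..63 stay zero.
    """
    out = [0j] * num_bands
    out[0] = sum(h * s for h, s in zip(h0, window))
    return out

# Hypothetical 4-tap example, only to show the calling convention.
h0 = [complex(0.1 * m, -0.05 * m) for m in range(4)]
window = [1.0, 2.0, 3.0, 4.0]
slot = lfe_qmf_slot(window, h0)
print(slot[0], all(v == 0 for v in slot[1:]))
```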

Table 5.11 shows the acceleration results for the LFE channel. The execution cycles in the filterbanks of the LFE channel are reduced by about 38%.

Table 5.11: Acceleration result of the LFE channel

| Test Sequence | Original LFE (cycles) | Accelerated LFE (cycles) | Reduction ratio (%) |
|---|---|---|---|
| Choir | 1,291,313 | 804,800 | 37.68 |
| ts30 | 1,290,539 | 817,511 | 36.65 |
| approaching_tunnel | 1,292,189 | 799,473 | 38.13 |
| carneval | 1,296,264 | 799,474 | 38.32 |
| fountain_music | 1,295,359 | 799,475 | 38.28 |

5.8 Implementation of MPEG Surround-HEAAC Encoder

In previous work, Huang implemented the HEAAC encoder on the C64x DSP platform. The HEAAC source code was originally provided by 3GPP [31]. He adopted several acceleration methods to speed up the calculations on the DSP [32]. We extend Huang’s work by combining the MPEG Surround encoder with the HEAAC encoder. The combined MPEG Surround-HEAAC encoder is able to compress 5.1-channel audio at low bitrates.

Figure 5.13 shows how the MPEG Surround encoder and the HEAAC encoder are interconnected. The AQMF bank, downsampling filter, AAC, and SBR tools are the modules belonging to the HEAAC encoder (in the dotted box), and the others belong to the MPEG Surround encoder. In the MPEG Surround encoder, the downmixed signal is transformed back to the time domain and fed into the HEAAC encoder. We give a brief introduction to the HEAAC encoder below.

Figure 5.13: Interconnection of the MPEG Surround-HEAAC encoder

5.8.1 HEAAC Encoder

MPEG-4 HEAAC is a combination of the Spectral Band Replication (SBR) tool and the MPEG-2 AAC LC profile. The SBR system is used as a dual-rate system. The AAC encoder processes only the low frequency part of the audio signal, which it obtains with a downsampling filter. The SBR encoder operates at the original sampling rate. It is based on the fact that there are usually high correlations between the lower and higher frequency parts of audio signals, so those correlations are coded as side information and carried in the bitstream. The extraction of the SBR side information is performed in the 64-channel QMF domain, the same domain as used by MPEG Surround. The details of MPEG-2 AAC and SBR can be found in [16] and [17], respectively.

5.8.2 Complexity Analysis

First, we connect the accelerated MPEG Surround encoder with the HEAAC encoder and profile the main modules of the MPEG Surround-HEAAC encoder on the C6416 simulator with the L2 cache enabled. The result is shown in Table 5.12. The MPEG Surround modules account for about 11% of the execution cycles of the overall system.

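Table 5.12: Profiling of the MPEG Surround-HEAAC encoder

| Encoder | Module | Cycles | Percentage |
|---|---|---|---|
| | Total | 91,943,427 | 100% |
| MPEG Surround Encoder (Accelerated) | QMF Analysis | 4,012,967 | 4.36% |
| MPEG Surround Encoder (Accelerated) | Hybrid Analysis | 3,173,968 | 3.45% |
| MPEG Surround Encoder (Accelerated) | Ottbox | 1,357,891 | 1.48% |
| MPEG Surround Encoder (Accelerated) | Hybrid Synthesis | 302,233 | 0.33% |
| MPEG Surround Encoder (Accelerated) | QMF Synthesis | 1,604,333 | 1.74% |
| HEAAC Encoder | Down-sampling Filter | 20,513,525 | 22.31% |
| HEAAC Encoder | QMF Analysis | 19,915,569 | 21.66% |
| HEAAC Encoder | SBR | 19,741,829 | 21.47% |
| HEAAC Encoder | AAC | 17,745,833 | 19.30% |
| | others | 3,575,279 | 3.89% |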

5.8.3 Simplified MPEG Surround-HEAAC Structure

We simplify the structure of the original MPEG Surround-HEAAC encoder. The simplified structure is shown in Figure 5.14. Since SBR is applied in the QMF domain, it is unnecessary to transform the downmixed signal back to the time domain. Instead, we remove the SQMF and AQMF procedures before the SBR module. Furthermore, we combine the SQMF bank and the downsampling filter into a downsampled SQMF bank, so the computational complexity can be further reduced. We describe the downsampled SQMF bank below.

Figure 5.14: The structure of the simplified MPEG Surround-HEAAC encoder

The downsampled synthesis filtering is achieved using a 32-channel QMF bank. The output from the filterbank is real-valued time domain samples. The process comprises the following steps, where an array v consisting of 640 samples is assumed.

1. Shift the samples in the array v by 64 positions. The oldest 64 samples are discarded.

2. The 32 new complex-valued subband samples are multiplied by the matrix N, where

$$N(k,n) = \frac{1}{64}\exp\left(\frac{i\pi\,(k+0.5)(2n-127)}{64}\right), \quad 0 \le k < 32,\; 0 \le n < 64.$$

In the equation, exp() denotes the complex exponential function and i is the imaginary unit. The real part of the output from this operation is stored in the positions 0 to 63 of array v.

3. Extract samples from v according to
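Steps 1 and 2 of the procedure can be sketched as follows. The matrix entries are assumed to be N(k,n) = (1/64)*exp(i*pi*(k+0.5)*(2n-127)/64) for 0 <= k < 32 and 0 <= n < 64, the form used by the SBR downsampled synthesis filterbank; the function name is ours, and the remaining steps (sample extraction, windowing, summation) are not shown.

```python
import cmath
import math

def downsampled_synthesis_update(v, subband):
    """Apply steps 1 and 2 to the state array v (640 real samples).

    `subband` holds the 32 new complex-valued subband samples.
    Returns the updated state; the input list is left unchanged.
    """
    if len(v) != 640 or len(subband) != 32:
        raise ValueError("expected 640 state samples and 32 subband samples")
    # Step 1: shift by 64 positions; the oldest 64 samples fall off the end.
    v = [0.0] * 64 + v[:-64]
    # Step 2: multiply by the matrix N and keep the real part in v[0..63].
    for n in range(64):
        acc = sum(subband[k] * cmath.exp(1j * math.pi * (k + 0.5) * (2 * n - 127) / 64)
                  for k in range(32)) / 64
        v[n] = acc.real
    return v

state = [float(i) for i in range(640)]
new_state = downsampled_synthesis_update(state, [1 + 0j] + [0j] * 31)
print(len(new_state), new_state[64] == state[0])
```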
