Chapter 4 MPEG-4 AAC Decoder Implementation and Optimization on DSP

Table 4.10 DSP implementation results of IMDCT

Fig. 4.11 TI IFFT library [7]

We then compare the modified IMDCT with the IMDCT built on the TI IFFT library shown in Fig. 4.11. Table 4.11 shows the comparison. The modified IMDCT reaches about 81% of the performance of the IMDCT that uses the TI IFFT library.

Table 4.11 Comparison of the modified IMDCT and the IMDCT with TI IFFT library

4.5 Implementation on DSP

We have implemented and optimized the MPEG-4 AAC decoder on the TI C64x DSP. The optimized result is shown in Table 4.12. We test several sequences on the modified MPEG-4 AAC decoder using the ODG (objective difference grade) defined by ITU-R BS.1387 PEAQ (perceptual evaluation of audio quality). The first test sequence is “guitar”; it has sound variations and is more complex. The second test sequence is “eddie_rabbitt”; it is pop music with a human voice. The test results are shown in Tables 4.13 and 4.14, where (a) denotes the original floating-point version and (b) the modified integer version. The quality seems acceptable at data rates from 32 kbps to 96 kbps. Finally, the overall speed is 2.73 times faster than the original architecture. Note that the IMDCT part requires 1/14 of the original computation, and the result is shown in Table 4.14.

Table 4.12 Comparison of original and the optimized performance

Table 4.13 The ODG of test sequence “guitar”

Table 4.14 The ODG of test sequence “eddie_rabbitt”

Chapter 5 MPEG-4 AAC Decoder Implementation and Optimization on DSP/FPGA

In the last chapter, we described the implementation and optimization of the MPEG-4 AAC decoder on DSP. In this chapter, we move some of the MPEG-4 AAC tools to FPGA to enhance the performance. According to the profiling statistics, Huffman decoding and the IMDCT are the heaviest tools for DSP processing, so we implement these tools on the FPGA.

5.1 Huffman Decoding

In this section, we describe the implementation and optimization of Huffman decoding on FPGA. We implement two different Huffman decoder architectures and compare the results.

5.1.1 Integration Consideration

In the MPEG-4 AAC decoder, the Huffman decoder receives a series of bits, ranging from 1 bit to 19 bits, from the input bitstream. It uses these bits to search for the matching pattern in the code index table and then returns a code index and a length. The code index ranges from 0 to 120, and we use this value to find the codeword in the codeword table. Fig. 5.1 shows the flow diagram of the MPEG-4 AAC Huffman decoding process.

Fig. 5.1 Flow diagram of MPEG-4 AAC Huffman decoding
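The two-step lookup described above can be sketched as follows. This is a minimal illustration, not the decoder of this work: the tiny prefix code, the decoded value pairs, and the function names are all hypothetical stand-ins for the real AAC tables, and only the 19-bit maximum length comes from the text.

```python
# Toy prefix code standing in for the AAC code index table (hypothetical values).
# Keys are bit patterns, values are code indices.
CODE_INDEX_TABLE = {"0": 0, "10": 1, "110": 2, "111": 3}

# Toy codeword table: maps a code index to decoded values (hypothetical values).
CODEWORD_TABLE = {0: (0, 0), 1: (1, 0), 2: (0, 1), 3: (1, 1)}

MAX_CODE_LENGTH = 19  # longest pattern length stated in the text


def decode_symbol(bits, pos):
    """Step 1: find the pattern starting at bits[pos]; return (code_index, length)."""
    for length in range(1, MAX_CODE_LENGTH + 1):
        candidate = bits[pos:pos + length]
        if len(candidate) < length:
            break  # ran out of input bits
        if candidate in CODE_INDEX_TABLE:
            return CODE_INDEX_TABLE[candidate], length
    raise ValueError("no matching pattern at bit %d" % pos)


def decode_bitstream(bits):
    """Decode a string of '0'/'1' characters into a list of codewords."""
    pos = 0
    decoded = []
    while pos < len(bits):
        index, length = decode_symbol(bits, pos)
        decoded.append(CODEWORD_TABLE[index])  # Step 2: fixed-length lookup
        pos += length  # advance by the variable symbol length
    return decoded
```

The first step is the variable-length search and the second is a plain array-style lookup, which is why the two steps have such different costs on a DSP.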

As we can see, the length of a symbol in the bitstream varies from 1 bit to 19 bits. The code index ranges from 0 to 120, so its length is fixed at 7 bits. A DSP is not well suited to variable-length data processing, because it needs many extra clock cycles to find the correct length. Hence, we partition the MPEG-4 AAC Huffman decoder between DSP and FPGA. The patterns in the code index table have variable length, so we put this table on the FPGA; the patterns in the codeword table have fixed length, so we put that table on the DSP. Fig. 5.2 shows the scheme of the DSP/FPGA integrated Huffman decoding.

Fig. 5.2 Block diagram of DSP/FPGA integrated Huffman decoding

5.1.2 Fixed-output-rate Architecture

We put the code index table on the FPGA and implement the fixed-output-rate Huffman decoder architecture there. To enhance the Huffman decoding performance substantially, we have to implement a parallel model on the FPGA. This architecture outputs one code index per clock cycle continuously.

We designed the code index table with the necessary control signals; Fig. 5.3 shows the block diagram. Because the code index ranges from 0 to 120, we use 7 bits to represent it. To let the DSP fetch the code index easily, we put one “0” bit between two adjacent code indices in the output buffer. Fig. 5.4 shows the output buffer diagram. In this way, the DSP can fetch each code index as a “char” datatype.

Fig. 5.3 Block diagram of fixed-output-rate architecture

Fig. 5.4 Output Buffer of code index table
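The zero-padding idea can be illustrated with a short sketch. The 32-bit word width, the four slots per word, the slot order, and the function names are assumptions made for this example; the text only states that one “0” bit separates adjacent 7-bit code indices so that each index occupies one byte.

```python
SLOTS_PER_WORD = 4  # assumption: a 32-bit output word holding four code indices


def pack_output_word(indices):
    """Pack up to four 7-bit code indices into one word, one index per byte."""
    if len(indices) > SLOTS_PER_WORD:
        raise ValueError("too many code indices for one word")
    word = 0
    for slot, index in enumerate(indices):
        if not 0 <= index <= 120:
            raise ValueError("code index out of range: %d" % index)
        # The index uses bits 0..6 of its byte, so bit 7 (the padding bit) stays 0.
        word |= (index & 0x7F) << (8 * slot)
    return word


def read_as_chars(word, count=SLOTS_PER_WORD):
    """Read the word back byte by byte, as a DSP would with a char pointer."""
    return [(word >> (8 * slot)) & 0xFF for slot in range(count)]
```

Because every index is byte-aligned, the reader needs no shifting or masking across byte boundaries, which is the convenience the padding bit buys at the cost of one unused bit per index.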

The architecture needs some control signals between DSP and FPGA. When the DSP asserts the “input_valid” signal, the “input_data” is valid. When the FPGA receives the “input_valid” signal and is not busy, it responds with the “input_res” signal, meaning that the FPGA has received the input data successfully. When the FPGA is busy, it does not send the “input_res” signal, meaning that the FPGA has not received the input data.

Fig 5.5 Waveform of the fixed-output-rate architecture

5.1.3 Fixed-output-rate Architecture Implementation Result

Fig. 5.6 and Fig. 5.7 show the Xilinx ISE 6.1 synthesis and P&R (place and route) reports. The P&R report shows that the clock period can reach 5.800 ns (172.4 MHz). The input register adds one clock cycle of latency, so we can retrieve about 156.7 M code indices per second. We use a test sequence of 38 frames that contains 13188 code indices. The comparison of the DSP implementation and the FPGA implementation is shown in Table 5.1.
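As a quick check of the figures quoted above, the following sketch converts a clock period to a frequency and estimates how long the FPGA would need for the test sequence at the stated rate. The function names are hypothetical, and the estimate ignores DSP/FPGA transfer overhead, which the text does not quantify.

```python
def period_ns_to_mhz(period_ns):
    """Convert a clock period in nanoseconds to a frequency in MHz."""
    return 1000.0 / period_ns


def decode_time_us(num_indices, rate_mega_per_s):
    """Time in microseconds to produce num_indices at rate_mega_per_s (M indices/s)."""
    return num_indices / rate_mega_per_s


print("%.1f MHz" % period_ns_to_mhz(5.800))        # P&R clock period from the text
print("%.2f us" % decode_time_us(13188, 156.7))    # 13188 code indices at 156.7 M/s
```

At the stated rate, the 13188 code indices of the test sequence take roughly 84 microseconds of FPGA time.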

Fig. 5.6 Synthesis report of the fixed-output-rate architecture

Timing Summary:
Speed Grade: -6
Minimum period: 9.181ns (Maximum Frequency: 108.918MHz)
Minimum input arrival time before clock: 4.812ns
Maximum output required time after clock: 4.795ns
Maximum combinational path delay: No path found

Device utilization summary:
Selected Device : 2v2000ff896-6
Number of Slices: 820 out of 10752 7%
Number of Slice Flip Flops: 379 out of 21504 1%
Number of 4 input LUTs: 1558 out of 21504 7%
Number of bonded IOBs: 284 out of 624 45%
Number of GCLKs: 1 out of 16 6%

Fig. 5.7 P&R report of the fixed-output-rate architecture

Timing Summary:
Speed Grade: -6

Device utilization summary:
Number of External IOBs 285 out of 624 45%
Number of LOCed External IOBs 0 out of 285 0%
Number of SLICEs 830 out of 10752 7%
Number of BUFGMUXs 1 out of 16 6%

Table 5.1 Performance comparison of DSP and FPGA implementation

5.1.4 Variable-output-rate Architecture

The fixed-output-rate Huffman decoder is limited by the speed of searching for the matched pattern [12]. We can further split the code index table into several small tables to reduce the number of comparison operations in one clock cycle. In this way, we shorten the time needed to process short symbols, while long symbols need more than one clock cycle.

Fig. 5.8 Block diagram of the variable-output-rate architecture

Fig. 5.9 shows that the waveform and the external control signals between DSP and FPGA are the same as those of the fixed-output-rate architecture. The difference between the two architectures is that the internal control signals of the variable-output-rate architecture are more complex, and that the variable-output-rate architecture may need more than one clock cycle to produce a result.

Fig 5.9 Comparison of the waveform of the two architectures
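The idea of splitting the code index table can be sketched as a search over sub-tables, one sub-table per clock cycle. The toy prefix code, the grouping by pattern length, and the names below are hypothetical; the actual partition used in this design is not given in the text.

```python
# Hypothetical split of a toy code index table into small tables by pattern length.
SUB_TABLES = [
    {"0": 0, "10": 1},           # searched in cycle 1 (short symbols)
    {"110": 2, "1110": 3},       # searched in cycle 2
    {"11110": 4, "11111": 5},    # searched in cycle 3 (long symbols)
]


def decode_symbol(bits, pos):
    """Return (code_index, length, cycles) for the symbol starting at bits[pos]."""
    for cycle, table in enumerate(SUB_TABLES, start=1):
        for pattern, index in table.items():
            if bits.startswith(pattern, pos):
                return index, len(pattern), cycle
    raise ValueError("no matching pattern at bit %d" % pos)


def decode_all(bits):
    """Decode the whole bit string; return (code indices, total cycles spent)."""
    pos, indices, total_cycles = 0, [], 0
    while pos < len(bits):
        index, length, cycles = decode_symbol(bits, pos)
        indices.append(index)
        total_cycles += cycles
        pos += length
    return indices, total_cycles
```

In this model a short symbol costs one cycle and a long one costs up to three, so the average cost depends on how often short symbols occur. Each cycle compares against fewer patterns, which is what is meant to shorten the critical path.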

5.1.5 Variable-output-rate Architecture Implementation Result

Fig. 5.10 and Fig. 5.11 show the synthesis report and the P&R report. The clock rate is lower than that of the fixed-output-rate architecture. The implementation of the control signals may be constrained by the FPGA cells. When the control signals of an FPGA design are too complex, the controller may become the bottleneck of the FPGA system.

Fig. 5.10 Synthesis report for the variable-output-rate architecture

Timing Summary:
Speed Grade: -6
Minimum period: 10.132ns (Maximum Frequency: 98.700MHz)
Minimum input arrival time before clock: 4.829ns
Maximum output required time after clock: 4.575ns
Maximum combinational path delay: No path found

Selected Device : 2v2000ff896-6
Number of Slices: 945 out of 10752 8%
Number of Slice Flip Flops: 402 out of 21504 1%
Number of 4 input LUTs: 1785 out of 21504 8%
Number of bonded IOBs: 284 out of 624 45%
Number of GCLKs: 1 out of 16 6%

Fig. 5.11 P&R report for the variable-output-rate architecture

Timing Summary:
Speed Grade: -6

Device utilization summary:
Number of External IOBs 285 out of 624 45%
Number of LOCed External IOBs 0 out of 285 0%
Number of SLICEs 989 out of 10752 9%
Number of BUFGMUXs 1 out of 16 6%

5.2 IFFT

Continuing the discussion in Chapter 4, we implement the IFFT on FPGA to enhance the performance of the IMDCT.

5.2.1 IFFT Architecture

We compare several FFT hardware architectures [18] in Table 5.2. SDF (single-path delay feedback) takes one complex input sample per clock cycle and stores it in a chain of DFFs (delay flip-flops) until the other source sample of the same butterfly in the data flow diagram arrives; the two samples are then processed together. MDC (multi-path delay commutator) takes two complex input samples per clock cycle, both being sources of the same butterfly in the data flow diagram. These two samples can be processed in one cycle, but this needs more hardware resources. In summary, the SDF architecture demands fewer registers and arithmetic function units, but the MDC architecture has less latency. We use the radix-2³ SDF architecture for our IFFT.

Table 5.2 Comparison of hardware requirements [18]

Because the data in the PE is always multiplied by a factor of $\sqrt{2}/2$, we can use several shifters and adders to replace the multiplier. First, consider the binary representation of $\sqrt{2}/2$:

$$\frac{\sqrt{2}}{2} = 0.7071 \approx (0.10110101)_2$$

If we set the “twiddle multiply factor” to 256, this value is represented in fixed-point datatype as “10110101.” We can then use five shifters and five adders to replace one multiplier, as the block diagram in Fig. 5.12 shows. In Table 5.2, “t” indicates that a simplified multiplier of this kind is used.

Fig. 5.12 Block diagram of shifter-adder multiplier
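A minimal software sketch of the shift-and-add idea follows. It uses the fixed-point constant 10110101 (decimal 181) with a scale of 256 as stated above; the function name and the final right shift that removes the scale are choices made for this example, and the sketch does not reproduce the exact shifter and adder arrangement of Fig. 5.12.

```python
def mul_sqrt2_over_2(x):
    """Approximate x * sqrt(2)/2 with shifts and adds only.

    181 = 0b10110101 = 128 + 32 + 16 + 4 + 1, so x * 181 is a sum of
    shifted copies of x; the final shift by 8 divides by the scale of 256.
    """
    acc = (x << 7) + (x << 5) + (x << 4) + (x << 2) + x
    return acc >> 8


print(mul_sqrt2_over_2(1000))  # close to 1000 * 0.7071
```

Each set bit of the constant contributes one shifted term, so a constant with few set bits is cheap to multiply by in hardware.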

5.2.2 Quantization Noise Analysis

First, we analyze the quantization noise caused by converting the datatype from floating-point to fixed-point. The original range of the twiddle factor is from -1 to 1, so we need to scale it up by the “twiddle multiplier” for integer representation. We also scale up the input data by the “scaling multiplier.” We then generate 1000 sets of random input data in the range from -5000 to 5000 and compute the output SNR of the IFFT. If an overflow occurs, the SNR drops drastically; therefore, we do not label the SNR for the overflow cases.

There are two main differences between the FFT and the IFFT. First, the twiddle factors are conjugated; second, the IFFT has to multiply by a factor of 1/N but the FFT does not. If we multiply by the 1/N factor at the last stage, the SNR is better, but the output data have fewer effective bits. So we split the 1/N factor across multiple stages, each of which multiplies by a factor of 1/2 only. Fig. 5.13 and Fig. 5.14 show the comparison of the noise analysis. As a result, we choose the “twiddle multiplier” to be 256 and the “scaling multiplier” to be 1.

Fig. 5.13 Quantization noise analysis for twiddle multiplier with scaling of 256
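The kind of experiment described in this section can be approximated in a few lines. The sketch below is a simplified stand-in, not the analysis behind Fig. 5.13: it uses a direct inverse DFT on a small transform size instead of the pipelined 512-point IFFT, models only the rounding of the twiddle factors (no overflow and no per-stage 1/2 scaling), and all names and default parameters are assumptions.

```python
import cmath
import math
import random


def idft(x, twiddle_scale=None):
    """Direct inverse DFT; optionally round twiddle factors to 1/twiddle_scale."""
    n = len(x)
    out = []
    for k in range(n):
        acc = 0j
        for i, xi in enumerate(x):
            w = cmath.exp(2j * cmath.pi * k * i / n)
            if twiddle_scale is not None:
                # Quantize real and imaginary parts as integers over the scale.
                w = complex(round(w.real * twiddle_scale),
                            round(w.imag * twiddle_scale)) / twiddle_scale
            acc += xi * w
        out.append(acc / n)
    return out


def snr_db(reference, test):
    """Signal-to-noise ratio in dB of test against reference."""
    signal = sum(abs(r) ** 2 for r in reference)
    noise = sum(abs(r - t) ** 2 for r, t in zip(reference, test))
    return float("inf") if noise == 0 else 10 * math.log10(signal / noise)


def average_snr_db(n=16, trials=20, twiddle_scale=256, seed=0):
    """Average SNR over random integer inputs in the range -5000..5000."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        x = [complex(rng.randint(-5000, 5000), rng.randint(-5000, 5000))
             for _ in range(n)]
        total += snr_db(idft(x), idft(x, twiddle_scale))
    return total / trials
```

Calling `average_snr_db(twiddle_scale=256)` and then a larger scale such as 4096 shows the expected trend that a finer twiddle representation gives a higher SNR; the absolute values from this toy model should not be compared with the figures in this chapter.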

5.2.3 Radix-2³ SDF IFFT Architecture

We use the radix-2³ SDF 512-point IFFT pipelined architecture shown in Fig. 5.15. The input data are fed into the IFFT sequentially, from the first sample to the last. Fig. 5.16 shows the computational work of each PE.

Fig. 5.15 Block diagram of radix-2³ SDF 512-point IFFT pipelined architecture

Fig. 5.16 Simplified data flow graph for each PE

PE1 has the architecture shown in Fig. 5.17. During the first N/4 clock cycles, PE1 passes the DFF output data to the PE1 output and feeds the input data to the DFF input. During the next N/4 clock cycles, PE1 multiplies the DFF output data by j, passes the product to the PE1 output, and feeds the input data to the DFF input. The multiplication by j can be replaced by exchanging the real part and the imaginary part of the data. During the last N/2 clock cycles, PE1 adds the DFF output data to the input data and passes the sum to the PE1 output, and it subtracts the DFF output data from the input data and feeds the difference to the DFF input.

Fig 5.17 Block diagram of the PE1
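The multiplier-free multiplication by j mentioned above can be sketched as follows. Since (a + jb) · j = -b + ja, the operation swaps the real and imaginary parts and negates one of them. The function name and the tuple representation of a complex sample are choices made for this example.

```python
def mul_j(sample):
    """Multiply a complex sample (re, im) by j without using a multiplier."""
    re, im = sample
    # (re + j*im) * j = -im + j*re: swap the parts and negate the new real part.
    return (-im, re)


print(mul_j((3, 4)))  # same result as complex(3, 4) * 1j
```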

PE2 has the architecture shown in Fig. 5.18. During the first N/8 clock cycles, PE2 passes the DFF output data to the PE2 output and feeds the input data to the DFF input. During the next N/8 clock cycles, PE2 multiplies the DFF output data by j, passes the product to the PE2 output, and feeds the input data to the DFF input. The multiplication by j can be replaced by exchanging the real part and the imaginary part of the data. During the third N/8 clock cycles, PE2 adds the DFF output data to the input data and passes the sum to the PE2 output, and it subtracts the DFF output data from the input data and feeds the difference to the DFF input. During the fourth N/8 clock cycles, PE2 adds the DFF output data and the input data, multiplies the sum by $\frac{\sqrt{2}}{2}(1+j)$, and passes the result to the PE2 output; it also subtracts the DFF output data from the input data and feeds the difference to the DFF input. During the fifth N/8 clock cycles, PE2 passes the DFF output data to the output and feeds the input data to the DFF input.

Fig 5.18 Block diagram of the PE2

PE3 has the architecture shown in Fig. 5.19. During the first N/8 clock cycles, PE3 passes the DFF output data to the PE3 output and feeds the input data to the DFF input. During the next N/8 clock cycles, PE3 adds the DFF output data to the input data and passes the sum to the PE3 output, and it subtracts the DFF output data from the input data and feeds the difference to the DFF input.

Fig 5.19 Block diagram of the PE3

At first, we used a big MUX and control signals to select the twiddle factor. In the Huffman decoding section of this chapter, we found that complex control signals slow down the clock. In the IFFT, the complex control signals might not even be synthesizable on the FPGA. So we try a simpler way to implement the twiddle multiplier that does not use complex control signals: we put the twiddle factors in a circular shift register in order and access the first one at each clock cycle. Fig. 5.20 shows the circular shift register of the twiddle factor multiplier. In this way, we avoid complex control signals.

Fig 5.20 Block diagram of the twiddle factor multiplier
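The circular shift register can be modeled in software with a rotating queue. This is only a behavioral sketch: the class name, the method name, and the placeholder factor values are hypothetical, and the real register holds the twiddle factors in the order required by the data flow of the PE.

```python
from collections import deque


class TwiddleRing:
    """Behavioral model of a circular shift register of twiddle factors."""

    def __init__(self, factors):
        self._ring = deque(factors)

    def step(self):
        """Return the factor at the head, then rotate by one position."""
        value = self._ring[0]
        self._ring.rotate(-1)  # the head moves to the tail, as in a circular shift
        return value


ring = TwiddleRing([10, 20, 30])            # placeholder values, not real twiddles
print([ring.step() for _ in range(7)])      # the sequence wraps around
```

No address counter or selection logic is needed: the factor to use is always at the head, and the rotation itself encodes the order.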

Fig. 5.21 shows the signal waveform of the IFFT. When the DSP sends the “input_valid” signal to the FPGA, the input data start to transfer sequentially. When the FPGA sends the “output_valid” signal to the DSP, the output data start to transfer sequentially.

Fig 5.21 Waveform of the radix-2³ 512-point IFFT

5.2.4 IFFT Implementation Result

Fig. 5.22 and Fig. 5.23 show the synthesis report and the P&R report of the IFFT. The clock frequency after P&R can reach 93.14 MHz, which means the IFFT can process 95.9k long windows of data per second. We use a test sequence with 12 long windows of data. The comparison of the DSP implementation and the FPGA implementation is shown in Table 5.3.

Fig. 5.22 Synthesis report of radix-2³ 512-point IFFT

Timing Summary:
Speed Grade: -6
Minimum period: 11.941ns (Maximum Frequency: 83.745MHz)
Minimum input arrival time before clock: 2.099ns
Maximum output required time after clock: 4.994ns
Maximum combinational path delay: No path found

Selected Device : 2v6000ff1152-6
Number of Slices: 17045 out of 33792 50%
Number of Slice Flip Flops: 28295 out of 67584 41%
Number of 4 input LUTs: 2503 out of 67584 3%
Number of bonded IOBs: 67 out of 824 8%
Number of MULT18X18s: 54 out of 144 37%
Number of GCLKs: 1 out of 16 6%

Fig. 5.23 P&R report of radix-2³ 512-point IFFT

Timing Summary:
Speed Grade: -6

Design Summary
Logic Utilization:
Number of Slice Flip Flops: 28,267 out of 67,584 41%
Number of 4 input LUTs: 2,420 out of 67,584 3%

Logic Distribution:
Number of occupied Slices: 15,231 out of 33,792 45%
Number of Slices containing only related logic: 15,231 out of 15,231 100%
Number of Slices containing unrelated logic: 0 out of 15,231 0%
Total Number 4 input LUTs: 2,568 out of 67,584 3%
Number used as logic: 2,420
Number used as a route-thru: 148
Number of bonded IOBs: 68 out of 824 8%
IOB Flip Flops: 28
Number of MULT18X18s: 54 out of 144 37%
Number of GCLKs: 1 out of 16 6%
Total equivalent gate count for design: 464,785

5.3 Implementation on DSP/FPGA

We have implemented and optimized MPEG-4 AAC on the TI C64x DSP and the Xilinx Virtex-II FPGA. The optimized result is shown in Table 5.4. We use a 0.95-second test sequence to compare the performance of the DSP implementation and the DSP/FPGA implementation. The overall speed is 8.17 times faster than the original version, and the DSP/FPGA version can process 48 seconds of audio data in 1 second.

Table 5.4 Comparison of DSP and DSP/FPGA implementation

Chapter 6 Conclusion and Future Work

We have implemented the MPEG-4 AAC decoder on DSP and FPGA together. In this project, we speed up the IMDCT implementation on DSP, and the modified version is 503 times faster than the original version. We then implement the Huffman decoding and the IFFT on FPGA. The implemented and optimized results are faster than the DSP version, as expected.

For the IMDCT calculation, we use the radix-2³ FFT algorithm in the DSP implementation. We use a fixed-point data type to represent the input data, rearrange the order of the data calculations in the IFFT, and use intrinsic functions to speed up the IFFT. The result is 503 times faster than the original version. The details of our design and results can be found in Chapter 4.

We use FPGA to implement the fixed-output-rate Huffman decoder. We also modify this architecture into a variable-output-rate architecture intended to be more efficient. The latter is in fact slower than the former because of the complexity of its control signals, which create slow paths on the FPGA. The FPGA implementation is about 56 times faster than the DSP implementation.

We also use FPGA to implement the IFFT. As in the DSP implementation, we use the radix-2³ FFT algorithm for the IFFT. The 512-point IFFT has a heavy computational load; therefore, we use three types of PE to perform these computations in order to reduce the chip area. The FPGA implementation of the IFFT is about 4 times faster than the fastest DSP version. The details of our design and results can be found in Chapter 5.

Due to a board hardware defect and/or a system software bug, we have not yet been able to run and test our implementations on the DSP/FPGA baseboard. Thus, there are two important targets for future work. First, the DSP implementation should be executed on the DSP baseboard, which requires a streaming interface to the Host PC for real-time execution. The Host PC reads the source data from a file into memory and then transfers the data to the DSP through the streaming interface. After the DSP has processed the data, it transfers the data back to the Host PC. The second target is to integrate the FPGA implementation with the DSP to demonstrate the overall system. The DSP does the pre-processing and then transfers the data to the FPGA through the streaming interface. After the FPGA has processed the data, it transfers the data back to the DSP.

Bibliography

[1] ISO/IEC JTC/SC29/WG11 MPEG, International Standard ISO/IEC 13818-7 “Advanced Audio Coding”, 1997

[2] ISO/IEC JTC/SC29/WG11 MPEG, International Standard ISO/IEC 14496-3 “Advanced Audio Coding”, 1999

[3] M. Bosi et al., “ISO/IEC MPEG-2 Advanced Audio Coding”, JAES, Vol. 45, No. 10
