Chapter 5 MPEG-4 AAC Implementation and Optimization on DSP/FPGA
5.3 Implementation on DSP/FPGA
We have implemented and optimized MPEG-4 AAC on the TI C64x DSP and the Xilinx Virtex-II FPGA. The optimized results are shown in Table 5.4. We use a 0.95-second test sequence to compare the performance of the DSP implementation with that of the DSP/FPGA implementation. Overall, the optimized implementation is 8.17 times faster than the original version, and the DSP/FPGA version can process 48 seconds of audio data in 1 second.
Table 5.4 Comparison of DSP and DSP/FPGA implementation
Chapter 6
Conclusion and Future Work
We have implemented the MPEG-4 AAC decoder on a DSP and an FPGA together. In this project, we first speed up the IMDCT on the DSP; the modified version is 503 times faster than the original version. We then implement the Huffman decoding and the IFFT on the FPGA. As expected, the optimized FPGA implementations are faster than the DSP versions.
For the IMDCT calculation, we use the radix-2^3 FFT algorithm in the DSP implementation. We use a fixed-point data type to represent the input data, rearrange the order of the data calculations in the IFFT, and use intrinsic functions to speed up the IFFT. The result is 503 times faster than the original version. The details of our design and results can be found in Chapter 4.
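To make the fixed-point step concrete, the following is a minimal sketch of a complex twiddle-factor multiplication in fixed-point arithmetic. The Q15 format (16-bit signed values with 15 fractional bits), the rounding rule, and all function names are assumptions chosen for illustration; the word length actually used in the DSP code is not stated here.

```python
import math

Q = 15                # assumed number of fractional bits (Q15 format)
ONE = 1 << Q          # fixed-point representation of 1.0


def to_q15(x: float) -> int:
    """Quantize a float in [-1, 1) to a saturated 16-bit Q15 integer."""
    v = int(round(x * ONE))
    return max(-ONE, min(ONE - 1, v))


def q15_mul(a: int, b: int) -> int:
    """Multiply two Q15 values with rounding and return a Q15 value."""
    return (a * b + (1 << (Q - 1))) >> Q


def q15_cmul(ar: int, ai: int, br: int, bi: int):
    """Complex product (ar + j*ai) * (br + j*bi) in Q15.

    The sums are not saturated here; a real implementation would need
    guard bits or scaling between butterfly stages.
    """
    real = q15_mul(ar, br) - q15_mul(ai, bi)
    imag = q15_mul(ar, bi) + q15_mul(ai, br)
    return real, imag


def twiddle_q15(n: int, k: int):
    """Q15 twiddle factor W_n^k = exp(-j*2*pi*k/n)."""
    angle = -2.0 * math.pi * k / n
    return to_q15(math.cos(angle)), to_q15(math.sin(angle))
```

For example, `q15_cmul(to_q15(0.5), to_q15(0.25), *twiddle_q15(8, 1))` rotates the sample 0.5 + 0.25j by W_8^1; dividing the two results by `ONE` gives values close to the floating-point product, with an error on the order of the Q15 quantization step.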
We use the FPGA to implement a fixed-output-rate Huffman decoder. We also modify this architecture into a variable-output-rate architecture, which is more efficient in principle. In practice, however, the latter is slower than the former because its more complex control signals create slow paths on the FPGA. The FPGA implementation is about 56 times faster than the DSP implementation.
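As a software illustration of the fixed-output-rate idea, the sketch below decodes exactly one symbol per step: it peeks at a window of the maximum code length, looks the window up in a table, and then shifts the input by the length of the matched code. We assume here that "fixed output rate" means one decoded symbol per step regardless of code length. The codebook is a hypothetical four-symbol prefix code, not an AAC codebook, and the function names are ours.

```python
# Hypothetical prefix code for illustration only; this is NOT an AAC codebook.
CODEBOOK = {"0": "A", "10": "B", "110": "C", "111": "D"}
MAX_LEN = max(len(code) for code in CODEBOOK)


def build_lut(codebook, max_len):
    """Map every max_len-bit window to (symbol, code length)."""
    lut = {}
    for code, symbol in codebook.items():
        pad = max_len - len(code)
        for tail in range(1 << pad):
            # Every possible continuation after the code maps to the same symbol.
            suffix = format(tail, "b").zfill(pad) if pad else ""
            lut[code + suffix] = (symbol, len(code))
    return lut


LUT = build_lut(CODEBOOK, MAX_LEN)


def decode(bits: str):
    """Decode one symbol per step: peek MAX_LEN bits, look up, shift by code length."""
    out, pos = [], 0
    while pos < len(bits):
        # Zero-pad the final window so the lookup always sees MAX_LEN bits.
        window = bits[pos:pos + MAX_LEN].ljust(MAX_LEN, "0")
        symbol, length = LUT[window]
        out.append(symbol)
        pos += length
    return out
```

In this model the number of input bits consumed per step varies with the code length, which is what makes the input shifter the critical part of such a design; a variable-output-rate decoder would instead consume a fixed number of bits per step and emit a varying number of symbols.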
We also use the FPGA to implement the IFFT. As in the DSP implementation, we use the radix-2^3 FFT algorithm for the IFFT. The 512-point IFFT has a heavy computational load. Therefore, we use three types of PE to perform these computations in order to reduce the chip area. The FPGA implementation of the IFFT is about 4 times faster than the fastest DSP version. The details of our design and results can be found in Chapter 5.
Due to a hardware defect on the board and/or a bug in the system software, we have not yet been able to run and test our implementations on the DSP/FPGA baseboard. Thus, there are two important targets for future work. First, the DSP implementation should be executed on the DSP baseboard, with a streaming interface connecting it to the host PC for real-time execution. The host PC reads the source data from the file in memory and then transfers the data to the DSP through the streaming interface. After the DSP has processed the data, it transfers the results back to the host PC. The second target is to integrate the FPGA implementation with the DSP to demonstrate the overall system. The DSP does the pre-processing and then transfers the data to the FPGA through the streaming interface. After the FPGA has processed the data, it transfers the results back to the DSP.
Bibliography
[1] ISO/IEC JTC1/SC29/WG11 MPEG, International Standard ISO/IEC 13818-7, “Advanced Audio Coding,” 1997.
[2] ISO/IEC JTC1/SC29/WG11 MPEG, International Standard ISO/IEC 14496-3, “Advanced Audio Coding,” 1999.
[3] M. Bosi et al., “ISO/IEC MPEG-2 Advanced Audio Coding,” JAES, Vol. 45, No. 10, Oct. 1997.
[4] M. Wolters et al., “A closer look into MPEG-4 High Efficiency AAC,” AES 115th Convention Paper, 2003.
[5] Innovative Integration, “Quixote User’s Manual,” Dec. 2003.
[6] Texas Instruments, “TMS320C6000 Programmer’s Guide,” SPRU198F, Feb. 2001.
[7] Texas Instruments, “TMS320C6000 CPU and Instruction Set Reference Guide,” SPRU189F, Jan. 2000.
[8] Texas Instruments, “TMS320C6000 Peripherals Reference Guide,” SPRU190D, Mar. 2001.
[9] Texas Instruments, “TMS320C64x Technical Overview,” SPRU395B, Jan. 2001.
[10] Xilinx, “Virtex-II Platform FPGA User Guide,” UG002 (v1.7), Feb. 2004.
[11] K. S. Lee et al., “A VLSI implementation of MPEG-2 AAC decoder system,” ASICs, 1999, AP-ASIC '99, The First IEEE Asia Pacific Conf., pp. 139-142, 23-25 Aug. 1999.
[12] M. K. Rudberg and L. Wanhammar, “New approaches to high speed Huffman decoding,” IEEE Int. Symp., Vol. 2, pp. 149-152, 12-15 May 1996.
[13] M. K. Rudberg and L. Wanhammar, “High speed pipelined parallel Huffman decoding,” IEEE Proc. Int. Symp., Vol. 3, pp. 2080-2083, 9-12 Jun. 1997.
[14] P. Duhamel et al., “A fast algorithm for the implementation of filter banks based on ‘time domain aliasing cancellation’,” IEEE Trans. Acous., Speech, Signal Processing, ICASSP, Vol. 3, pp. 2209-2212, Apr. 1991.
[15] P. Duhamel and H. Hollmann, “Split-radix FFT algorithm for complex, real, and real symmetric data,” IEEE Trans. Acous., Speech, Signal Processing, ICASSP, Vol. 10, pp. 784-787, Apr. 1985.
[16] S. He and M. Torkelson, “A new approach to pipeline FFT processor,” IEEE Proc. 10th Int. Parallel Processing Symp., IPPS, Apr. 1996.
[17] S. He and M. Torkelson, “Designing pipeline FFT processor for OFDM (de)modulation,” IEEE Proc. URSI Int. Symp. Signals, Syst., Electron., pp. 257-262, Oct. 1998.
[18] W. C. Yeh and C. W. Jen, “High speed and low power split-radix FFT,” IEEE Trans. Signal Processing, Vol. 51, No. 3, Mar. 2003.
Appendix A
N/4-point FFT Algorithm for MDCT
In this appendix we describe the N/4-point complex FFT in detail and give the mathematical derivation of the algorithm. The details can be found in [14].
A.1 MDCT
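The MDCT can be seen as a block of signals $x_m(n)$ projected on a set of cosine functions as follows:

$$Y_m(k) = \sum_{n=0}^{N-1} x_m(n)\,h(n)\cos\left[\frac{2\pi}{N}\,(n+n_0)\left(k+\frac{1}{2}\right)\right],\quad k = 0,\ldots,N-1, \tag{A.1}$$

where $h(n)$ is the analysis window and $n_0 = N/4 + 1/2$. This transform is not invertible: since

$$Y_m(N-1-k) = -Y_m(k), \tag{A.2}$$

only N/2 output points are linearly independent.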
However, if two adjacent blocks $x_m(n)$ and $x_{m+1}(n)$ overlap by N/2, the set of values $x_m(n)$ can be recovered from two successive output sets $Y_{m-1}(k)$ and $Y_m(k)$. Let

[equation lost in the source]

This reconstruction is perfect when the windows are symmetric and identical, that is, g(n) = h(n).
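The overlap-add reconstruction described above can be checked numerically. The sketch below is a direct O(N^2) model, not the fast algorithm of this appendix. It assumes a sine window (which is symmetric and also satisfies h(n)^2 + h(n+N/2)^2 = 1), a synthesis scale factor of 4/N, and n0 = N/4 + 1/2; these choices and the function names are assumptions made for the example, since the original reconstruction formula is not available here.

```python
import math


def sine_window(N):
    """Symmetric window with h(n)^2 + h(n + N/2)^2 = 1 (an assumed choice)."""
    return [math.sin(math.pi * (n + 0.5) / N) for n in range(N)]


def mdct(block, h):
    """Windowed MDCT of one N-sample block; only the N/2 independent outputs."""
    N = len(block)
    n0 = N / 4 + 0.5
    return [
        sum(block[n] * h[n] * math.cos(2 * math.pi / N * (n + n0) * (k + 0.5))
            for n in range(N))
        for k in range(N // 2)
    ]


def imdct(Y, g):
    """Inverse transform of N/2 coefficients, followed by the synthesis window g."""
    N = 2 * len(Y)
    n0 = N / 4 + 0.5
    return [
        g[n] * (4.0 / N) * sum(
            Y[k] * math.cos(2 * math.pi / N * (n + n0) * (k + 0.5))
            for k in range(N // 2))
        for n in range(N)
    ]


def tdac_roundtrip(x, N):
    """Analyse blocks that overlap by N/2, then overlap-add the synthesized blocks."""
    hop = N // 2
    h = sine_window(N)          # g(n) = h(n): identical analysis and synthesis windows
    out = [0.0] * len(x)
    for start in range(0, len(x) - N + 1, hop):
        y = imdct(mdct(x[start:start + N], h), h)
        for n in range(N):
            out[start + n] += y[n]
    return out
```

Each synthesized block on its own contains time-domain aliasing; only the samples covered by two overlapping blocks are reconstructed, so the first and last N/2 samples of the output are not expected to match the input.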
A.2 N/4-Point FFT
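The antisymmetry of the MDCT output coefficients means that only half of them need to be computed. In order to obtain a formula which is easy to handle, we have chosen to keep the even coefficients; the odd ones are deduced from Eq. (A.2). Hence Eq. (A.1) is equivalent to

$$Y_m(2k) = \sum_{n=0}^{N-1} x_m(n)\,h(n)\cos\left[\frac{2\pi}{N}\,(n+n_0)\left(2k+\frac{1}{2}\right)\right],\quad k = 0,\ldots,\frac{N}{2}-1,$$

with the same window $h(n)$ and phase term $n_0$ as in Eq. (A.1), which can be rewritten as

[equation lost in the source]

... permutation, which is typical in the DCT case.

Here we will use two symbols:

[equation lost in the source]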
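The even/odd split used in Section A.2 can be illustrated with a direct computation. This sketch assumes n0 = N/4 + 1/2, a block length N that is a multiple of 4, and a block that has already been windowed; it only shows that the even coefficients determine the odd ones and is not the N/4-point FFT algorithm itself. The function names are ours.

```python
import math


def mdct_coeff(x, k):
    """Y(k) from the MDCT definition; x is one block, assumed already windowed."""
    N = len(x)
    n0 = N / 4 + 0.5
    return sum(x[n] * math.cos(2 * math.pi / N * (n + n0) * (k + 0.5))
               for n in range(N))


def mdct_from_even(x):
    """Compute only the even-indexed coefficients and deduce the odd ones."""
    N = len(x)                      # assumed to be a multiple of 4
    Y = [0.0] * N
    for k in range(0, N, 2):        # even indices only
        Y[k] = mdct_coeff(x, k)
        Y[N - 1 - k] = -Y[k]        # N-1-k is odd: antisymmetry Y(N-1-k) = -Y(k)
    return Y
```

Because N is even, the index N-1-k runs over every odd index exactly once as k runs over the even indices, so the N/2 even coefficients are sufficient to fill the whole output vector.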
Appendix B
Radix-2^2 and Radix-2^3 FFT
In this appendix we describe the radix-2^2 and radix-2^3 FFT in detail and discuss the mathematical derivation of the algorithms. The details can be found in [16] and [17].
B.1 Radix-2^2 FFT
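First, the analytical expression for the FFT is

$$X(k) = \sum_{n=0}^{N-1} x(n)\,W_N^{nk},\quad k = 0,\ldots,N-1, \tag{B.1}$$

and the analytical expression for the IFFT is

$$x(n) = \frac{1}{N}\sum_{k=0}^{N-1} X(k)\,W_N^{-nk},\quad n = 0,\ldots,N-1, \tag{B.2}$$

where $W_N = e^{-j2\pi/N}$.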
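As a reference model for the two expressions above, the following sketch evaluates the forward and inverse transforms directly from their definitions, with the 1/N factor placed in the inverse. It also shows the standard conjugation identity that lets a forward FFT routine compute the inverse, which is one common way to reuse an FFT algorithm for the IFFT; we do not claim that the implementations in this work are organized this way. The function names are ours.

```python
import cmath


def dft(x):
    """Direct O(N^2) evaluation of X(k) = sum_n x(n) * W_N^(n*k)."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / N) for n in range(N))
            for k in range(N)]


def idft(X):
    """Direct O(N^2) evaluation of x(n) = (1/N) * sum_k X(k) * W_N^(-n*k)."""
    N = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * n * k / N) for k in range(N)) / N
            for n in range(N)]


def idft_via_dft(X):
    """Inverse transform through the forward one: x = conj(DFT(conj(X))) / N."""
    N = len(X)
    return [v.conjugate() / N for v in dft([v.conjugate() for v in X])]
```

Such a direct model is far too slow for real use, but it is convenient as a bit-exactness or tolerance reference when testing an optimized FFT.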
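The derivation of the radix-2^2 FFT algorithm starts with a substitution using a 3-dimensional index map. The indices n and k in Eq. (B.1) can be expressed as

$$n = \left\langle \frac{N}{2}n_1 + \frac{N}{4}n_2 + n_3 \right\rangle_N, \tag{B.3}$$

$$k = \left\langle k_1 + 2k_2 + 4k_3 \right\rangle_N, \tag{B.4}$$

where $n_1, n_2, k_1, k_2 \in \{0, 1\}$ and $n_3, k_3 = 0, \ldots, N/4-1$.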
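When the above substitutions are applied to the DFT definition, the definition can be rewritten as

$$X(k_1+2k_2+4k_3) = \sum_{n_3=0}^{N/4-1}\sum_{n_2=0}^{1}\sum_{n_1=0}^{1} x\!\left(\tfrac{N}{2}n_1+\tfrac{N}{4}n_2+n_3\right) W_N^{(\frac{N}{2}n_1+\frac{N}{4}n_2+n_3)(k_1+2k_2+4k_3)}$$

$$= \sum_{n_3=0}^{N/4-1}\sum_{n_2=0}^{1}\left\{B_{N/2}^{k_1}\!\left(\tfrac{N}{4}n_2+n_3\right) W_N^{(\frac{N}{4}n_2+n_3)k_1}\right\} W_N^{(\frac{N}{4}n_2+n_3)(2k_2+4k_3)}, \tag{B.5}$$

where

$$B_{N/2}^{k_1}\!\left(\tfrac{N}{4}n_2+n_3\right) = x\!\left(\tfrac{N}{4}n_2+n_3\right) + (-1)^{k_1}\,x\!\left(\tfrac{N}{4}n_2+n_3+\tfrac{N}{2}\right), \tag{B.6}$$

which is a general radix-2 butterfly.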
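Now, the product of the two twiddle factors in Eq. (B.5) can be rewritten as

$$W_N^{(\frac{N}{4}n_2+n_3)k_1}\,W_N^{(\frac{N}{4}n_2+n_3)(2k_2+4k_3)} = W_N^{N n_2 k_3}\,W_N^{\frac{N}{4}n_2(k_1+2k_2)}\,W_N^{n_3(k_1+2k_2)}\,W_N^{4n_3k_3} = (-j)^{n_2(k_1+2k_2)}\,W_N^{n_3(k_1+2k_2)}\,W_N^{4n_3k_3}. \tag{B.7}$$

Observe that the last twiddle factor in Eq. (B.7) can be rewritten as

$$W_N^{4n_3k_3} = e^{-j\frac{2\pi}{N}4n_3k_3} = e^{-j\frac{2\pi}{N/4}n_3k_3} = W_{N/4}^{n_3k_3}. \tag{B.8}$$

Substituting Eq. (B.8) and Eq. (B.7) into Eq. (B.5) and expanding the summation over $n_2$ gives a set of four DFTs, each four times shorter:

$$X(k_1+2k_2+4k_3) = \sum_{n_3=0}^{N/4-1}\left[H(k_1,k_2,n_3)\,W_N^{n_3(k_1+2k_2)}\right] W_{N/4}^{n_3k_3}. \tag{B.9}$$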
The result is that the butterflies have the following structure. The PE2 butterfly takes its input from two PE1 butterflies:

$$H(k_1,k_2,n_3) = \left[x(n_3) + (-1)^{k_1}\,x\!\left(n_3+\tfrac{N}{2}\right)\right] + (-j)^{(k_1+2k_2)}\left[x\!\left(n_3+\tfrac{N}{4}\right) + (-1)^{k_1}\,x\!\left(n_3+\tfrac{3N}{4}\right)\right]. \tag{B.10}$$

These calculations are for the first radix-2^2 butterfly, which is composed of the PE1 and PE2 butterflies. PE1 is represented by the formulas in brackets in Eq. (B.10), and PE2 is the outer computation in the same equation. The complete radix-2^2 algorithm is derived by applying this procedure recursively.
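The first radix-2^2 decomposition step can be verified against a direct DFT. The sketch below applies the PE1 and PE2 butterflies, the twiddle factor W_N^(n3(k1+2k2)), and then four N/4-point DFTs, as in the derivation above. For clarity the short DFTs are evaluated directly rather than recursively, and floating-point complex arithmetic is used; the function names are ours and the code is a numerical check, not a model of the hardware pipeline.

```python
import cmath


def dft(x):
    """Direct O(N^2) DFT used as the reference and for the short transforms."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / N) for n in range(N))
            for k in range(N)]


def radix22_stage(x):
    """One radix-2^2 step: PE1, PE2, twiddle, then four N/4-point DFTs."""
    N = len(x)                                  # assumed to be a multiple of 4
    quarter = N // 4
    X = [0j] * N
    for k1 in (0, 1):
        for k2 in (0, 1):
            rot = (-1j) ** (k1 + 2 * k2)        # trivial factor: 1, -j, -1 or j
            seq = []
            for n3 in range(quarter):
                # PE1: two radix-2 butterflies (the bracketed terms of H)
                pe1_a = x[n3] + (-1) ** k1 * x[n3 + N // 2]
                pe1_b = x[n3 + quarter] + (-1) ** k1 * x[n3 + 3 * quarter]
                # PE2: combine the two PE1 outputs with the trivial rotation
                h = pe1_a + rot * pe1_b
                # Non-trivial twiddle factor W_N^(n3*(k1 + 2*k2))
                seq.append(h * cmath.exp(-2j * cmath.pi * n3 * (k1 + 2 * k2) / N))
            sub = dft(seq)                      # N/4-point DFT over n3
            for k3 in range(quarter):
                X[k1 + 2 * k2 + 4 * k3] = sub[k3]
    return X
```

Replacing the call to `dft(seq)` with a recursive call to the same routine (down to a trivial length) would give the complete radix-2^2 algorithm for lengths that are powers of 4.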
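B.2 Radix-2^3 FFT
The derivation of the radix-2^3 FFT algorithm uses a 4-dimensional index map. The indices n and k in Eq. (B.1) can be expressed as

$$n = \left\langle \frac{N}{2}n_1 + \frac{N}{4}n_2 + \frac{N}{8}n_3 + n_4 \right\rangle_N, \tag{B.11}$$

$$k = \left\langle k_1 + 2k_2 + 4k_3 + 8k_4 \right\rangle_N, \tag{B.12}$$

where $n_1, n_2, n_3, k_1, k_2, k_3 \in \{0, 1\}$ and $n_4, k_4 = 0, \ldots, N/8-1$.
When the above substitutions are applied to the DFT definition, the definition can be rewritten as

$$X(k_1+2k_2+4k_3+8k_4) = \sum_{n_4=0}^{N/8-1}\sum_{n_3=0}^{1}\sum_{n_2=0}^{1}\left\{B_{N/2}^{k_1}\!\left(\tfrac{N}{4}n_2+\tfrac{N}{8}n_3+n_4\right) W_N^{(\frac{N}{4}n_2+\frac{N}{8}n_3+n_4)k_1}\right\} W_N^{(\frac{N}{4}n_2+\frac{N}{8}n_3+n_4)(2k_2+4k_3+8k_4)}, \tag{B.13}$$

where $B_{N/2}^{k_1}(n') = x(n') + (-1)^{k_1}\,x(n'+\tfrac{N}{2})$ is a general radix-2 butterfly.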
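Now, the product of the two twiddle factors in Eq. (B.13) can be rewritten as

$$W_N^{(\frac{N}{4}n_2+\frac{N}{8}n_3+n_4)(k_1+2k_2+4k_3+8k_4)} = (-j)^{n_2(k_1+2k_2)}\,W_N^{\frac{N}{8}n_3(k_1+2k_2+4k_3)}\,W_{N/8}^{n_4k_4}\,W_N^{n_4(k_1+2k_2+4k_3)}. \tag{B.14}$$

Substituting Eq. (B.14) into Eq. (B.13) and expanding the summation with regard to the indices $n_1$, $n_2$ and $n_3$ gives, after simplification, a set of 8 DFTs of length N/8.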
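The third butterfly structure has the expression

$$T(k_1,k_2,k_3,n_4) = H(k_1,k_2,n_4) + W_N^{\frac{N}{8}(k_1+2k_2+4k_3)}\,H\!\left(k_1,k_2,n_4+\tfrac{N}{8}\right), \tag{B.15}$$

where $H$ is the output of the second butterfly as in Eq. (B.10).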
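As in the radix-2^2 FFT algorithm, Eq. (B.6) and Eq. (B.10) represent the first two columns of butterflies, which involve only trivial multiplications, in the radix-2^3 FFT algorithm. The third butterfly contains a special twiddle factor

$$W_N^{\frac{N}{8}(k_1+2k_2+4k_3)} = \left[\frac{\sqrt{2}}{2}(1-j)\right]^{k_1}(-j)^{k_2}\,(-1)^{k_3}. \tag{B.16}$$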
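The three butterfly columns of the radix-2^3 decomposition can be checked numerically in the same way. In the sketch below, `pe1`, `pe2` and `pe3` are our own names for the three butterfly types; `pe3` applies the special twiddle factor W_8^(k1+2k2+4k3), after which the factor W_N^(n4(k1+2k2+4k3)) and eight N/8-point DFTs complete the step. The short DFTs are evaluated directly, and the code is a floating-point check rather than a description of the FPGA datapath.

```python
import cmath

W8 = cmath.exp(-2j * cmath.pi / 8)              # equals (sqrt(2)/2) * (1 - j)


def dft(x):
    """Direct O(N^2) DFT used as the reference and for the short transforms."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / N) for n in range(N))
            for k in range(N)]


def pe1(x, n, k1):
    """First butterfly: x(n) + (-1)^k1 * x(n + N/2)."""
    return x[n] + (-1) ** k1 * x[n + len(x) // 2]


def pe2(x, n, k1, k2):
    """Second butterfly: trivial rotation by (-j)^(k1 + 2*k2)."""
    return pe1(x, n, k1) + (-1j) ** (k1 + 2 * k2) * pe1(x, n + len(x) // 4, k1)


def pe3(x, n, k1, k2, k3):
    """Third butterfly: special twiddle factor W_8^(k1 + 2*k2 + 4*k3)."""
    m = k1 + 2 * k2 + 4 * k3
    return pe2(x, n, k1, k2) + W8 ** m * pe2(x, n + len(x) // 8, k1, k2)


def radix23_stage(x):
    """One radix-2^3 step: three butterfly columns, twiddle, eight N/8-point DFTs."""
    N = len(x)                                  # assumed to be a multiple of 8
    eighth = N // 8
    X = [0j] * N
    for k1 in (0, 1):
        for k2 in (0, 1):
            for k3 in (0, 1):
                m = k1 + 2 * k2 + 4 * k3
                seq = [pe3(x, n4, k1, k2, k3)
                       * cmath.exp(-2j * cmath.pi * n4 * m / N)
                       for n4 in range(eighth)]
                sub = dft(seq)                  # N/8-point DFT over n4
                for k4 in range(eighth):
                    X[m + 8 * k4] = sub[k4]
    return X
```

Only the third butterfly needs a multiplication that is not a sign change or a swap of real and imaginary parts, and that multiplication is by a constant of the form (sqrt(2)/2)(±1 ± j) or a trivial factor, which is why it can be implemented more cheaply than a general complex multiplier.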