National Chiao Tung University
Department of Electronics Engineering and Institute of Electronics
Master's Thesis
MPEG-4 AAC Implementation
and Optimization on DSP/FPGA
Student: Chien-Tung Tseng
Advisor: Dr. Hsueh-Ming Hang
A Thesis
Submitted to Institute of Electronics
College of Electrical Engineering and Computer Science
National Chiao Tung University
in Partial Fulfillment of Requirements
for the Degree of
Master of Science
in
Electronics Engineering
June 2004
MPEG-4 AAC Implementation
and Optimization on DSP/FPGA
Student: Chien-Tung Tseng Advisor: Dr. Hsueh-Ming Hang
Department of Electronics Engineering
Institute of Electronics
National Chiao Tung University
Abstract
MPEG-4 AAC (Advanced Audio Coding) is an efficient audio coding standard defined by MPEG (Moving Picture Experts Group), a working group of ISO/IEC. In this thesis, we first profile an MPEG-4 AAC decoder program on the DSP. We find that Huffman decoding and the IMDCT (inverse modified discrete cosine transform) require the most clock cycles. Hence, we optimize the IMDCT code on the DSP. In addition, we use the FPGA to remove the bottlenecks of DSP execution: we implement Huffman decoding and the IFFT (inverse fast Fourier transform), which is a part of the IMDCT, on the FPGA.
To speed up the AAC decoder on the DSP, we choose algorithms that need fewer operations and fit the DSP architecture, choose data types suited to DSP processing, and use TI (Texas Instruments) DSP intrinsic functions to increase the execution efficiency. The modified IMDCT is about 503 times faster than the original version. For the FPGA implementation, we adopt and modify existing architectures for Huffman decoding and the 512-point IFFT, and we use VLSI design techniques to improve the performance and reduce the area. The FPGA Huffman decoder is about 56 times faster than the DSP version, and the FPGA IFFT is about 4 times faster than the fastest DSP version. Finally, we consider the communication between the DSP and FPGA designs.
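Acknowledgements
This thesis could not have been completed without the careful guidance and teaching of my advisor, Professor Hsueh-Ming Hang. During my two years of graduate study, Professor Hang not only guided my academic research but also gave me a great deal of advice on research attitude, and I offer him my deepest gratitude.
I also thank all the members of the Communication Electronics and Signal Processing Laboratory, including the teachers, classmates, and senior and junior students, and especially my seniors 楊政翰, 陳繼大, 吳俊榮, and 蔡家揚 for their guidance and suggestions during my research. I am also grateful to my labmates 仰哲, 明瑋, 子瀚, 筱晴, 盈縈, 明哲, and 宗書 for the discussions and mutual encouragement whenever difficulties arose, and I hope that 盈閩, 志楹, and 昱昇, the junior students who take over our work, will carry on the laboratory's earnest and harmonious atmosphere and contribute academically. I thank my girlfriend 佳韻 for her support and encouragement in daily life, which kept me healthy and balanced in body and mind throughout a demanding research process.
I thank my father and mother, who raised me for many years, and my younger brother. Without your nurturing and encouragement, I could not have achieved what I have today.
There are too many people to thank individually. I dedicate this thesis to everyone who made my graduate school years memorable. Thank you.
Chien-Tung Tseng
June 2004, Hsinchu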
Contents
Chapter 1 Introduction
Chapter 2 MPEG-2/4 Advanced Audio Coding
2.1 MPEG-2 AAC
2.1.1 Gain Control
2.1.2 Filterbank
2.1.3 Temporal Noise Shaping (TNS)
2.1.4 Intensity Coupling
2.1.5 Prediction
2.1.6 Middle/Side (M/S) Tool
2.1.7 Scalefactors
2.1.8 Quantization
2.1.9 Noiseless Coding
2.2 MPEG-4 AAC Version 1
2.2.1 Long Term Prediction (LTP)
2.2.2 Perceptual Noise Substitution (PNS)
2.2.3 TwinVQ
2.3 MPEG-4 AAC Version 2
2.3.1 Error Robustness
2.3.2 Bit Slice Arithmetic Coding (BSAC)
2.3.3 Low-Delay Audio Coding
2.4 MPEG-4 AAC Version 3
Chapter 3 Introduction to DSP/FPGA
3.1 DSP Baseboard
3.2 DSP Chip
3.4 Data Transmission Mechanism
3.4.1 Message Interface
3.4.2 Streaming Interface
Chapter 4 MPEG-4 AAC Decoder Implementation and Optimization on DSP
4.1 Profile on DSP
4.2 Optimizing C/C++ Code
4.2.1 Fixed-point Coding
4.2.2 Using Intrinsic Functions
4.2.3 Packet Data Processing
4.2.4 Loop Unrolling and Software Pipelining
4.2.5 Linear Assembly and Assembly
4.3 Huffman Decoding
4.4 IMDCT
4.4.1 N/4-point FFT Algorithm for MDCT
4.4.2 Radix-2^3 FFT
4.4.3 Implementation of IMDCT with Radix-2 IFFT
4.4.4 Implementation of IMDCT with Radix-2^3 IFFT
4.4.5 Modifying of the Data Calculation Order
4.4.6 Using Intrinsic Functions
4.4.7 IMDCT Implementation Results
4.5 Implementation on DSP
Chapter 5 MPEG-4 AAC Implementation and Optimization on DSP/FPGA
5.1 Huffman Decoding
5.1.1 Integration Consideration
5.1.2 Fixed-output-rate Architecture
5.1.3 Fixed-output-rate Architecture Implementation Result
5.1.4 Variable-output-rate Architecture
5.1.5 Variable-output-rate Architecture Implementation Result
5.2 IFFT
5.2.1 IFFT Architecture
5.2.2 Quantization Noise Analysis
5.2.3 Radix-2^3 SDF IFFT Architecture
5.2.4 IFFT Implementation Result
5.3 Implementation on DSP/FPGA
Chapter 6 Conclusions and Future Work
Bibliography
Appendix A N/4-point FFT Algorithm for MDCT
List of Tables
Table 4.1 Profile of AAC decoding on C64x DSP
Table 4.2 Processing time on the C64x DSP with different datatypes
Table 4.3 Comparison of computational load of FFT
Table 4.4 DSP implementation result of different datatypes
Table 4.5 SNR of IMDCT of different datatypes
Table 4.6 DSP implementation result of different datatypes
Table 4.7 SNR of IMDCT of different datatypes
Table 4.8 DSP implementation results of the modified data calculation order
Table 4.9 DSP implementation results of using intrinsic functions
Table 4.10 DSP implementation results of IMDCT
Table 4.11 Comparison of the modified IMDCT and IMDCT with TI IFFT library
Table 4.12 Comparison of original and the optimized performance
Table 4.13 The ODG of test sequence “guitar”
Table 4.14 The ODG of test sequence “eddie_rabbitt”
Table 5.1 The performance comparison of DSP and FPGA implementation
Table 5.2 Comparison of hardware requirements
Table 5.3 The performance comparison of DSP and FPGA implementation
List of Figures
Fig. 2.1 Block diagram for MPEG-2 AAC encoder
Fig. 2.2 Block diagram of gain control tool for encoder
Fig. 2.3 Window shape adaptation process
Fig. 2.4 Block switching during transient signal conditions
Fig. 2.5 Pre-echo distortion
Fig. 2.6 Prediction tool for one scalefactor band
Fig. 2.7 Block diagram of MPEG-4 GA encoder
Fig. 2.8 LTP in the MPEG-4 General Audio encoder
Fig. 2.9 TwinVQ quantization scheme
Fig. 3.1 Block Diagram of Quixote
Fig. 3.2 Block diagram of TMS320C6x DSP
Fig. 3.3 TMS320C64x CPU Data Path
Fig. 3.4 Functional Units and Operations Performed
Fig. 3.5 Functional Units and Operations Performed (Cont.)
Fig. 3.6 General Slice Diagram
Fig. 4.1 Intrinsic functions of the TI C6000 series DSP (Part.)
Fig. 4.2 Sequential model of Huffman decoder
Fig. 4.3 Parallel model of Huffman decoder
Fig. 4.4 Fast MDCT algorithm
Fig. 4.5 Fast IMDCT algorithm
Fig. 4.6 Butterflies for 8-point radix-2 FFT
Fig. 4.7 Butterflies for a radix-2^3 FFT PE
Fig. 4.8 Simplified data flow graph for 8-point radix-2^3 FFT
Fig. 4.9 Comparison of the data calculation order
Fig. 4.10 Intrinsic functions we used
Fig. 4.11 TI IFFT library
Fig. 5.1 Flow diagram of MPEG-4 AAC Huffman decoding
Fig. 5.2 Block diagram of DSP/FPGA integrated Huffman decoding
Fig. 5.3 Block diagram of fixed-output-rate architecture
Fig. 5.5 Waveform of the fixed-output-rate architecture
Fig. 5.6 Synthesis report of the fixed-output-rate architecture
Fig. 5.7 P&R report of the fixed-output-rate architecture
Fig. 5.8 Block diagram of the variable-output-rate architecture
Fig. 5.9 Comparison of the waveform of the two architectures
Fig. 5.10 Synthesis report for the variable-output-rate architecture
Fig. 5.11 P&R report for the variable-output-rate architecture
Fig. 5.12 Block diagram of shifter-adder multiplier
Fig. 5.13 Quantization noise analysis of twiddle multiplier is 256
Fig. 5.14 Quantization noise analysis of twiddle multiplier is 4096
Fig. 5.15 Block diagram of radix-2^3 SDF 512-point IFFT pipelined architecture
Fig. 5.16 Simplified data flow graph for each PE
Fig. 5.17 Block diagram of the PE1
Fig. 5.18 Block diagram of the PE2
Fig. 5.19 Block diagram of the PE3
Fig. 5.20 Block diagram of the twiddle factor multiplier
Fig. 5.21 Waveform of the radix-2^3 512-point IFFT
Fig. 5.22 Synthesis report of radix-2^3 512-point IFFT
Chapter 1
Introduction
MPEG stands for the “Moving Picture Experts Group.” It is a working group under the directives of the International Organization for Standardization (ISO) and the International Electrotechnical Commission (IEC). The group concentrates on defining standards for coding moving pictures, audio, and related data.
The MPEG-4 AAC (Advanced Audio Coding) standard is a very efficient audio coding standard at present. Similar to many other audio coding schemes, MPEG-4 AAC compresses audio data by removing the redundancy among samples. In addition, it includes several tools to enhance the coding performance, such as temporal noise shaping (TNS), perceptual noise substitution (PNS), and spectral band replication (SBR). Hence, the MPEG-4 AAC standard can compress audio data at high quality with high compression efficiency.
We implement the MPEG-4 AAC encoder and decoder on a DSP. The efficiency of some MPEG-4 AAC tools is limited by the data processing mechanism of the DSP. In this project, we use VLSI (very large scale integration) design concepts to improve the implementation. The idea is based on the SoC (System on a Chip) methodology.
We thus adopt a DSP/FPGA (Digital Signal Processor/Field Programmable Gate Array) platform to implement the MPEG-4 AAC encoder and decoder. The DSP baseboard is the Quixote, made by Innovative Integration. It houses a Texas Instruments TMS320C6416 DSP and a Xilinx Virtex-II FPGA. We also use the communication interface provided by the baseboard manufacturer. This thesis describes the implementation and optimization of an AAC decoder on the DSP and on the FPGA.
The organization of the thesis is as follows. In chapter 2, we describe the operations of MPEG-2 AAC and MPEG-4 AAC. Then, in chapter 3, we describe the DSP/FPGA environment. In chapter 4, we speed up the decoding process on the DSP. In chapter 5, we use the FPGA to implement Huffman decoding and the IFFT to improve the overall performance. At the end, we give conclusions and future work for our system.
Chapter 2
MPEG-2/4
Advanced Audio Coding
In this chapter, we will briefly describe the MPEG-2/4 AAC (Advanced Audio Coding) operating mechanism. Details can be found in [1] and [2] respectively.
2.1 MPEG-2 AAC
In 1994, an MPEG-2 audio standardization committee defined a high quality multi-channel standard without MPEG-1 backward compatibility. It was the beginning of the development of “MPEG-2 AAC.” The aim of MPEG-2 AAC was to reach “indistinguishable” audio quality at a data rate of 384 kbps or lower for five full-bandwidth channel audio signals, as specified by the ITU-R (International Telecommunication Union, Radiocommunication Sector). Test results showed that MPEG-2 AAC needed 320 kbps to achieve the ITU-R quality requirements. MPEG-2 AAC thus satisfied the ITU-R requirement, and it was finalized in 1997.
Like most digital audio coding schemes, the MPEG-2 AAC algorithm compresses audio signals by removing the redundancy between samples and the irrelevant audio signals. Time-frequency analysis is used to remove the redundancy between samples, and the signal masking properties of the human hearing system are used to remove irrelevant audio signals. To allow a tradeoff among the audio quality, the memory requirement, and the processing power requirement, the MPEG-2 AAC system offers three profiles: main profile, low-complexity (LC) profile, and scalable sampling rate (SSR) profile. Fig. 2.1 gives an overview of an MPEG-2 AAC encoder block diagram. We will describe each tool briefly in this section.
Fig. 2.1 Block diagram for MPEG-2 AAC encoder [1]
2.1.1 Gain Control
The gain control tool receives the time-domain signals and outputs gain control data and a signal whose length equals that of the modified discrete cosine transform (MDCT) window. Fig
gain control tool can be applied to each of four bands independently.
The tool is only available for the SSR profile because of the features of the SSR profile. If a lower bandwidth is needed for the output signals, lower sampling rate signals can be obtained by dropping the signals from the upper bands of the polyphase quadrature filter (PQF). The advantage of this scalability is that the decoder complexity can be reduced as the output bandwidth is reduced.
Fig. 2.2 Block diagram of gain control tool for encoder [2]
2.1.2 Filterbank
The filterbank tool converts the time-domain signals into a time-frequency representation. This conversion is done by an MDCT (modified discrete cosine transform), which employs the TDAC (time-domain aliasing cancellation) technique.
In the encoder, this filterbank takes in a block of time samples, modulates them by an appropriate window function, and performs the MDCT to ensure good frequency selectivity. Each block of input samples is overlapped by 50% with the immediately preceding block and the following block in order to reduce the boundary effect. Hence in the decoder, adjacent blocks of samples are overlapped and added after inverse MDCT (IMDCT).
The mathematical expression for the MDCT is
$$X_{i,k} = 2 \sum_{n=0}^{N-1} x_{i,n} \cos\left( \frac{2\pi}{N} \left(n + n_0\right) \left(k + \frac{1}{2}\right) \right), \quad k = 0, 1, \ldots, \frac{N}{2} - 1 \tag{2.1}$$
The mathematical expression of the IMDCT is
$$x_{i,n} = \frac{2}{N} \sum_{k=0}^{N/2-1} X_{i,k} \cos\left( \frac{2\pi}{N} \left(n + n_0\right) \left(k + \frac{1}{2}\right) \right), \quad n = 0, 1, \ldots, N - 1 \tag{2.2}$$
where
n = sample index
N = transform block length
i = block index
k = coefficient index
n0 = (N/2+1)/2
Since the window function has a significant effect on the filterbank frequency response, the filterbank has been designed to allow a change in window length and shape to adapt to the input signal conditions. There are two window lengths and two window shapes to choose from. Relatively short windows suit transient signals, and relatively long ones suit steady-state signals. Sine windows offer a narrow, selective passband, whereas Kaiser-Bessel derived (KBD) windows offer strong stopband attenuation.
Fig. 2.4 Block switching during transient signal conditions [2]
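The filterbank description above can be checked numerically. The following is a minimal sketch, assuming the MDCT uses a forward factor of 2 and the IMDCT a factor of 2/N with n0 = (N/2+1)/2, and assuming a sine window applied in both the analysis and the synthesis stage. The function names are hypothetical, and the direct O(N^2) summation is for illustration only; it is not the fast algorithm used later in this thesis.

```python
import math

def mdct(block):
    """Forward MDCT: N windowed samples -> N/2 coefficients (factor 2)."""
    N = len(block)
    n0 = (N / 2 + 1) / 2
    return [2.0 * sum(block[n] * math.cos(2 * math.pi / N * (n + n0) * (k + 0.5))
                      for n in range(N))
            for k in range(N // 2)]

def imdct(coeffs):
    """Inverse MDCT: N/2 coefficients -> N time-aliased samples (factor 2/N)."""
    N = 2 * len(coeffs)
    n0 = (N / 2 + 1) / 2
    return [(2.0 / N) * sum(coeffs[k] * math.cos(2 * math.pi / N * (n + n0) * (k + 0.5))
                            for k in range(N // 2))
            for n in range(N)]

def sine_window(N):
    return [math.sin(math.pi * (n + 0.5) / N) for n in range(N)]

def analyze_and_resynthesize(signal, N):
    """Window, transform, inverse transform, window again, and overlap-add by 50%."""
    hop = N // 2
    w = sine_window(N)
    out = [0.0] * len(signal)
    for start in range(0, len(signal) - N + 1, hop):
        block = [w[n] * signal[start + n] for n in range(N)]
        rebuilt = imdct(mdct(block))
        for n in range(N):
            out[start + n] += w[n] * rebuilt[n]
    return out

N = 16
hop = N // 2
signal = [math.sin(0.3 * t) + 0.5 * math.cos(1.1 * t) for t in range(64)]
output = analyze_and_resynthesize(signal, N)
# Only samples covered by two overlapping blocks are fully reconstructed.
max_error = max(abs(a - b) for a, b in zip(signal[hop:-hop], output[hop:-hop]))
```

Each single block comes back from the IMDCT with time-domain aliasing; the aliasing terms of adjacent blocks cancel in the overlap-add, which is why only the interior samples are compared.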
2.1.3 Temporal Noise Shaping (TNS)
The temporal noise shaping (TNS) tool is used to control the temporal shape of the quantization noise within each window of the transform. This is done by applying a filtering process to parts of the spectral data of each channel.
Handling transient and pitched signals is a major challenge in audio coding, because it is difficult to maintain the masking effect in the reproduced audio signals. The temporal mismatch between the masking threshold and the quantization noise causes what is called the “pre-echo” problem. Fig. 2.5 illustrates this phenomenon: the left figure shows the original temporal signal in a window, and the right figure shows the quantized spectral coefficients transformed back to the time domain.
The duality between the time domain and the frequency domain is used in predictive coding techniques. Signals with an “unflat” spectrum can be coded efficiently either by directly coding the spectral coefficients or by predictive coding of the time-domain signals. According to the duality property, signals with an “unflat” time structure, like transient signals, can be coded efficiently either by directly coding the time-domain samples or by applying predictive coding to the spectral coefficients. The TNS tool applies prediction over the frequency domain to enhance the temporal resolution.
In addition, if predictive coding is applied to the spectral coefficients, the temporal shape of the quantization noise adapts to that of the signal when decoded. Hence the quantization noise is kept under the original signal, and in this way the problem of temporal noise in transient or pitched signals can be avoided.
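The idea of predicting across frequency can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the predictor coefficients are hand-picked for a toy spectrum, and the coefficient estimation and quantization that a real TNS encoder performs are omitted.

```python
import math

def tns_analysis(spectrum, coeffs):
    """Encoder side: replace each coefficient by its prediction residual,
    predicting from the lower-frequency neighbours."""
    residual = []
    for k in range(len(spectrum)):
        prediction = sum(a * spectrum[k - j - 1]
                         for j, a in enumerate(coeffs) if k - j - 1 >= 0)
        residual.append(spectrum[k] - prediction)
    return residual

def tns_synthesis(residual, coeffs):
    """Decoder side: the inverse (all-pole) filter rebuilds the spectrum."""
    spectrum = []
    for k in range(len(residual)):
        prediction = sum(a * spectrum[k - j - 1]
                         for j, a in enumerate(coeffs) if k - j - 1 >= 0)
        spectrum.append(residual[k] + prediction)
    return spectrum

# Toy spectrum of a transient-like signal: it oscillates regularly across
# frequency, so a second-order predictor over k removes almost all of it.
spectrum = [math.cos(0.4 * k) for k in range(32)]
coeffs = [2 * math.cos(0.4), -1.0]   # assumed, matched to the toy spectrum

residual = tns_analysis(spectrum, coeffs)
rebuilt = tns_synthesis(residual, coeffs)

spectrum_energy = sum(v * v for v in spectrum)
residual_energy = sum(v * v for v in residual)
```

In this toy case the residual carries far less energy than the spectrum itself, and the decoder filter restores the original coefficients.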
2.1.4 Intensity Coupling
The human hearing system is sensitive to both the amplitude and the phase of low frequency signals. It is also sensitive to the amplitude of high frequency signals, but insensitive to their phase. The intensity coupling tool exploits this irrelevance between the high frequency signals of each pair of channels. It adds the high frequency signals of the left and right channels and multiplies the sum by a factor to rescale the result. The intensity signals replace the corresponding left channel high frequency signals, and the corresponding signals of the right channel are set to zero.
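A minimal sketch of the encoder-side operation follows. The text does not state how the rescaling factor is chosen; here we assume, for illustration only, that it makes the energy of the intensity signal equal to the energy of the original left channel band. The function name is hypothetical.

```python
import math

def intensity_couple(left_hf, right_hf):
    """Replace the left band by a rescaled sum and zero the right band."""
    summed = [l + r for l, r in zip(left_hf, right_hf)]
    left_energy = sum(l * l for l in left_hf)
    sum_energy = sum(s * s for s in summed)
    # Assumed rule: preserve the left-channel energy of the band.
    scale = math.sqrt(left_energy / sum_energy) if sum_energy > 0 else 0.0
    intensity = [scale * s for s in summed]
    return intensity, [0.0] * len(right_hf)

left_hf = [0.5, -0.25, 0.125, 0.75]
right_hf = [0.4, -0.3, 0.1, 0.8]
new_left, new_right = intensity_couple(left_hf, right_hf)
```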
2.1.5 Prediction Tool
The prediction tool is used for improved redundancy reduction in spectral coefficients. If the spectral coefficients are stationary between adjacent frames, the prediction tool will estimate
For each channel, there is one predictor for each spectral component from the spectral decomposition of the filterbank. The predictor exploits the autocorrelation between the spectral component values of consecutive frames. The predictor coefficients are calculated from preceding quantized spectral components in the encoder, so the spectral component can be recovered in the decoder without transmitting any predictor coefficients. A second-order backward-adaptive lattice structure predictor works on the spectral component values of the two preceding frames. The predictor parameters are adapted to the current signal statistics on a frame-by-frame basis, using an LMS-based adaptation algorithm. If prediction is activated, the quantizer is fed with a prediction error instead of the original spectral component, resulting in a higher coding efficiency.
Fig. 2.6 Prediction tool for one scalefactor band [2]
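The key property of backward adaptation is that the decoder can repeat the adaptation from the quantized data alone. The sketch below illustrates this for one spectral component tracked over frames. Note the substitutions: it uses a second-order transversal predictor with a normalized LMS update rather than the lattice structure of the standard, and a plain uniform quantizer; the class name, step size, and adaptation constant are assumptions made for illustration.

```python
import math

def quantize(value, step):
    return step * round(value / step)

class BackwardPredictor:
    """Second-order predictor adapted only from reconstructed values."""
    def __init__(self, mu=0.1):
        self.weights = [0.0, 0.0]
        self.history = [0.0, 0.0]   # two most recent reconstructed values
        self.mu = mu

    def predict(self):
        return sum(w * h for w, h in zip(self.weights, self.history))

    def update(self, reconstructed, quantized_error):
        energy = sum(h * h for h in self.history) + 1e-9
        self.weights = [w + self.mu * quantized_error * h / energy
                        for w, h in zip(self.weights, self.history)]
        self.history = [reconstructed, self.history[0]]

def encode(values, step):
    predictor = BackwardPredictor()
    errors = []
    for x in values:
        prediction = predictor.predict()
        error = quantize(x - prediction, step)   # only this is transmitted
        errors.append(error)
        predictor.update(prediction + error, error)
    return errors

def decode(errors):
    predictor = BackwardPredictor()
    output = []
    for error in errors:
        reconstructed = predictor.predict() + error
        output.append(reconstructed)
        predictor.update(reconstructed, error)
    return output

step = 0.01
values = [math.sin(0.2 * t) for t in range(200)]   # one component over 200 frames
decoded = decode(encode(values, step))
```

Because both sides update the predictor from the same quantized errors, they stay synchronized, and the reconstruction error is bounded by half a quantizer step.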
2.1.6 Middle/Side Tool
There are two ways to code each pair of the multi-channel signals: as the original left/right (L/R) signals or as the transformed middle/side (M/S) signals. When the left and right signals are highly correlated, coding their sum and difference requires fewer bits than coding them separately. Hence in the encoder, the M/S tool operates when the correlation between the left and right signals is higher than a threshold. The M/S tool transforms the L/R signals into M/S signals, where the middle signal equals the sum of the left and right signals, and the side signal equals their difference.
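The transform and its inverse can be sketched in a few lines. We assume here the common convention of scaling the sum and the difference by 1/2 so that the decoder recovers L and R by a plain sum and difference; the function names are hypothetical.

```python
def ms_encode(left, right):
    mid = [(l + r) / 2 for l, r in zip(left, right)]
    side = [(l - r) / 2 for l, r in zip(left, right)]
    return mid, side

def ms_decode(mid, side):
    left = [m + s for m, s in zip(mid, side)]
    right = [m - s for m, s in zip(mid, side)]
    return left, right

left = [1.0, 2.0, -3.0]
right = [1.0, 2.5, -3.0]          # highly correlated with the left channel
mid, side = ms_encode(left, right)
decoded_left, decoded_right = ms_decode(mid, side)
```

For strongly correlated channels the side signal is close to zero, which is where the bit saving comes from.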
2.1.7 Scalefactors
The human hearing system can be modeled as several overlapping bandpass filters. The higher the center frequency of a filter, the larger its bandwidth. These bandpass filters are called critical bands. The scalefactors tool divides the spectral coefficients into groups, called scalefactor bands, to imitate critical bands. Each scalefactor band has a scalefactor, and all the spectral coefficients in the scalefactor band are divided by this corresponding scalefactor. By adjusting the scalefactors, quantization noise can be modified to meet the bit-rate and distortion constraints.
2.1.8 Quantization
While all previous tools perform some kind of preprocessing of audio data, the real bit-rate reduction is achieved by the quantization tool. On the one hand, we want to quantize the spectral coefficients in such a way that the quantization noise stays under the masking threshold; on the other hand, we want to limit the number of bits required to code these quantized spectral coefficients.
There is no standardized strategy for obtaining optimum quantization. One important issue is the tuning between the psychoacoustic model and the quantization process. The main advantage of the nonuniform quantizer is its built-in noise shaping, which depends on the spectral coefficient amplitude: the signal-to-noise ratio increases much more slowly with rising signal energy than in a linear quantizer.
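The built-in noise shaping of a nonuniform quantizer can be illustrated with a power-law quantizer. The section above gives no formula, so the sketch below is an assumption: it uses the 3/4 power law, the scalefactor gain of 2^(sf/4), and the rounding offset 0.4054 that are commonly associated with AAC encoders. Function names are hypothetical.

```python
import math

ROUNDING_OFFSET = 0.4054   # assumed rounding constant

def quantize(x, scalefactor):
    """Scale by the band scalefactor, compress with a 3/4 power law, round."""
    scaled = abs(x) * 2.0 ** (-scalefactor / 4.0)
    q = int(scaled ** 0.75 + ROUNDING_OFFSET)
    return -q if x < 0 else q

def dequantize(q, scalefactor):
    """Expand with the inverse 4/3 power law and undo the scaling."""
    magnitude = abs(q) ** (4.0 / 3.0) * 2.0 ** (scalefactor / 4.0)
    return math.copysign(magnitude, q)

# The reconstruction levels are denser near zero than at large amplitudes.
small_step = dequantize(2, 0) - dequantize(1, 0)
large_step = dequantize(101, 0) - dequantize(100, 0)
```

Here `large_step` is several times `small_step`, so large coefficients are quantized more coarsely in absolute terms, and raising the scalefactor coarsens the whole band.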
2.1.9 Noiseless Coding
their location. Since the side information for carrying the clipped spectral coefficients costs some bits, this compression is applied only if it results in a net saving of bits.
Huffman coding is used to represent n-tuples of quantized spectral coefficients, and 12 codebooks are available. The spectral coefficients within an n-tuple are ordered from low frequency to high frequency, and the n-tuple size can be two or four spectral coefficients. Each codebook specifies the maximum absolute value that it can represent and the n-tuple size. Two codebooks are available for each maximum absolute value, and they represent two distinct probability distributions. Most codebooks represent unsigned values in order to save codebook storage. Sign bits of nonzero coefficients are appended to the codeword.
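The unsigned-codebook scheme with appended sign bits can be sketched as follows. The codebook here is a made-up toy prefix code for 2-tuples with a maximum absolute value of 1; it is not one of the AAC codebooks, and the function names are hypothetical.

```python
# Toy prefix code over unsigned pairs (hypothetical, not an AAC table).
TOY_CODEBOOK = {(0, 0): "0", (1, 0): "10", (0, 1): "110", (1, 1): "111"}
DECODE_TABLE = {code: pair for pair, code in TOY_CODEBOOK.items()}

def encode_pairs(coeffs):
    """Code each pair by magnitude, then append one sign bit per nonzero value."""
    bits = []
    for i in range(0, len(coeffs), 2):
        pair = coeffs[i:i + 2]
        bits.append(TOY_CODEBOOK[tuple(abs(c) for c in pair)])
        bits.extend("1" if c < 0 else "0" for c in pair if c != 0)
    return "".join(bits)

def decode_pairs(bitstring, count):
    """Read codewords bit by bit, then the sign bits of the nonzero values."""
    coeffs = []
    pos = 0
    while len(coeffs) < count:
        code = ""
        while code not in DECODE_TABLE:
            code += bitstring[pos]
            pos += 1
        for magnitude in DECODE_TABLE[code]:
            if magnitude != 0:
                sign = -1 if bitstring[pos] == "1" else 1
                pos += 1
                coeffs.append(sign * magnitude)
            else:
                coeffs.append(0)
    return coeffs

coeffs = [0, 0, 1, -1, -1, 0, 0, 1]
bitstring = encode_pairs(coeffs)
decoded = decode_pairs(bitstring, len(coeffs))
```

The bit-serial search in `decode_pairs` also shows why Huffman decoding is data dependent: the position of the next codeword is unknown until the current one has been decoded.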
2.2 MPEG-4 AAC Version 1
MPEG-4 AAC Version 1 was approved in 1998 and published in 1999. It has all the tools of MPEG-2 AAC and adds new ones, such as the long term predictor (LTP) tool, the perceptual noise substitution (PNS) tool, and the transform-domain weighted interleave vector quantization (TwinVQ) tool. The TwinVQ tool is an alternative to the MPEG-4 AAC quantization tool and noiseless coding tool. This new scheme, which combines AAC with TwinVQ, is officially called "General Audio (GA)." We will introduce these new tools in this section.
Fig. 2.7 Block diagram of MPEG-4 GA encoder [2]
2.2.1 Long Term Prediction
The long term prediction (LTP) tool is used to exploit the redundancy in the speech signal that is related to the signal periodicity, as expressed by the speech pitch. In speech, sounds are produced in a periodic way, so the pitch phenomenon is obvious. Such a phenomenon may exist in audio signals as well.
Fig. 2.8 LTP in the MPEG-4 General Audio encoder [2]
The LTP tool predicts the current frame from the previously decoded time-domain signal, whereas the MPEG-2 AAC prediction tool predicts each spectral component from its values in preceding frames. The quantized spectral coefficients are transformed back to the time-domain representation by the inverse filterbank and the associated inverse TNS tool. By comparing the locally decoded signal with the input signal, the optimum pitch lag and gain factor can be determined. The difference between the predicted signal and the original signal is then calculated and compared with the original signal. Either the difference or the original is selected for coding on a scalefactor band basis, depending on which alternative is more favorable.
The LTP tool provides considerable coding gain for stationary harmonic signals as well as for some non-harmonic tonal signals. Besides, the LTP tool has much lower computational complexity than the original prediction tool.
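The search for the pitch lag and the gain factor can be sketched as a correlation search over the previously decoded samples. This is an illustrative sketch under assumptions: the selection criterion (maximum of squared correlation over energy), the lag range, and the function names are ours, and the lag is restricted to at least one frame so that the candidate lies entirely in the past.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def find_pitch_lag_and_gain(past, current, min_lag, max_lag):
    """Return the lag and gain minimizing the energy of current - gain * candidate.
    Assumes min_lag >= len(current)."""
    best_lag, best_gain, best_score = min_lag, 0.0, -1.0
    for lag in range(min_lag, max_lag + 1):
        start = len(past) - lag
        candidate = past[start:start + len(current)]
        energy = dot(candidate, candidate)
        if energy == 0.0:
            continue
        correlation = dot(current, candidate)
        score = correlation * correlation / energy
        if score > best_score:
            best_lag, best_gain, best_score = lag, correlation / energy, score
    return best_lag, best_gain

# Toy signal with an exact period of 20 samples.
period = [math.sin(2 * math.pi * i / 20) for i in range(20)]
signal = period * 5
past, current = signal[:80], signal[80:96]

lag, gain = find_pitch_lag_and_gain(past, current, min_lag=16, max_lag=30)
start = len(past) - lag
residual = [c - gain * p for c, p in zip(current, past[start:start + len(current)])]
```

For this periodic input the search returns the true period and a gain of one, so the residual that would be coded is zero.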
2.2.2 Perceptual Noise Substitution
The perceptual noise substitution (PNS) tool gives a very compact representation of noise-like signals. In this way, the PNS tool increases the compression efficiency for some types of input signals.
In the encoder, the noise-like component of the input signal is detected on a scalefactor band basis. If the spectral coefficients in a scalefactor band are detected as noise-like, they are not quantized and entropy coded as usual. Instead, they are omitted from the quantization and entropy coding process, and only a noise substitution flag and their total power are coded and transmitted.
In the decoder, a pseudo noise signal with the desired total power is inserted for the substituted spectral coefficients. This technique results in high compression efficiency, since only a flag and the power information are coded and transmitted rather than all the spectral coefficients in the scalefactor band.
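A minimal sketch of the two sides follows. The noise detection itself is not shown; we simply assume a band has already been classified as noise-like. The function names and the choice of a uniform pseudo-random generator are assumptions for illustration.

```python
import math
import random

def pns_encode(band):
    """Transmit only a substitution flag and the total power of the band."""
    return True, sum(c * c for c in band)

def pns_decode(total_power, width, rng):
    """Insert pseudo noise rescaled to the transmitted total power."""
    noise = [rng.uniform(-1.0, 1.0) for _ in range(width)]
    noise_power = sum(v * v for v in noise)
    scale = math.sqrt(total_power / noise_power) if noise_power > 0 else 0.0
    return [scale * v for v in noise]

source = random.Random(1)
band = [source.gauss(0.0, 3.0) for _ in range(16)]   # a noise-like band

flag, power = pns_encode(band)
substituted = pns_decode(power, len(band), random.Random(0))
substituted_power = sum(v * v for v in substituted)
```

The decoded coefficients differ from the originals sample by sample, but the band power matches, which is the only property the decoder is asked to preserve.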
2.2.3 TwinVQ
The TwinVQ tool is an alternative quantization/coding kernel. It is designed to provide good coding efficiency at very low bitrates (16 kbps, or even as low as 6 kbps). The TwinVQ kernel first normalizes the spectral coefficients to a specified range, and then the spectral coefficients are quantized by means of a weighted vector quantization process.
The normalization process is carried out by several schemes such as linear predictive coding (LPC) spectral estimation, periodic component extraction, Bark-scale spectral estimation, and power estimation. As a result, the spectral coefficients are "flattened" and normalized across the frequency axis.
The weighted vector quantization process is carried out by interleaving the normalized spectral coefficients and dividing them into sub-vectors for vector quantization. For each sub-vector, a weighted distortion measure is applied to the conjugate structure VQ, which uses a pair of codebooks. Perceptual control of quantization noise is achieved in this way. The process is shown in Fig. 2.9.
Fig. 2.9 TwinVQ quantization scheme [2]
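The interleaving and the weighted search can be sketched as below. Several details are assumptions made for illustration: the reproduction vector is taken as the average of one code vector from each of the two codebooks, the search is exhaustive, and the tiny codebooks and the function names are hypothetical.

```python
def interleave(coeffs, num_subvectors):
    """Spread neighbouring coefficients over different sub-vectors."""
    return [coeffs[i::num_subvectors] for i in range(num_subvectors)]

def weighted_distortion(target, candidate, weights):
    return sum(w * (t - c) ** 2 for t, c, w in zip(target, candidate, weights))

def conjugate_vq_search(target, weights, codebook_a, codebook_b):
    """Pick the index pair whose averaged code vectors minimize the weighted error."""
    best = None
    for ia, va in enumerate(codebook_a):
        for ib, vb in enumerate(codebook_b):
            candidate = [(a + b) / 2 for a, b in zip(va, vb)]
            distortion = weighted_distortion(target, candidate, weights)
            if best is None or distortion < best[0]:
                best = (distortion, ia, ib)
    return best[1], best[2]

subvectors = interleave(list(range(8)), 2)

codebook_a = [[0.0, 0.0], [2.0, 2.0]]
codebook_b = [[0.0, 0.0], [0.0, 2.0]]
flat_choice = conjugate_vq_search([1.1, 1.9], [1.0, 1.0], codebook_a, codebook_b)
weighted_choice = conjugate_vq_search([0.1, 1.9], [10.0, 0.1], codebook_a, codebook_b)
```

Changing the weights changes which code vectors win, which is how perceptual importance steers where the quantization error is allowed to fall.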
2.3 MPEG-4 AAC Version 2
MPEG-4 AAC Version 2 was finalized in 1999. Compared to MPEG-4 Version 1, Version 2 adds several new tools to the standard: the Error Robustness tools, the Bit Slice Arithmetic Coding (BSAC) tool, and Low Delay AAC (LD-AAC). The BSAC tool provides fine-grain bitrate scalability, and LD-AAC codes general audio signals with low delay. We will introduce these new tools in this section.
2.3.1 Error Robustness
The Error Robustness tools provide improved performance on error-prone transmission channels. The two classes of tools are the Error Resilience (ER) tool and Error Protection (EP) tool.
The ER tool reduces the perceived distortion of the decoded audio signal that is caused by corrupted bits in the bitstream. The following tools are provided to improve the error robustness for several parts of an AAC bitstream frame: Virtual CodeBook (VCB), Reversible Variable Length Coding (RVLC), and Huffman Codeword Reordering (HCR). These tools
allow the application of advanced channel coding techniques, which are adapted to the special needs of the different coding tools.
The EP tool provides Unequal Error Protection (UEP) for MPEG-4 Audio. UEP is an efficient method to improve the error robustness of source coding schemes. It is used by various speech and audio coding systems operating over error-prone channels such as mobile telephone networks or Digital Audio Broadcasting (DAB). The bits of the coded signal representation are first grouped into different classes according to their error sensitivity. Then error protection is individually applied to the different classes, giving better protection to more sensitive bits.
2.3.2 Bit Slice Arithmetic Coding Tool
The Bit-Sliced Arithmetic Coding (BSAC) tool provides efficient small step scalability for the GA coder. This tool is used in combination with the AAC coding tools and replaces the noiseless coding of the quantized spectral data and the scalefactors. The BSAC tool provides scalability in steps of 1 kbps per audio channel, which means 2 kbps steps for a stereo signal. One base layer bitstream and many small enhancement layer bitstreams are used. The base layer contains the general side information, specific side information for the first layer and the audio data of the first layer. The enhancement streams contain only the specific side information and audio data for the corresponding layer.
To obtain fine step scalability, a bit-slicing scheme is applied to the quantized spectral data. First, the quantized spectral coefficients are grouped into frequency bands. Each group contains the quantized spectral coefficients in their binary representation. Then the bits of a group are processed in slices according to their significance. Thus, all of the most significant bits (MSB) of the quantized spectral coefficients in each group are processed first. These bit-slices are then encoded by an arithmetic coding scheme to obtain entropy coding with
coefficients are refined by providing more less significant bits (LSB), and the bandwidth is increased by providing bit-slices of the spectral coefficients in higher frequency bands.
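The bit-slicing step can be sketched as follows. The arithmetic coder is omitted; the sketch only shows how a group of quantized magnitudes is split into slices from the MSB down, and how a decoder that receives only the first slices obtains a coarser version of the coefficients. Function names are hypothetical.

```python
def bit_slices(magnitudes, num_bits):
    """Return the bit-planes of a group, most significant slice first."""
    return [[(m >> b) & 1 for m in magnitudes]
            for b in range(num_bits - 1, -1, -1)]

def reconstruct(slices, num_bits):
    """Rebuild magnitudes from the slices received so far (missing LSBs are zero)."""
    values = [0] * len(slices[0])
    for i, plane in enumerate(slices):
        shift = num_bits - 1 - i
        for j, bit in enumerate(plane):
            values[j] |= bit << shift
    return values

magnitudes = [5, 3, 0, 7]
slices = bit_slices(magnitudes, 3)

full = reconstruct(slices, 3)           # all enhancement layers received
coarse = reconstruct(slices[:2], 3)     # LSB slice dropped
```

Each additional slice that arrives refines every coefficient of the group by one bit, which is what makes the small scalability steps possible.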
2.3.3 Low-Delay Audio Coding
The MPEG-4 General Audio Coder provides very efficient coding of general audio signals at low bitrates. However, it has an algorithmic delay of up to several hundred milliseconds and is thus not well suited for applications requiring low coding delay, such as real-time bi-directional communication. To enable coding of general audio signals with an algorithmic delay not exceeding 20 ms, MPEG-4 Version 2 specifies a Low-Delay Audio Coder, which is derived from MPEG-2/4 Advanced Audio Coding (AAC). It operates at up to 48 kHz sampling rate and uses a frame length of 512 or 480 samples, compared to the 1024 or 960 samples used in standard MPEG-2/4 AAC. The size of the window used in the analysis and synthesis filterbank is also reduced by a factor of 2. No block switching is used, to avoid the “look-ahead” delay due to the block switching decision. To reduce the pre-echo phenomenon in case of transient signals, window shape switching is provided instead. For non-transient parts of the signal a sine window is used, while a so-called low overlap window is used in case of transient signals. Use of the bit reservoir is minimized in the encoder in order to reach the desired target delay. As one extreme case, no bit reservoir is used at all.
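The frame lengths above translate into delay figures by simple arithmetic. The sketch below assumes that, without block-switching look-ahead and without a bit reservoir, the algorithmic delay is one frame of framing delay plus one frame of overlap-add delay; this delay model and the function names are our assumptions.

```python
def frame_duration_ms(frame_length, sampling_rate_hz):
    return 1000.0 * frame_length / sampling_rate_hz

def framing_plus_overlap_delay_ms(frame_length, sampling_rate_hz):
    # Assumed model: one frame to collect the samples, one frame for the 50% overlap.
    return 2 * frame_duration_ms(frame_length, sampling_rate_hz)

low_delay_ms = framing_plus_overlap_delay_ms(480, 48000)
standard_ms = framing_plus_overlap_delay_ms(1024, 48000)
```

Under this model a 480-sample frame at 48 kHz gives 20 ms, consistent with the 20 ms target, while a 1024-sample frame already gives more than 40 ms before look-ahead and bit reservoir delays are added.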
2.4 MPEG-4 AAC Version 3
MPEG-4 AAC Version 3 was finalized in 2003. Like MPEG-4 Version 2, Version 3 adds new tools to increase the coding efficiency. The main one is the spectral band replication (SBR) tool, which extends the audio bandwidth in low-bitrate encoding. The resulting scheme is called High-Efficiency AAC (HE AAC).
The SBR tool improves the performance of low bitrate audio either by increasing the audio bandwidth at a given bitrate or by improving the coding efficiency at a given quality level. When MPEG-4 AAC is combined with the SBR tool, the encoder encodes the lower frequency bands only, and the decoder reconstructs the higher frequency bands based on an analysis of the lower frequency bands. Some guidance information may be encoded in the bitstream at a very low bitrate to ensure that the reconstructed signal is accurate. The reconstruction is efficient for harmonic as well as for noise-like components and allows for proper shaping in the time domain as well as in the frequency domain. As a result, the SBR tool allows very large bandwidth audio coding at low bitrates.
Chapter 3
Introduction to
DSP/FPGA
In our system, we use a Digital Signal Processor/Field Programmable Gate Array (DSP/FPGA) platform to implement the MPEG-4 AAC encoder and decoder. The DSP baseboard is the Quixote, made by Innovative Integration, which houses a Texas Instruments TMS320C6416 DSP and a Xilinx Virtex-II FPGA. In this chapter, we describe the DSP baseboard, the DSP chip, and the FPGA chip. At the end, we introduce the data transmission between the host PC and the DSP/FPGA.
3.1 DSP Baseboard
Quixote combines one TMS320C6416 600 MHz 32-bit fixed-point DSP with a Xilinx Virtex-II XC2V2000/6000 FPGA on the DSP baseboard. This combination provides processing flexibility, efficiency, and high performance. Quixote has 32 MB of SDRAM for use by the DSP and 4 or 8 MB of zero bus turnaround (ZBT) SBSRAM for use by the FPGA. Developers can build complex signal processing systems by integrating these reusable logic designs with their specific application logic.
Fig. 3.1 Block Diagram of Quixote [5]
3.2 DSP Chip
The TMS320C64x fixed-point DSP uses the VelociTI architecture. The VelociTI architecture of the C6000 platform uses advanced VLIW (very long instruction word) to achieve high performance through increased instruction-level parallelism, executing multiple instructions in a single cycle. Parallelism is the key to extremely high performance, taking the DSP well beyond the performance capabilities of traditional superscalar designs. VelociTI is a highly deterministic architecture, with few restrictions on how or when instructions are fetched, executed, or stored. It is this architectural flexibility that
Fig 3.2 Block diagram of TMS320C6x DSP [6]
The TMS320C6416 has internal memory organized as a two-level cache architecture, with 16 KB of L1 data cache, 16 KB of L1 program cache, and 1 MB of L2 cache for data/program allocation. On-chip peripherals include two multi-channel buffered serial ports (McBSPs), two timers, a 16-bit host port interface (HPI), and a 32-bit external memory interface (EMIF). Internal buses include a 32-bit program address bus, a 256-bit program data bus to accommodate eight 32-bit instructions, two 32-bit data address buses, two 64-bit data buses, and two 64-bit store data buses. With a 32-bit address bus, the total memory space is 4 GB, including four external memory spaces: CE0, CE1, CE2, and CE3. We introduce several important parts in this section.
3.2.1 Central Processing Unit (CPU)
Fig. 3.2 shows the CPU. It contains:
Program fetch unit
Instruction dispatch unit, advanced instruction packing
Instruction decode unit
Two data paths, each with four functional units
64 32-bit registers
Control registers
Control logic
Test, emulation, and interrupt logic
The program fetch, instruction dispatch, and instruction decode units can deliver up to eight 32-bit instructions to the functional units every CPU clock cycle. The processing of instructions occurs in each of the two data paths (A and B), each of which contains four functional units (.L, .S, .M, and .D) and 32 32-bit general-purpose registers. Fig. 3.3 shows the comparison of C62x/C67x with C64x CPU.
3.2.2 Data Path
Fig 3.3 TMS320C64x CPU Data Path [6]
There are two general-purpose register files (A and B) in the C6000 data paths. The C64x DSP has twice as many general-purpose registers as the C62x/C67x cores, with 32 32-bit registers in each file (A0-A31 for file A and B0-B31 for file B).
There are eight independent functional units divided into two data paths. Each path has a unit for multiplication operations (.M), a unit for logical and arithmetic operations (.L), a unit for branch, bit manipulation, and arithmetic operations (.S), and a unit for loading/storing and arithmetic operations (.D). The .S and .L units handle arithmetic, logical, and branch instructions. All data transfers make use of the .D units. Two cross-paths (1x and 2x) allow functional units from one data path to access a 32-bit operand from the register file on the opposite side. At most two cross-path source reads can occur per cycle. Fig. 3.4 and 3.5 show the functional units and their operations.
Fig. 3.5 Functional Units and Operations Performed (Cont.) [7]
3.2.3 Pipeline Operation
Pipelining is the key feature that allows parallel instructions to work properly, and it requires careful timing. There are three stages of pipelining: program fetch, decode, and execute, and each stage contains several phases. We describe the function of the three stages and their associated phases in this section.
The fetch stage is composed of four phases:
PG: Program address generate
PS: Program address send
PW: Program address ready wait
PR: Program fetch packet receive
During the PG phase, the program address is generated in the CPU. In the PS phase, the program address is sent to memory. In the PW phase, a memory read occurs. Finally, in the PR phase, the fetch packet is received at the CPU.
The decode stage is composed of two phases:
DP: Instruction dispatch
DC: Instruction decode
During the DP phase, the instructions in execute packet are assigned to the appropriate functional units. In the DC phase, the source registers, destination registers, and associated paths are decoded for the execution of the instructions in the functional units.
The execute stage is composed of five phases:
E1: Single cycle instruction complete
E2: Multiply instruction complete
E3: Store instruction complete
E4: Multiply extensions instruction complete
E5: Load instruction complete
Different types of instructions require different numbers of these phases to complete their execution. These pipeline phases play an important role in understanding the device state at CPU cycle boundaries.
3.2.4 Internal Memory
The C64x has a 32-bit, byte-addressable address space. Internal (on-chip) memory is organized in separate data and program spaces. When external (off-chip) memory is used, these spaces are unified on most devices into a single memory space via the external memory interface (EMIF). The C64x has two 64-bit internal ports to access internal data memory, and a single port to access internal program memory, with an instruction-fetch width of 256 bits.
16 KB program L1 cache
1 MB L2 cache
64 EDMA channels
3 32-bit timers
3.3 FPGA
The Xilinx Virtex-II FPGA is fabricated in a 0.15 µm, 8-layer metal process and offers logic performance in excess of 300 MHz. We introduce the FPGA logic in this section.
The Virtex-II XC2V2000 FPGA contains:
2M system gates
56 x 48 CLB array (row x column)
10752 slices
24192 logic cells
21504 CLB flip-flops
336K maximum distributed RAM bits
The Virtex-II XC2V6000 FPGA contains:
6M system gates
96 x 88 CLB array (row x column)
33792 slices
76032 logic cells
67584 CLB flip-flops
1056K maximum distributed RAM bits
A Configurable Logic Block (CLB) is a block of logic surrounded by routing resources; it provides the functional elements needed to build logic circuits. One CLB contains four slices; each slice contains two Logic Cells (LC); each LC includes a 4-input function generator, carry logic, and a storage element.
Fig 3.6 General Slice Diagram [10]
We use Xilinx ISE 6.1 as the synthesis tool for the Xilinx FPGA. The results reported here are taken from the synthesis report and the P&R report generated by ISE.
3.4 Data Transmission Mechanism
3.4.1 Message Interface
The DSP and the Host PC share a lower-bandwidth communications link for sending commands or side information between the Host PC and the target DSP. Software is provided to build a packet-based message system between the target DSP and the Host PC. A set of sixteen mailboxes in each direction, to and from the Host PC, is shared with the DSP to allow an efficient message mechanism that complements the streaming interface. The maximum data rate is 56 kbps; applications that require higher data rates should use the streaming interface.
3.4.2 Streaming Interface
The primary streaming interface is based on a streaming model in which data is logically an infinite stream between the source and the destination. This model is more efficient because the signaling between the two parties in the transfer can be kept to a minimum and transfers can be buffered for maximum throughput. On the other hand, the streaming model has relatively high latency for a particular piece of data, because a data item may remain in internal buffering until subsequent data accumulates to allow an efficient transfer.
Chapter 4
MPEG-4 AAC Decoder Implementation and Optimization on DSP
In this chapter, we will describe the MPEG-4 AAC implementation and optimization on DSP. We will first describe how to optimize the C/C++ code for DSP architecture, and then discuss how to optimize the functions for DSP execution.
4.1 Profile of AAC on DSP
We make the essential modifications to the MPEG-4 AAC C source code and then implement the modified code on the DSP. We optimize the most computationally complex parts of the modified code first. We profile the code with the TI CCS profiler. The test sequence is about 0.95 seconds long, and the C64x DSP takes 0.18 seconds to decode it. Table 4.1 shows the profile result. We find that the IMDCT and the Huffman decoding require 66% and 21% of the total clock cycles, respectively. Hence, we optimize these two functions first.
Table 4.1 Profile of AAC decoding on C64x DSP
4.2 Optimizing C/C++ Code
In this section, we describe several schemes for optimizing our C/C++ code and reducing the execution time on the C64x DSP. These techniques include fixed-point coding, intrinsic functions, packed data processing, loop unrolling and software pipelining, and the use of linear assembly and assembly.
4.2.1 Fixed-point Coding
The C64x DSP is a fixed-point processor, so it supports only fixed-point processing in hardware. Although the C64x DSP can emulate floating-point processing, it takes many extra clock cycles to do the same job. Table 4.2 shows the measured C64x DSP processing time of the assembly instructions “add” and “mul” for different datatypes. It is the processing time without data transfer between external memory and registers. The “char”, “short”, “int” and “long” are fixed-point datatypes, and the “float” and “double” are floating-point datatypes. We can see clearly that the floating-point datatypes need more than 10 times longer processing time than the fixed-point datatypes.
Table 4.2 Processing time on the C64x DSP with different datatypes
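The cost gap above is the reason for replacing floating-point values with scaled integers. The following is a minimal Python sketch of that idea, not the thesis code: it assumes a scale factor of 4096 (12 fractional bits), and the helper names `to_fixed`, `fixed_mul`, and `to_float` are hypothetical.

```python
SCALE_BITS = 12              # assumed number of fractional bits
SCALE = 1 << SCALE_BITS      # 4096


def to_fixed(value):
    """Convert a real number to a scaled integer, rounding to nearest."""
    return int(round(value * SCALE))


def fixed_mul(a, b):
    """Multiply two scaled integers and rescale the product with a shift."""
    return (a * b) >> SCALE_BITS


def to_float(value):
    """Convert a scaled integer back to a real number."""
    return value / SCALE


a = to_fixed(0.75)           # 0.75 * 4096
b = to_fixed(-0.5)           # -0.5 * 4096
product = fixed_mul(a, b)    # integer multiply followed by a 12-bit shift
print(a, b, product, to_float(product))
```

Only integer multiplies and shifts are used in the arithmetic path, which is the kind of operation a fixed-point processor executes natively.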
4.2.2 Using Intrinsic Functions
TI provides many intrinsic functions to increase the efficiency of code on the C6000 series DSP. The intrinsic functions are code optimized with knowledge of the DSP architecture, and they are recognized only by the TI CCS compiler. So if C/C++ instructions or functions have corresponding intrinsic functions, we can replace them with the intrinsic functions directly. This modification can make our code substantially more efficient. Fig 4.1 shows some of the intrinsic functions for the C6000 series DSP; some intrinsic functions are available only on specific DSPs.
Fig 4.1 Intrinsic functions of the TI C6000 series DSP (Part.) [6]
4.2.3 Packed Data Processing
The C64x DSP is a 32-bit fixed-point processor, which is best suited to 32-bit data operations. Although it can do 8-bit or 16-bit data operations, doing so wastes some processor resources. If we place four 8-bit data or two 16-bit data in a 32-bit word, we can do four or two operations in one clock cycle, which improves the code efficiency substantially. It should also be mentioned that some of the intrinsic functions enhance efficiency in a similar way.
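To make the packing idea concrete, the sketch below emulates in Python how two 16-bit values can share one 32-bit word and be added together in a single word-level operation. It is an illustration only: the function names `pack2`, `unpack2`, and `add2` are hypothetical software stand-ins modeled on the behaviour of packed 16-bit operations, and on the DSP the equivalent work is done by a single instruction.

```python
MASK16 = 0xFFFF


def pack2(hi, lo):
    """Place two 16-bit values into one 32-bit word (hi in the upper half)."""
    return ((hi & MASK16) << 16) | (lo & MASK16)


def to_signed16(value):
    """Interpret the low 16 bits of value as a two's-complement number."""
    value &= MASK16
    return value - 0x10000 if value & 0x8000 else value


def unpack2(word):
    """Return the signed upper and lower 16-bit halves of a 32-bit word."""
    return to_signed16(word >> 16), to_signed16(word)


def add2(a, b):
    """Add upper halves and lower halves independently, with no carry between them."""
    hi = ((a >> 16) + (b >> 16)) & MASK16
    lo = ((a & MASK16) + (b & MASK16)) & MASK16
    return (hi << 16) | lo


x = pack2(100, -3)
y = pack2(-20, 7)
total = unpack2(add2(x, y))   # two 16-bit additions from one packed operation
print(total)
```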
4.2.4 Loop Unrolling and Software pipelining
Software pipelining is a scheme in which the compiler schedules instructions from successive loop iterations so that most of the functional units are utilized in every cycle. With the TI CCS compiler, we can enable or disable software pipelining. If our code contains conditional instructions, the compiler may not be able to determine whether a branch will be taken, and clock cycles may be wasted waiting for the branch decision. Unrolling the loop avoids some of the branching overhead, so software pipelining gives a more efficient result. In addition, we can add compiler constraints that tell the compiler whether a branch will be taken, or that a loop will run at least a certain number of times.
4.2.5 Linear Assembly and Assembly
When we are not satisfied with the efficiency of the assembly code generated by the TI CCS compiler, we can convert some functions into linear assembly or optimize the assembly directly. Linear assembly is the input of the TI CCS assembly optimizer; it does not need to specify the parallel instructions, pipeline latency, register usage, or which functional units are used.
Generally speaking, this scheme is too detailed and too time-consuming in practice. If we consider project development time, we may skip it. We use this scheme only as a last resort, when we have strict constraints on processor performance and no other algorithm to choose from.
4.3 Huffman Decoding
Generally speaking, the architecture of Huffman decoder can be classified into the sequential model and the parallel model [12]. The sequential model reads in one bit in one clock cycle, so it has a fixed input rate. The parallel model outputs one codeword in one clock cycle, so it has a fixed output rate. Fig. 4.2 and 4.3 show the block diagrams of these two models.
Fig 4.3 Parallel model of Huffman decoder [12]
Because the Huffman code is a variable-length code, the codeword length is not fixed for each symbol. Hence the DSP cannot know the number of bits in each codeword in advance. The DSP has to fetch one bit in one clock cycle and compare it with the stored patterns. If there is no matched pattern, the DSP has to fetch the next bit in the next clock cycle and compare with the patterns again. This takes many clock cycles. The Huffman decoder is restricted by the DSP processing structure, so it belongs to the sequential model. We did not find an efficient algorithm for Huffman decoding on the DSP, so we plan to implement the Huffman decoding on the FPGA to enhance the performance of the total system.
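The bit-by-bit procedure described above can be sketched as follows. This is a toy illustration in Python, assuming a hypothetical four-symbol prefix-free codebook rather than the AAC codebooks; the step counter shows that the work grows with the number of bits, not the number of symbols.

```python
# Hypothetical toy codebook: codeword bits -> symbol (prefix-free).
CODEBOOK = {"0": "A", "10": "B", "110": "C", "111": "D"}


def decode_sequential(bits):
    """Decode a bit string by fetching one bit per step and testing for a match."""
    symbols = []
    steps = 0
    current = ""
    for bit in bits:
        current += bit          # fetch the next bit
        steps += 1
        if current in CODEBOOK:  # compare with the stored patterns
            symbols.append(CODEBOOK[current])
            current = ""
    if current:
        raise ValueError("bitstream ended inside a codeword")
    return symbols, steps


symbols, steps = decode_sequential("0101100111")
print(symbols, steps)
```

In this example five symbols cost ten fetch-and-compare steps, which mirrors the fixed input rate of the sequential model.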
4.4.1 N/4-point FFT Algorithm for MDCT
We now discuss the N/4-point FFT algorithm for the MDCT. Since computing $Y_{i,k}$ and $x_{i,n}$ directly requires a very heavy computational load, we want a faster algorithm to replace the original equations. P. Duhamel suggested a fast algorithm that uses an N/4-point complex FFT (Fast Fourier Transform) to compute the MDCT [14]. The key point is the relationship that Duhamel found between the N-point MDCT and the N/4-point complex FFT. We can thus use an efficient FFT algorithm to enhance the performance of the IMDCT, since the same relationship holds between the N-point IMDCT and the N/4-point IFFT.
We will describe the forward operation steps here, and the derivation of this algorithm can be found in Appendix A.
1. Compute $z_n = (x_{i,2n} - x_{i,N/2-1-2n}) + j(x_{i,N-1-2n} + x_{i,N/2+2n})$
2. Multiply the pre-twiddle: $z'_n = z_n W_{4N}^{-(4n+1)}$, $n = 0, 1, \ldots, N/4-1$, where $W_{4N} = \cos(2\pi/4N) - j\sin(2\pi/4N)$
3. Do N/4-point complex FFT: $Z'_k = \mathrm{FFT}\{z'_n\}$
4. Multiply the post-twiddle: $Z_k = ((-1)^{k+1} W_8^{-1} W_N^{k}) Z'_k$, $k = 0, 1, \ldots, N/4-1$
5. The coefficients $Y_{i,2k}$ are found in the imaginary part of $Z_k$, and the coefficients $Y_{i,2k+N/2}$ are found in the real part of $Z_k$. The odd part coefficients can be obtained from $Y_{i,k-1} = -Y_{i,N-k}$.
Fig 4.4 Fast MDCT algorithm
The inverse operation steps are similar.
1. Compute $Z_k = -Y_{i,2k} + jY_{i,N/2-1-2k}$
2. Multiply the pre-twiddle: $Z'_k = ((-1)^{k+1} W_8^{-1} W_N^{k}) Z_k$, $k = 0, 1, \ldots, N/4-1$
3. Do N/4-point complex IFFT: $z_n = \mathrm{IFFT}\{Z'_k\}$
4. Multiply the post-twiddle: $z'_n = z_n W_{4N}^{(4n+1)}$, $n = 0, 1, \ldots, N/4-1$
5. In the range of n from 1 to N/4, the coefficients $x_{i,3N/4-1-2n}$ are found in the imaginary part of $z_n$, and the coefficients $x_{i,N/4+2n}$ are found in the real part of $z_n$. In the range of n from 1 to N/8, the coefficients $x_{i,3N/4+2n}$ are found in the imaginary part of $z_n$, and the coefficients $x_{i,N/4-1-2n}$ are found in the negative of the real part of $z_n$. Finally, in the range of n from N/8 to N/4, the coefficients $x_{i,2n-N/4}$ are found in the negative of the imaginary part of $z_n$, and the coefficients $x_{i,5N/4-1-2n}$ are found in the real part of $z_n$.
Fig 4.5 Fast IMDCT algorithm
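The idea that an N-point IMDCT needs only one N/4-point complex transform can be checked numerically with the Python sketch below. It is an assumption-laden illustration, not the procedure of Fig. 4.5: it uses the textbook DCT-IV formulation, so its twiddle factors and index mapping may differ from the steps listed above; the output is unscaled; the naive `dft` helper stands in for a real FFT routine; and all function names are hypothetical.

```python
import cmath
import math


def dft(seq):
    """Naive O(n^2) forward DFT, a stand-in for an N/4-point FFT routine."""
    n = len(seq)
    return [sum(seq[m] * cmath.exp(-2j * math.pi * m * k / n) for m in range(n))
            for k in range(n)]


def dct4_via_fft(u):
    """DCT-IV of even length M computed with one M/2-point complex transform."""
    M = len(u)
    half = M // 2
    # Pack even samples (real) and reversed odd samples (imag), then pre-twiddle.
    t = [complex(u[2 * n], u[M - 1 - 2 * n])
         * cmath.exp(-2j * math.pi * (4 * n + 1) / (8 * M))
         for n in range(half)]
    T = dft(t)
    c = [0.0] * M
    for p in range(half):
        s = T[p] * cmath.exp(-2j * math.pi * p / (2 * M))  # post-twiddle
        c[2 * p] = s.real
        c[M - 1 - 2 * p] = -s.imag
    return c


def imdct_fast(Y):
    """Unscaled IMDCT: N/2 coefficients in, N time-aliased samples out."""
    M = len(Y)
    N = 2 * M
    u = dct4_via_fft(Y)
    x = []
    for n in range(N):
        j = n + N // 4          # rotate by N/4, then unfold by symmetry
        if j < M:
            x.append(u[j])
        elif j < N:
            x.append(-u[N - 1 - j])
        else:
            x.append(-u[j - N])
    return x


def imdct_direct(Y):
    """Reference O(N^2) evaluation of the same unscaled IMDCT sum."""
    M = len(Y)
    N = 2 * M
    n0 = N / 4 + 0.5
    return [sum(Y[k] * math.cos(2 * math.pi / N * (n + n0) * (k + 0.5))
                for k in range(M)) for n in range(N)]


coeffs = [0.5, -1.0, 2.0, 0.25, -0.75, 1.5, 0.0, 3.0]   # N = 16
error = max(abs(a - b) for a, b in zip(imdct_fast(coeffs), imdct_direct(coeffs)))
print(error)
```

With N = 16 the complex transform has only 4 points, and the fast and direct results agree to within floating-point rounding.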
4.4.2 Radix-2^3 FFT
Many FFT algorithms have been derived in recent years [18]. The radix-2 FFT has the best accuracy but requires the most computations, and the split-radix FFT has fewer computations but requires an irregular butterfly architecture [15]. S. He suggested an FFT algorithm called radix-2^2 in 1996. It combines the radix-2/4 FFT and the radix-2 FFT in a processing element (PE), so it has a more regular butterfly architecture than the split-radix FFT and needs fewer computations than the radix-2 FFT. However, the radix-2^2 FFT is suited only to lengths that are powers of 4 (4^n points), and our IFFT requirement is 512 points for the long window and 64 points for the short window. So we use the radix-2^3 FFT, which is derived from the radix-2^2 FFT and is suited to lengths that are powers of 8 (8^n points).
Fig. 4.6 shows the butterflies of the 8-point radix-2 FFT and Fig. 4.7 shows the butterflies of a radix-2^3 PE. We can see in the data flow graphs that the number of twiddle factor multiplications is decreased. Fig. 4.8 shows the combined split-radix FFT in a radix-2^3 PE. We can see that the butterfly architecture is more regular than that of the split-radix FFT. Table 4.3 shows the computational complexity of the radix-2 and radix-2^3 FFT algorithms.
Fig. 4.6 Butterflies for 8-point radix-2 FFT
Fig. 4.7 Butterflies for a radix-2^3 FFT PE
Fig. 4.8 Simplified data flow graph for 8-point radix-2^3 FFT
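As a baseline for the comparison above, the following Python sketch shows a plain radix-2 decimation-in-frequency FFT, in which every butterfly stage applies its own twiddle factors and the result is reordered by bit reversal. It illustrates only the radix-2 reference structure, not the radix-2^3 PE of Fig. 4.7, and the function names are hypothetical.

```python
import cmath


def bit_reverse(index, bits):
    """Reverse the lowest `bits` bits of index."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result


def fft_radix2_dif(data):
    """Radix-2 decimation-in-frequency FFT; the length must be a power of two."""
    x = list(data)
    n = len(x)
    stages = n.bit_length() - 1
    span = n // 2
    while span >= 1:
        for start in range(0, n, 2 * span):
            for k in range(span):
                w = cmath.exp(-2j * cmath.pi * k / (2 * span))  # stage twiddle
                a = x[start + k]
                b = x[start + k + span]
                x[start + k] = a + b               # butterfly sum
                x[start + k + span] = (a - b) * w  # butterfly difference, twiddled
        span //= 2
    # The in-place result is in bit-reversed order; put it back in sequence.
    return [x[bit_reverse(i, stages)] for i in range(n)]


samples = [1, 2, 3, 4, 0, -1, -2, -3]
spectrum = fft_radix2_dif(samples)
print([round(abs(v), 3) for v in spectrum])
```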
Table 4.3 Computational complexity of radix-2 and radix-2^3 FFT algorithms
4.4.3 Implementation of IMDCT with Radix-2 IFFT
We first code the 512-point IMDCT with the radix-2 IFFT architecture in the double datatype to ensure that the function is correct for a reasonable input data range. After the function is verified, we change the datatype from floating-point to fixed-point and measure the precision loss as SNR (signal-to-noise ratio). In the fixed-point version, we multiply all twiddle factors by 4096.
Table 4.4 DSP implementation result of different datatypes
Table 4.5 SNR of IMDCT of different datatypes
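The precision loss from scaling the twiddle factors by 4096 can be estimated with a small experiment such as the Python sketch below. It is only an illustration of how such an SNR figure can be computed: the input samples and angles are hypothetical test data, a single twiddle multiplication is measured rather than the full IMDCT, and the helper name `snr_db` is an assumption.

```python
import math

SCALE = 4096   # twiddle scaling factor stated in the text (12 bits)


def snr_db(reference, approx):
    """Signal-to-noise ratio in dB between a reference and an approximation."""
    signal = sum(r * r for r in reference)
    noise = sum((r - a) ** 2 for r, a in zip(reference, approx))
    return float("inf") if noise == 0 else 10.0 * math.log10(signal / noise)


# Hypothetical test data: integer samples multiplied by 64 twiddle cosines.
samples = [1000 + 37 * n for n in range(64)]
angles = [2.0 * math.pi * n / 64 for n in range(64)]

exact = [s * math.cos(a) for s, a in zip(samples, angles)]
# Fixed-point path: integer twiddle, integer multiply, shift back by 12 bits.
fixed = [(s * int(round(math.cos(a) * SCALE))) >> 12
         for s, a in zip(samples, angles)]

print(round(snr_db(exact, fixed), 1))
```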
4.4.4 Implementation of IMDCT with Radix-2^3 IFFT
Then we code the 512-point IMDCT with the radix-2^3 IFFT architecture in the double datatype to ensure that the function is correct for a reasonable input data range. Then we change the datatype from floating-point to fixed-point. The data precision loss is the same as with the radix-2 FFT. In the fixed-point version, we multiply all twiddle factors by 4096, and multiply the $\sqrt{2}/2$ factor in the radix-2^3 PE by 256. The floating-point version is slower than the radix-2 IFFT version, which may be due to the different coding styles of the two implementations.
Table 4.6 DSP implementation result of different datatypes
Table 4.7 SNR of IMDCT of different datatypes
4.4.5 Modifying the Data Calculation Order
We want the data in the registers to be used twice after they are fetched from memory, so we modified the order of the data calculations in each stage of the C/C++ code. The original calculation order is from top to bottom in the data flow graph: we calculate the two outputs of the first butterfly, and then the two outputs of the next butterfly. Fig. 4.9 shows the calculation order of the modified version; the number in parentheses is the calculation order. In this way, the compiler generates assembly code that uses the data more efficiently.
Fig. 4.9 Comparison of the old (left) and new (right) data calculation order
4.4.6 Using Intrinsic Functions
Since we use the “short” datatype to represent the data in the IMDCT, we can put two 16-bit data in a 32-bit register to improve the performance through packed data processing. At first, we shifted the first 16-bit datum and added the second 16-bit datum to form a 32-bit word, used one intrinsic function to process the data, and then separated the result into two 16-bit data. However, this modification is slower than the original version because the data transfers take too many clock cycles.
So we modified the whole IFFT architecture. We put the real part into the 16 MSBs (most significant bits) of the 32-bit word and the imaginary part into the 16 LSBs (least significant bits), and then use intrinsic functions for all data processing in the IFFT. Fig. 4.10 shows the intrinsic functions we use. First, we use _pack2 to put two 16-bit data into a 32-bit word. Then we use _add2 and _sub2 to do the 16-bit additions and subtractions. When the data need to be multiplied by a twiddle factor, we use _dotp2 or _dotpn2 to calculate the sum of products or the difference of products. At each stage, we use _shr2 to divide the data by 2. Finally, we use _bitr to do the bit reversal and put the output data in sequence. Table 4.9 shows the result of this modification.
Fig. 4.10 Intrinsic functions we used [6]
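The twiddle multiplication on packed complex data can be sketched in Python as below. This is a software emulation under stated assumptions, not the DSP code: the functions `dotp2` and `dotpn2` are hypothetical stand-ins that mimic a sum and a difference of 16-bit products, the twiddle factor is assumed to be scaled by 4096, and the real part is kept in the upper half of the word as described above.

```python
def to_signed16(value):
    """Interpret the low 16 bits of value as a two's-complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def pack2(hi, lo):
    """Pack two 16-bit values into one 32-bit word (hi in the upper half)."""
    return ((hi & 0xFFFF) << 16) | (lo & 0xFFFF)


def halves(word):
    """Return the signed upper and lower halves of a 32-bit word."""
    return to_signed16(word >> 16), to_signed16(word)


def dotp2(a, b):
    """Sum of products of the signed halves: a_hi*b_hi + a_lo*b_lo."""
    a_hi, a_lo = halves(a)
    b_hi, b_lo = halves(b)
    return a_hi * b_hi + a_lo * b_lo


def dotpn2(a, b):
    """Difference of products of the signed halves: a_hi*b_hi - a_lo*b_lo."""
    a_hi, a_lo = halves(a)
    b_hi, b_lo = halves(b)
    return a_hi * b_hi - a_lo * b_lo


def complex_mul_q12(data, cos_q12, sin_q12):
    """Multiply packed (re, im) by (cos + j*sin), twiddle scaled by 4096."""
    real = dotpn2(data, pack2(cos_q12, sin_q12)) >> 12   # re*cos - im*sin
    imag = dotp2(data, pack2(sin_q12, cos_q12)) >> 12    # re*sin + im*cos
    return pack2(real, imag)


z = pack2(1000, -500)                            # 1000 - 500j
rotated = halves(complex_mul_q12(z, 0, 4096))    # multiply by j
print(rotated)
```

Each output component needs one packed dot product and one shift, which is why keeping the real and imaginary parts in the same word avoids separate 16-bit loads and stores.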
in one second on C64x DSP. It is about 530 times faster than the original version.
Table 4.10 DSP implementation results of IMDCT
Fig. 4.11 TI IFFT library [7]
Then we compare the modified IMDCT with the IMDCT using the TI IFFT library, shown in Fig. 4.11. Table 4.11 shows the comparison. The performance of the modified IMDCT reaches about 81% of that of the IMDCT with the TI IFFT library.
Table 4.11 Comparison of the modified IMDCT and the IMDCT with TI IFFT library
4.5 Implementation on DSP
We have implemented and optimized the MPEG-4 AAC decoder on the TI C64x DSP. The optimized result is shown in Table 4.12. Using the ODG (objective difference grade) defined by ITU-R BS.1387 PEAQ (perceptual evaluation of audio quality), we test some sequences on the modified MPEG-4 AAC decoder. The first test sequence is “guitar”; it has many sound variations and is more complex. The second test sequence is “eddie_rabbitt”; it is pop music with a human voice. The test results are shown in Tables 4.13 and 4.14. The notation (a) is the original floating-point version, and (b) is the modified integer version. The quality seems acceptable for data rates from 32 kbps to 96 kbps. Finally, the overall speed is 2.73 times faster than the original architecture. Note that the computation of the IMDCT part is 1/14 of the original, and the result is shown in Table 4.14.
Table 4.12 Comparison of the original and the optimized performance
Table 4.13 The ODG of test sequence “guitar”
Table 4.14 The ODG of test sequence “eddie_rabbitt”
Chapter 5
MPEG-4 AAC Decoder Implementation and Optimization on DSP/FPGA
In the last chapter, we described the implementation and optimization of the MPEG-4 AAC decoder on the DSP. In this chapter, we move some of the MPEG-4 AAC tools to the FPGA to enhance the performance. According to the profile statistics, the Huffman decoding and the IMDCT are the heaviest tools for DSP processing, so we try to implement these tools on the FPGA.
5.1 Huffman Decoding
In this section, we describe the implementation and optimization of the Huffman decoding on FPGA. We will implement two different architectures of Huffman decoder and compare the results.
5.1.1 Integration Consideration
In the MPEG-4 AAC decoder, the Huffman decoder receives a series of bits, ranging from 1 bit to 19 bits, from the input bitstream. It uses these bits to search for the matched pattern in the code index table, and then returns a code index and a length. The code index ranges from 0 to 120, and we use this value to find the codeword in the codeword table. Fig. 5.1 shows the flow diagram of the MPEG-4 AAC Huffman decoding process.
Fig. 5.1 Flow diagram of MPEG-4 AAC Huffman decoding
As we can see, the length of a symbol in the bitstream varies from 1 bit to 19 bits. The code index in the table ranges from 0 to 120, and its length is fixed at 7 bits. The DSP is not well suited to variable-length data processing, because it needs many extra clock cycles to find the correct length. Hence, we partition the MPEG-4 AAC Huffman decoder between the DSP and the FPGA. The patterns in the code index table have variable length, so we put this table on the FPGA; the patterns in the codeword table have fixed length, so we put this table on the DSP. Fig. 5.2 shows the scheme of the DSP/FPGA integrated Huffman decoding.
Fig. 5.2 Block diagram of DSP/FPGA integrated Huffman decoding
5.1.2 Fixed-output-rate Architecture
We put the code index table on the FPGA, and we implement the fixed-output-rate Huffman decoder architecture there. To enhance the Huffman decoding performance substantially, we have to implement the parallel model on the FPGA. This architecture outputs one code index per clock cycle continuously.
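The behaviour of the parallel model can be imitated in software with a lookup over a fixed-width bit window, as in the Python sketch below. It is a toy illustration under stated assumptions: the four-entry code table is hypothetical, and a dictionary lookup stands in for the parallel pattern comparison done in hardware (with 19-bit codewords a full lookup table would be very large, so this is not a proposal for the actual design).

```python
# Hypothetical toy code index table: (codeword bits, code index).
CODE_TABLE = [("0", 0), ("10", 1), ("110", 2), ("111", 3)]
MAX_LEN = max(len(bits) for bits, _ in CODE_TABLE)


def build_lookup():
    """Expand each codeword to MAX_LEN bits so one lookup gives index and length."""
    lookup = {}
    for bits, index in CODE_TABLE:
        free = MAX_LEN - len(bits)
        for tail in range(1 << free):
            suffix = format(tail, "0{}b".format(free)) if free else ""
            lookup[bits + suffix] = (index, len(bits))
    return lookup


LOOKUP = build_lookup()


def decode_parallel(bitstream):
    """Produce one code index per lookup, whatever the codeword length."""
    indices = []
    lookups = 0
    pos = 0
    while pos < len(bitstream):
        window = bitstream[pos:pos + MAX_LEN].ljust(MAX_LEN, "0")
        index, length = LOOKUP[window]
        indices.append(index)
        pos += length           # advance by the matched codeword length
        lookups += 1
    return indices, lookups


indices, lookups = decode_parallel("0101100111")
print(indices, lookups)
```

Here ten bits yield five code indices in five lookups, which corresponds to the fixed output rate of one code index per cycle.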
We designed the code index table with the necessary control signals; Fig. 5.3 shows the block diagram. Because the code index ranges from 0 to 120, we use 7 bits to represent it. To let the DSP fetch the code index easily, we put one “0” bit between two adjacent code indices in the output buffer. Fig. 5.4 shows the output buffer diagram. In this way, the DSP can fetch each code index as a “char” datatype.
Fig. 5.3 Block diagram of fixed-output-rate architecture
Fig. 5.4 Output Buffer of code index table
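The effect of the padding bit can be illustrated with the short Python sketch below: each 7-bit code index plus one leading 0 bit occupies exactly one byte, so the reader can index the buffer byte by byte. The function name `pack_code_indices` is hypothetical, and the actual width and ordering of the hardware output buffer are not assumed here.

```python
def pack_code_indices(indices):
    """Store each 7-bit code index in its own byte with a leading 0 bit."""
    for index in indices:
        if not 0 <= index <= 120:
            raise ValueError("code index out of range")
    return bytes(index & 0x7F for index in indices)


buffer = pack_code_indices([5, 120, 0, 64])
bit_view = [format(byte, "08b") for byte in buffer]   # shows the leading 0 bit
recovered = list(buffer)                               # byte-wise read, like a char array
print(bit_view, recovered)
```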
The architecture needs some control signals between the DSP and the FPGA. When the DSP sends the “input_valid” signal to the FPGA, it means that “input_data” is valid. When the FPGA receives the “input_valid” signal and is not busy, it responds with the “input_res” signal to the DSP, which means the FPGA has received the input data successfully. When the FPGA is busy, it does not send the “input_res” signal, which means the FPGA has not received the input data.
Fig 5.5 Waveform of the fixed-output-rate architecture
5.1.3 Fixed-output-rate Architecture Implementation Result
Fig. 5.6 and Fig. 5.7 show the Xilinx ISE 6.1 synthesis and P&R (place & route) reports. The P&R report shows that the clock period can reach 5.800 ns (172.4 MHz). One clock cycle of latency is needed for the input register, which means that we can retrieve about 156.7 M code indices in one second. We use a test sequence of 38 frames, which contains 13188 code indices. The comparison of the DSP implementation and the FPGA implementation is shown in Table 5.1.
Fig 5.6 Synthesis report of the fixed-output-rate architecture
Timing Summary:
Speed Grade: -6
Minimum period: 9.181ns (Maximum Frequency: 108.918MHz)
Minimum input arrival time before clock: 4.812ns
Maximum output required time after clock: 4.795ns
Maximum combinational path delay: No path found
Device utilization summary:
Selected Device : 2v2000ff896-6
Number of Slices: 820 out of 10752 7%
Number of Slice Flip Flops: 379 out of 21504 1%
Number of 4 input LUTs: 1558 out of 21504 7%
Number of bonded IOBs: 284 out of 624 45%
Number of GCLKs: 1 out of 16 6%
Fig 5.7 P&R report of the fixed-output-rate architecture
Table 5.1 The performance comparison of the DSP and FPGA implementations
5.1.4 Variable-output-rate Architecture
The fixed-output-rate Huffman decoder is limited by the speed of searching for the matched pattern [12]. We can further split the code index table into several small tables to reduce the comparison operations in one clock cycle. In this way, we shorten the time needed to process short symbols, while more than one clock cycle is needed to process long symbols.
Timing Summary:
Speed Grade: -6
Device utilization summary:
Number of External IOBs 285 out of 624 45%
Number of LOCed External IOBs 0 out of 285 0%
Number of SLICEs 830 out of 10752 7%
Number of BUFGMUXs 1 out of 16 6%
Fig. 5.8 Block diagram of the variable-output-rate architecture
Fig. 5.9 shows the waveform; the external control signals between the DSP and the FPGA are the same as for the fixed-output-rate architecture. The difference between the two architectures is that the internal control of the variable-output-rate architecture is more complex, and the variable-output-rate architecture may need more clock cycles to produce a result.
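The trade-off of the variable-output-rate idea can be illustrated with the Python sketch below, which searches a sequence of small tables grouped by codeword length and charges one cycle per table searched. It is a hypothetical model only: the toy code, the grouping into three stages, and the one-cycle-per-stage cost are assumptions for illustration and are not taken from the design in Fig. 5.8.

```python
# Hypothetical split of a toy prefix-free code into small tables by codeword length.
STAGES = [
    {"0": 0, "10": 1},           # stage 1: shortest codewords
    {"110": 2, "1110": 3},       # stage 2: medium codewords
    {"11110": 4, "11111": 5},    # stage 3: longest codewords
]


def decode_staged(bitstream):
    """Search the small tables in order; each table searched costs one cycle."""
    indices = []
    cycles = 0
    pos = 0
    while pos < len(bitstream):
        for table in STAGES:
            cycles += 1
            match = next((bits for bits in table
                          if bitstream.startswith(bits, pos)), None)
            if match is not None:
                indices.append(table[match])
                pos += len(match)
                break
        else:
            raise ValueError("no codeword matches at bit {}".format(pos))
    return indices, cycles


indices_staged, cycles = decode_staged("011101011111")
print(indices_staged, cycles)
```

Short codewords finish in one cycle while the longest ones take three, so the output rate varies with the symbol statistics, in exchange for fewer comparisons per cycle.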