Chapter 2 MPEG-2/4 Advanced Audio Coding

2.1 MPEG-2 AAC

2.1.5 Prediction

The prediction tool is used to improve redundancy reduction in the spectral coefficients. If the spectral coefficients are stationary between adjacent frames, the prediction tool will estimate

For each channel, there is one predictor for each spectral component produced by the spectral decomposition of the filterbank. The predictor exploits the autocorrelation between the spectral component values of consecutive frames. The predictor coefficients are calculated from preceding quantized spectral components in the encoder, so the decoder can recover the spectral component without any transmitted predictor coefficients. A second-order backward-adaptive lattice-structure predictor operates on the spectral component values of the two preceding frames. The predictor parameters are adapted to the current signal statistics on a frame-by-frame basis, using an LMS-based adaptation algorithm.

If prediction is activated, the quantizer is fed with a prediction error instead of the original spectral component, resulting in a higher coding efficiency.
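The backward-adaptive idea can be illustrated with a minimal sketch. Note that this is not the lattice predictor of the standard: we substitute a simpler second-order transversal predictor with normalized LMS adaptation, and the names `BackwardPredictor`, `encode`, `decode`, the step size `mu`, and the quantizer step `step` are all hypothetical. The point it shows is that encoder and decoder adapt from reconstructed values only, so no predictor coefficients need to be transmitted.

```python
class BackwardPredictor:
    """Second-order NLMS predictor driven only by reconstructed values.

    Simplified stand-in for the lattice predictor (an assumption of this sketch).
    """

    def __init__(self, mu=0.5, eps=1e-9):
        self.w = [0.0, 0.0]      # predictor coefficients, adapted over time
        self.hist = [0.0, 0.0]   # reconstructed values of the two preceding frames
        self.mu = mu
        self.eps = eps

    def predict(self):
        return self.w[0] * self.hist[0] + self.w[1] * self.hist[1]

    def update(self, reconstructed):
        # LMS-style update using the reconstructed value, which both the
        # encoder and the decoder know.
        err = reconstructed - self.predict()
        norm = self.hist[0] ** 2 + self.hist[1] ** 2 + self.eps
        self.w = [w + self.mu * err * h / norm for w, h in zip(self.w, self.hist)]
        self.hist = [reconstructed, self.hist[0]]


def encode(values, step=0.1):
    """Quantize the prediction error of one spectral component over frames."""
    p = BackwardPredictor()
    indices = []
    for x in values:
        pred = p.predict()
        q = round((x - pred) / step)   # quantized prediction error
        indices.append(q)
        p.update(pred + q * step)      # adapt on the reconstructed value
    return indices


def decode(indices, step=0.1):
    """Mirror the encoder: same predictor state, no side information."""
    p = BackwardPredictor()
    out = []
    for q in indices:
        rec = p.predict() + q * step
        out.append(rec)
        p.update(rec)
    return out
```

Because both sides run the same update on the same reconstructed values, the decoder output differs from the input by at most half a quantizer step.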

Fig. 2.6 Prediction tool for one scalefactor band [2]

2.1.6 Middle/Side Tool

There are two choices for coding each pair of channels in a multi-channel signal: the original left/right (L/R) signals or the transformed middle/side (M/S) signals. If the left and right signals are highly correlated, coding their sum and difference requires fewer bits than coding the two signals separately. Hence, in the encoder, the M/S tool operates when the correlation between the left and right signals is higher than a threshold. The M/S tool transforms the L/R signals into M/S signals, where the middle signal equals the sum of the left and right signals, and the side signal equals their difference.
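The transform can be sketched as follows. The factor 1/2 is an assumption of this sketch, chosen so that the inverse transform is a plain sum and difference; the function names `ms_encode` and `ms_decode` are hypothetical.

```python
def ms_encode(left, right):
    """Transform L/R spectral coefficients into M/S (scaled sum and difference)."""
    mid = [(l + r) / 2 for l, r in zip(left, right)]
    side = [(l - r) / 2 for l, r in zip(left, right)]
    return mid, side


def ms_decode(mid, side):
    """Recover L/R from M/S."""
    left = [m + s for m, s in zip(mid, side)]
    right = [m - s for m, s in zip(mid, side)]
    return left, right
```

When the two channels are nearly identical, the side signal is close to zero and costs very few bits to code.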

2.1.7 Scalefactors

The human hearing system can be modeled as several overlapping bandpass filters. The higher the center frequency of a filter, the larger its bandwidth. These bandpass filters are called critical bands. The scalefactors tool divides the spectral coefficients into groups, called scalefactor bands, to imitate critical bands. Each scalefactor band has a scalefactor, and all the spectral coefficients in the scalefactor band are divided by this scalefactor. By adjusting the scalefactors, the quantization noise can be shaped to meet the bit-rate and distortion constraints.
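As a minimal sketch of this division, assume the band layout is given as a list of offsets and each band has a linear gain value. The names `apply_scalefactors` and `band_offsets` and the example values are hypothetical, and the sketch does not model how the standard encodes scalefactors.

```python
def apply_scalefactors(coeffs, band_offsets, scalefactors):
    """Divide every coefficient by the scalefactor of its scalefactor band.

    band_offsets[b] .. band_offsets[b + 1] delimits band b.
    """
    out = []
    for b, sf in enumerate(scalefactors):
        start, end = band_offsets[b], band_offsets[b + 1]
        out.extend(c / sf for c in coeffs[start:end])
    return out


# Two bands: a narrow low-frequency band and a wider high-frequency band.
scaled = apply_scalefactors([8.0, 4.0, 6.0, 3.0, 9.0, 12.0], [0, 2, 6], [2.0, 3.0])
```

A larger scalefactor in a band makes the subsequent quantization of that band coarser relative to the original amplitude.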

2.1.8 Quantization

While all previous tools perform some kind of preprocessing of the audio data, the real bit-rate reduction is achieved by the quantization tool. On the one hand, we want to quantize the spectral coefficients in such a way that the quantization noise stays under the masking threshold; on the other hand, we want to limit the number of bits required to code these quantized spectral coefficients.

There is no standardized strategy for achieving optimum quantization. One important issue is the tuning between the psychoacoustic model and the quantization process. The main advantage of a nonuniform quantizer is its built-in noise shaping, which depends on the spectral coefficient amplitude. The increase of the signal-to-noise ratio with rising signal energy is much lower than in a linear quantizer.
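The built-in noise shaping can be seen in a small power-law quantizer sketch. The exponent 0.75 is an assumption that is not stated in the text above, rounding to the nearest integer is a simplification, and the names `quantize` and `dequantize` are hypothetical.

```python
import math


def quantize(x, exponent=0.75):
    """Nonuniform quantizer: compress the magnitude, then round to an integer."""
    return int(math.copysign(round(abs(x) ** exponent), x))


def dequantize(q, exponent=0.75):
    """Inverse mapping: expand the integer index back to the signal domain."""
    return math.copysign(abs(q) ** (1.0 / exponent), q)


# The reconstruction levels move further apart as the amplitude grows, so
# large coefficients receive coarser steps (more absolute noise).
small_step = dequantize(2) - dequantize(1)
large_step = dequantize(11) - dequantize(10)
```

Here `large_step` is greater than `small_step`, which is why the signal-to-noise ratio grows more slowly with signal energy than in a linear quantizer.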

2.1.9 Noiseless Coding

The noiseless coding is done via clipping spectral coefficients, using maximum number of

their location. Since the side information for carrying the clipped spectral coefficients costs some bits, this compression is applied only if it results in a net saving of bits.

Huffman coding is used to represent n-tuples of quantized spectral coefficients, and 12 codebooks are available. The spectral coefficients within an n-tuple are ordered from low frequency to high frequency, and the n-tuple size can be two or four spectral coefficients. Each codebook specifies the maximum absolute value that it can represent and the n-tuple size.

Two codebooks are available for each maximum absolute value, and represent two distinct probability distributions. Most codebooks represent unsigned values in order to save codebook storage. Sign bits of nonzero coefficients are appended to the codeword.
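The handling of unsigned codebooks can be sketched as follows: the codebook is indexed with the magnitudes only, and one sign bit per nonzero coefficient follows the codeword. The convention that a sign bit of 1 means negative is an assumption of this sketch, and the names `split_signs` and `merge_signs` are hypothetical.

```python
def split_signs(ntuple):
    """Separate an n-tuple into magnitudes and sign bits for nonzero entries."""
    magnitudes = tuple(abs(v) for v in ntuple)
    sign_bits = [1 if v < 0 else 0 for v in ntuple if v != 0]
    return magnitudes, sign_bits


def merge_signs(magnitudes, sign_bits):
    """Reapply the sign bits; zero coefficients do not consume a bit."""
    bits = iter(sign_bits)
    return tuple(-m if m != 0 and next(bits) else m for m in magnitudes)


magnitudes, sign_bits = split_signs((3, 0, -1, 2))
```

The magnitudes `(3, 0, 1, 2)` select the codeword, and only three sign bits are appended because one coefficient is zero.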

2.2 MPEG-4 AAC Version 1

MPEG-4 AAC Version 1 was approved in 1998 and published in 1999. It has all the tools of MPEG-2 AAC and includes additional tools such as the long term prediction (LTP) tool, the perceptual noise substitution (PNS) tool, and the transform-domain weighted interleaved vector quantization (TwinVQ) tool. The TwinVQ tool is an alternative to the MPEG-4 AAC quantization tool and noiseless coding tool. This new scheme, which combines AAC with TwinVQ, is officially called "General Audio (GA)." We will introduce these new tools in this section.

Fig. 2.7 Block diagram of MPEG-4 GA encoder [2]

2.2.1 Long Term Prediction

The long term prediction (LTP) tool is used to exploit the redundancy in a speech signal that is related to the signal periodicity, as expressed by the speech pitch. Speech sounds are produced in a periodic way, so the pitch phenomenon is obvious in speech coding. Such a phenomenon may exist in audio signals as well.

Fig. 2.8 LTP in the MPEG-4 General Audio encoder [2]

The LTP tool performs prediction on adjacent frames, while the MPEG-2 AAC prediction tool performs prediction on neighboring frequency components. The spectral coefficients are transformed back to the time-domain representation by the inverse filterbank and the associated inverse TNS tool operations. By comparing the locally decoded signal to the input signal, the optimum pitch lag and gain factor can be determined. The difference between the predicted signal and the original signal is then calculated and compared with the original signal. One of them is selected to be coded on a scalefactor band basis, depending on which alternative is more favorable.
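The search for the pitch lag and gain can be sketched in the time domain as below. This is a simplified illustration under stated assumptions: the search criterion (maximizing the normalized cross-correlation), the restriction that the lag is at least one frame long, and the names `find_ltp_params` and `ltp_residual` are all hypothetical and not taken from the standard.

```python
def find_ltp_params(history, frame, min_lag, max_lag):
    """Return the (lag, gain) that best predicts `frame` from decoded `history`."""
    n = len(frame)
    best_lag, best_gain, best_score = None, 0.0, -1.0
    for lag in range(max(min_lag, n), max_lag + 1):
        start = len(history) - lag
        if start < 0:
            break
        candidate = history[start:start + n]
        energy = sum(c * c for c in candidate)
        if energy == 0.0:
            continue
        corr = sum(x * c for x, c in zip(frame, candidate))
        score = corr * corr / energy   # reduction of the error energy
        if score > best_score:
            best_lag, best_gain, best_score = lag, corr / energy, score
    return best_lag, best_gain


def ltp_residual(history, frame, lag, gain):
    """Difference between the input frame and its long term prediction."""
    start = len(history) - lag
    segment = history[start:start + len(frame)]
    return [x - gain * c for x, c in zip(frame, segment)]


# A signal with a period of 8 samples; the new frame is a scaled repetition.
pattern = [0.0, 4.0, 6.0, 1.0, -2.0, -5.0, -3.0, 2.0]
history = pattern * 4
frame = [0.5 * v for v in pattern[:4]]
lag, gain = find_ltp_params(history, frame, 4, 16)
residual = ltp_residual(history, frame, lag, gain)
```

For this periodic input the search finds the period as the lag, and the residual that remains to be coded is zero.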

The LTP tool provides considerable coding gain for stationary harmonic signals as well as for some non-harmonic tonal signals. In addition, the LTP tool has much lower computational complexity than the original prediction tool.

2.2.2 Perceptual Noise Substitution

The perceptual noise substitution (PNS) tool gives a very compact representation of noise-like signals. In this way, the PNS tool increases the compression efficiency for some types of input signals.

In the encoder, the noise-like component of the input signal is detected on a scalefactor band basis. If the spectral coefficients in a scalefactor band are detected as noise-like, they are not quantized and entropy coded as usual. Instead, they are omitted from the quantization and entropy coding process, and only a noise substitution flag and their total power are coded and transmitted.

In the decoder, a pseudo noise signal with the desired total power is inserted for the substituted spectral coefficients. This technique results in high compression efficiency, since only a flag and the power information are coded and transmitted rather than all the spectral coefficients in the scalefactor band.
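The decoder side can be sketched as follows. We assume here that the total power is the sum of squared coefficients in the band and that a uniform pseudo-random generator is acceptable; the function name `pns_decode` and the `seed` parameter are hypothetical.

```python
import random


def pns_decode(band_width, total_power, seed=0):
    """Fill a scalefactor band with pseudo noise of the requested total power."""
    rng = random.Random(seed)
    noise = [rng.uniform(-1.0, 1.0) for _ in range(band_width)]
    energy = sum(v * v for v in noise)
    scale = (total_power / energy) ** 0.5   # rescale to the transmitted power
    return [v * scale for v in noise]


band = pns_decode(band_width=16, total_power=4.0)
```

Only the band's power is preserved; the individual coefficient values differ from the original ones, which is acceptable for noise-like content.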

2.2.3 TwinVQ

The TwinVQ tool is an alternative quantization/coding kernel. It is designed to provide good coding efficiency at very low bit-rates (16 kbps, or even as low as 6 kbps). The TwinVQ kernel first normalizes the spectral coefficients to a specified range, and then quantizes them by means of a weighted vector quantization process.

The normalization process is carried out by several schemes such as linear predictive coding (LPC) spectral estimation, periodic component extraction, Bark-scale spectral estimation, and power estimation. As a result, the spectral coefficients are "flattened" and normalized across the frequency axis.

The weighted vector quantization process is carried out by interleaving the normalized spectral coefficients and dividing them into sub-vectors for vector quantization. For each sub-vector, a weighted distortion measure is applied to the conjugate-structure VQ, which uses a pair of codebooks. Perceptual control of the quantization noise is achieved in this way. The process is shown in Fig. 2.9.
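The two steps can be sketched as below. The stride-based interleaving pattern is an assumption, and the search uses a single codebook instead of the conjugate-structure pair, which is a deliberate simplification; the names `interleave`, `deinterleave`, and `nearest_codeword` are hypothetical.

```python
def interleave(coeffs, num_subvectors):
    """Spread coefficients over sub-vectors so each covers the whole spectrum."""
    return [coeffs[i::num_subvectors] for i in range(num_subvectors)]


def deinterleave(subvectors):
    """Inverse of interleave()."""
    n = len(subvectors)
    out = [0] * sum(len(s) for s in subvectors)
    for i, s in enumerate(subvectors):
        out[i::n] = s
    return out


def nearest_codeword(subvector, weights, codebook):
    """Index of the codeword minimizing the weighted squared error."""
    def dist(code):
        return sum(w * (x - c) ** 2 for w, x, c in zip(weights, subvector, code))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))


subvectors = interleave(list(range(12)), 3)
```

The weights make errors in perceptually important positions more expensive, so the selected codeword can change when the weights change even though the sub-vector stays the same.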

Fig. 2.9 TwinVQ quantization scheme [2]

2.3 MPEG-4 AAC Version 2

MPEG-4 AAC Version 2 was finalized in 1999. Compared to MPEG-4 Version 1, Version 2 adds several new tools to the standard: the Error Robustness tools, the Bit Slice Arithmetic Coding (BSAC) tool, and Low Delay AAC (LD-AAC). The BSAC tool provides fine-grain bitrate scalability, and LD-AAC codes general audio signals with low delay. We will introduce these new tools in this section.

2.3.1 Error Robustness

The Error Robustness tools provide improved performance on error-prone transmission channels. The two classes of tools are the Error Resilience (ER) tool and Error Protection (EP) tool.

The ER tool reduces the perceived distortion of the decoded audio signal that is caused by corrupted bits in the bitstream. The following tools are provided to improve the error robustness for several parts of an AAC bitstream frame: Virtual CodeBook (VCB), Reversible Variable Length Coding (RVLC), and Huffman Codeword Reordering (HCR). These tools allow the application of advanced channel coding techniques, which are adapted to the special needs of the different coding tools.

The EP tool provides Unequal Error Protection (UEP) for MPEG-4 Audio. UEP is an efficient method to improve the error robustness of source coding schemes. It is used by various speech and audio coding systems operating over error-prone channels such as mobile telephone networks or Digital Audio Broadcasting (DAB). The bits of the coded signal representation are first grouped into different classes according to their error sensitivity. Then error protection is individually applied to the different classes, giving better protection to more sensitive bits.

2.3.2 Bit Slice Arithmetic Coding Tool

The Bit-Sliced Arithmetic Coding (BSAC) tool provides efficient small step scalability for the GA coder. This tool is used in combination with the AAC coding tools and replaces the noiseless coding of the quantized spectral data and the scalefactors. The BSAC tool provides scalability in steps of 1 kbps per audio channel, which means 2 kbps steps for a stereo signal.

One base layer bitstream and many small enhancement layer bitstreams are used. The base layer contains the general side information, specific side information for the first layer and the audio data of the first layer. The enhancement streams contain only the specific side information and audio data for the corresponding layer.

To obtain fine step scalability, a bit-slicing scheme is applied to the quantized spectral data. First, the quantized spectral coefficients are grouped into frequency bands. Each group contains the quantized spectral coefficients in their binary representation. Then the bits of a group are processed in slices according to their significance. Thus, all of the most significant bits (MSB) of the quantized spectral coefficients in each group are processed first. These bit-slices are then encoded using an arithmetic coding scheme to obtain entropy coding with

coefficients are refined by providing more less significant bits (LSB), and the bandwidth is increased by providing bit-slices of the spectral coefficients in higher frequency bands.
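The bit-slicing step can be sketched as follows for one group of coefficient magnitudes. The arithmetic coding stage is omitted, and the names `bit_slices` and `reconstruct` are hypothetical.

```python
def bit_slices(magnitudes, num_bits):
    """Split magnitudes into bit-slices, most significant slice first."""
    return [[(m >> b) & 1 for m in magnitudes]
            for b in range(num_bits - 1, -1, -1)]


def reconstruct(slices, num_bits):
    """Rebuild magnitudes from the slices received so far (MSB first)."""
    values = [0] * len(slices[0])
    for k, bits in enumerate(slices):
        shift = num_bits - 1 - k
        values = [v | (bit << shift) for v, bit in zip(values, bits)]
    return values


slices = bit_slices([5, 2, 7, 0], num_bits=3)
coarse = reconstruct(slices[:2], num_bits=3)   # only the two upper slices
exact = reconstruct(slices, num_bits=3)        # all slices received
```

A decoder that receives only the first slices obtains a coarse approximation, and every additional slice refines the coefficients.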

2.3.3 Low-Delay Audio Coding

The MPEG-4 General Audio Coder provides very efficient coding of general audio signals at low bitrates. However, it has an algorithmic delay of up to several hundred milliseconds and is thus not well suited for applications requiring low coding delay, such as real-time bi-directional communication. To enable coding of general audio signals with an algorithmic delay not exceeding 20 ms, MPEG-4 Version 2 specifies a Low-Delay Audio Coder, which is derived from MPEG-2/4 Advanced Audio Coding (AAC). It operates at up to 48 kHz sampling rate and uses a frame length of 512 or 480 samples, compared to the 1024 or 960 samples used in standard MPEG-2/4 AAC. The size of the window used in the analysis and synthesis filterbank is also reduced by a factor of 2. No block switching is used, in order to avoid the "look-ahead" delay due to the block switching decision. To reduce the pre-echo phenomenon in case of transient signals, window shape switching is provided instead. For non-transient parts of the signal a sine window is used, while a so-called low overlap window is used in case of transient signals. Use of the bit reservoir is minimized in the encoder in order to reach the desired target delay. As one extreme case, no bit reservoir is used at all.
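As a small illustration of the sine window, the sketch below generates a window that spans two frames, that is, 1024 samples for a frame length of 512. The window formula and the two-frame span are assumptions of this sketch, since the text above does not give them; the name `sine_window` is hypothetical.

```python
import math


def sine_window(window_len):
    """Sine window of the given length (assumed form: sin(pi * (n + 0.5) / N))."""
    return [math.sin(math.pi * (n + 0.5) / window_len) for n in range(window_len)]


frame_len = 512
window = sine_window(2 * frame_len)

# Overlapping halves of consecutive windows add up to unit power, which is
# what allows the overlapped frames to be recombined without distortion.
power_sum = [window[n] ** 2 + window[n + frame_len] ** 2 for n in range(frame_len)]
```

Every entry of `power_sum` equals 1 up to rounding error.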

2.4 MPEG-4 AAC Version 3

MPEG-4 AAC Version 3 was finalized in 2003. Like MPEG-4 Version 2, Version 3 adds some new tools to increase the coding efficiency. The main tool is the SBR (spectral band replication) tool for bandwidth extension at low bitrates. The resulting scheme is called High-Efficiency AAC (HE AAC).

The SBR (spectral band replication) tool improves the performance of low bitrate audio either by increasing the audio bandwidth at a given bitrate or by improving coding efficiency at a given quality level. When MPEG-4 AAC is combined with the SBR tool, the encoder encodes the lower frequency bands only, and the decoder then reconstructs the higher frequency bands based on an analysis of the lower frequency bands. Some guidance information may be encoded in the bitstream at a very low bitrate to ensure that the reconstructed signal is accurate. The reconstruction is efficient for harmonic as well as for noise-like components and allows for proper shaping in the time domain as well as in the frequency domain. As a result, the SBR tool allows audio coding with a very large bandwidth at low bitrates.

Chapter 3 Introduction to DSP/FPGA

In our system, we use a Digital Signal Processor/Field Programmable Gate Array (DSP/FPGA) platform to implement the MPEG-4 AAC encoder and decoder. The DSP baseboard is Innovative Integration's Quixote, which houses a Texas Instruments TMS320C6416 DSP and a Xilinx Virtex-II FPGA. In this chapter, we describe the DSP baseboard, the DSP chip, and the FPGA chip. At the end, we introduce the data transmission between the Host PC and the DSP/FPGA

3.1 DSP Baseboard

Quixote combines one 600 MHz TMS320C6416 32-bit fixed-point DSP with a Xilinx Virtex-II XC2V2000/6000 FPGA on the DSP baseboard. This combination provides processing flexibility, efficiency, and high performance. Quixote has 32 MB of SDRAM for use by the DSP and 4 or 8 MB of zero bus turnaround (ZBT) SBSRAM for use by the FPGA. Developers can build complex signal processing systems by integrating these reusable logic designs with their specific application logic.

Fig. 3.1 Block Diagram of Quixote [5]

3.2 DSP Chip

The TMS320C64x fixed-point DSP uses the VelociTI architecture. The VelociTI architecture of the C6000 platform of devices uses advanced VLIW (very long instruction word) to achieve high performance through increased instruction-level parallelism, performing multiple instructions during a single cycle. Parallelism is the key to extremely high performance, taking the DSP well beyond the performance capabilities of traditional superscalar designs. VelociTI is a highly deterministic architecture, having few restrictions on how or when instructions are fetched, executed, or stored. It is this architectural flexibility that

Fig. 3.2 Block diagram of TMS320C6x DSP [6]

The internal memory of the TMS320C6416 has a two-level cache architecture with 16 KB of L1 data cache, 16 KB of L1 program cache, and 1 MB of L2 cache for data/program allocation. On-chip peripherals include two multi-channel buffered serial ports (McBSPs), two timers, a 16-bit host port interface (HPI), and a 32-bit external memory interface (EMIF).

Internal buses include a 32-bit program address bus, a 256-bit program data bus to accommodate eight 32-bit instructions, two 32-bit data address buses, two 64-bit data buses, and two 64-bit store data buses. With 32-bit address bus, the total memory space is 4 GB, including four external memory spaces: CE0, CE1, CE2, and CE3. We will introduce several important parts in this section.

3.2.1 Central Processing Unit (CPU)

Fig. 3.2 shows the CPU, which contains:

Program fetch unit

Instruction dispatch unit, advanced instruction packing

Instruction decode unit

Two data paths, each with four functional units

64 32-bit registers

Control registers

Control logic

Test, emulation, and interrupt logic

The program fetch, instruction dispatch, and instruction decode units can deliver up to eight 32-bit instructions to the functional units every CPU clock cycle. The processing of instructions occurs in each of the two data paths (A and B), each of which contains four functional units (.L, .S, .M, and .D) and 32 32-bit general-purpose registers. Fig. 3.3 shows the comparison of C62x/C67x with C64x CPU.

3.2.2 Data Path

Fig. 3.3 TMS320C64x CPU Data Path [6]

There are two general-purpose register files (A and B) in the C6000 data paths. The C64x DSP has double the number of general-purpose registers of the C62x/C67x cores, with 32 32-bit registers in each file (A0-A31 for file A and B0-B31 for file B).

There are eight independent functional units divided into two data paths. Each path has a unit for multiplication operations (.M), for logical and arithmetic operations (.L), for branch, bit manipulation, and arithmetic operations (.S), and for loading/storing and arithmetic operations (.D). The .S and .L units are for arithmetic, logical, and branch instructions. All data transfers make use of the .D units. Two cross-paths (1x and 2x) allow functional units from one data path to access a 32-bit operand from the register file on the opposite side. There can be a maximum of two cross-path source reads per cycle. Fig. 3.4 and Fig. 3.5 show the functional units and their operations.

Fig. 3.4 Functional Units and Operations Performed [7]

Fig. 3.5 Functional Units and Operations Performed (Cont.) [7]

3.2.3 Pipeline Operation

Pipelining is the key feature that makes parallel instructions work properly, and it requires careful timing. There are three pipeline stages: program fetch, decode, and execute, and each stage contains several phases. We will describe the function of the three stages and their associated phases in this section.

The fetch stage is composed of four phases.

PG: Program address generate

PS: Program address send

PW: Program address ready wait

PR: Program fetch packet receive

During the PG phase, the program address is generated in the CPU. In the PS phase, the program address is sent to memory. In the PW phase, a memory read occurs. Finally, in the PR phase, the fetch packet is received at the CPU.

The decode stage is composed of two phases.

DP: Instruction dispatch

DC: Instruction decode

During the DP phase, the instructions in an execute packet are assigned to the appropriate functional units. In the DC phase, the source registers, destination registers, and associated paths are decoded for the execution of the instructions in the functional units.

The execute stage is composed of five phases.

E1: Single cycle instruction complete.

E2: Multiply instruction complete.

E3: Store instruction complete.

E4: Multiply extensions instruction complete.

E5: Load instruction complete.

Different types of instructions require different numbers of these phases to complete their execution. These pipeline phases play an important role in understanding the device state at CPU cycle boundaries.
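The phase lists above can be summarized in a small sketch that returns the pipeline phases an instruction passes through until it completes. The instruction type names used as dictionary keys and the function name `pipeline_phases` are hypothetical labels for the five cases listed above.

```python
FETCH_PHASES = ["PG", "PS", "PW", "PR"]
DECODE_PHASES = ["DP", "DC"]

# Execute phase in which each instruction type completes (E1 to E5).
COMPLETING_EXECUTE_PHASE = {
    "single_cycle": 1,
    "multiply": 2,
    "store": 3,
    "multiply_extension": 4,
    "load": 5,
}


def pipeline_phases(instruction_type):
    """List the phases from program address generation to completion."""
    last = COMPLETING_EXECUTE_PHASE[instruction_type]
    execute = [f"E{i}" for i in range(1, last + 1)]
    return FETCH_PHASES + DECODE_PHASES + execute


load_phases = pipeline_phases("load")
```

A load therefore passes through all eleven phases, while a single-cycle instruction completes after seven.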

3.2.4 Internal Memory

The C64x has a 32-bit, byte-addressable address space. Internal (on-chip) memory is organized in separate data and program spaces. When external (off-chip) memory is used, these spaces are unified on most devices into a single memory space via the external memory interface (EMIF). The C64x has two 64-bit internal ports to access internal data memory, and a single port to access internal program memory, with an instruction-fetch width of 256 bits.

16 KB program L1 cache

1 MB L2 cache

64 EDMA channels

3 32-bit timers

3.3 FPGA

The Xilinx Virtex-II FPGA is fabricated in a 0.15 µm, 8-layer metal process and offers logic performance in excess of 300 MHz. We will introduce the FPGA logic in this section.

Virtex-II XC2V2000 FPGA contains:

2M system gates

56 x 48 CLB array (row x column)

10752 slices

24192 logic cells

21504 CLB flip-flops

336K maximum distributed RAM bits

Virtex-II XC2V6000 FPGA contains:

6M system gates

96 x 88 CLB array (row x column)

33792 slices

76032 logic cells

67584 CLB flip-flops

1056K maximum distributed RAM bits

A Configurable Logic Block (CLB) is a block of logic surrounded by routing resources. It provides the functional elements needed to build logic circuits. One CLB contains four slices; each slice contains two Logic Cells (LC); each LC includes a 4-input function generator, carry logic, and a storage element.
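The relation between the CLB array size, the slice count, and the number of storage elements follows directly from this structure and can be checked with a short calculation. The function name `clb_resources` is hypothetical.

```python
SLICES_PER_CLB = 4          # one CLB contains four slices
LOGIC_CELLS_PER_SLICE = 2   # each slice contains two logic cells
STORAGE_PER_LOGIC_CELL = 1  # each logic cell includes one storage element


def clb_resources(rows, cols):
    """Return (slices, storage elements) for a rows x cols CLB array."""
    slices = rows * cols * SLICES_PER_CLB
    storage = slices * LOGIC_CELLS_PER_SLICE * STORAGE_PER_LOGIC_CELL
    return slices, storage


xc2v2000 = clb_resources(56, 48)
```

For the 56 x 48 array of the XC2V2000 this gives 10752 slices and 21504 storage elements, which matches the slice and CLB flip-flop counts listed for that device.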

Fig. 3.6 General Slice Diagram [10]

The synthesis tool used for the Xilinx FPGA is Xilinx ISE 6.1. The simulation results are taken from the synthesis report and the place-and-route (P&R) report in ISE.

3.4 Data Transmission Mechanism

In this section, we will describe the transmission mechanism between the Host PC and the

3.4.1 Message Interface

The DSP and Host PC have a lower bandwidth communications link for sending
