Chapter 2 MPEG-2 Advanced Audio Coding
2.7 Noiseless Coding
The input to the noiseless coding module is the set of 1024 quantized spectral coefficients. Since the noiseless coding is done inside the quantizer inner loop, it is part of an iterative process that converges when the total bit count meets the available bit count. The noiseless coding uses sectioning and variable-length Huffman coding (entropy coding). It exploits statistical redundancy to efficiently encode the 1024 coefficients. Sectioning is a powerful technique that reduces the bit-rate by coding the coefficients in groups of two or four.
When there are eight short windows in a frame, grouping and interleaving mechanisms are used for better coding efficiency. The coefficients associated with contiguous short windows can be grouped to share scalefactors among all scalefactor bands within the group.
In addition, the coefficients within a group are interleaved by interchanging the order of the scalefactor bands and windows.
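The interleaving can be illustrated with a minimal sketch. It assumes the coefficients of one group are held as nested lists indexed by window and then by scalefactor band; this layout and the function name are illustrative only and are not taken from the standard.

```python
def interleave_group(windows):
    """Reorder the coefficients of one group of short windows.

    windows[w][b] is the list of coefficients of scalefactor band b in
    window w (hypothetical layout). The input is window-major; the
    output is band-major, so band b of every window in the group
    becomes one contiguous run.
    """
    num_bands = len(windows[0])
    out = []
    for b in range(num_bands):
        for w in range(len(windows)):
            out.extend(windows[w][b])
    return out


group = [
    [[1, 2], [3, 4]],  # window 0: band 0, band 1
    [[5, 6], [7, 8]],  # window 1: band 0, band 1
]
interleaved = interleave_group(group)
```

After interleaving, the coefficients that share one scalefactor lie next to each other, which is what makes the later sectioning more effective.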
In order to increase compression, scalefactors associated with the scalefactor bands that have zero-valued coefficients are ignored in the noiseless coding and do not have to be transmitted. Both the global gain and scalefactors are quantized in 1.5 dB steps. The scalefactors are normalized by the global gain. The global gain is coded as an 8-bit unsigned integer, and the scalefactors are differentially encoded relative to the previous scalefactor value.
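The differential coding of the scalefactors can be sketched as follows. The sketch assumes that the first transmitted scalefactor is differenced against the global gain and that bands containing only zero-valued coefficients are skipped; the function names are hypothetical, and the Huffman coding of the differences is omitted.

```python
def encode_scalefactors(global_gain, scalefactors, band_is_zero):
    """Return the differences that would be passed to the entropy coder.

    Assumptions of this sketch: the first transmitted scalefactor is
    differenced against global_gain, and nothing is sent for a band
    whose coefficients are all zero.
    """
    if not 0 <= global_gain <= 255:
        raise ValueError("global gain must fit in an 8-bit unsigned integer")
    diffs = []
    previous = global_gain
    for sf, is_zero in zip(scalefactors, band_is_zero):
        if is_zero:
            continue  # scalefactor of an all-zero band is not transmitted
        diffs.append(sf - previous)
        previous = sf
    return diffs


def decode_scalefactors(global_gain, diffs):
    """Rebuild the transmitted scalefactors by accumulating the differences."""
    values = []
    previous = global_gain
    for d in diffs:
        previous += d
        values.append(previous)
    return values


diffs = encode_scalefactors(120, [118, 121, 0, 119], [False, False, True, False])
restored = decode_scalefactors(120, diffs)
```

Because neighbouring scalefactors tend to be close in value, the differences cluster around zero and are cheap to entropy code.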
The noiseless coding segments the set of 1024 quantized spectral coefficients into sections, such that a single Huffman codebook is used to code each section. Huffman coding is used to represent n-tuples of quantized coefficients, and 12 codebooks are available. The spectral coefficients within an n-tuple are ordered, and the n-tuple size is two or four coefficients. Each codebook specifies the maximum absolute value that it can represent and the n-tuple size.
Most codebooks represent unsigned values in order to save codebook storage.
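A minimal sketch of how an unsigned codebook can be used is given below. It assumes that the magnitudes of an n-tuple select the codeword and that one sign bit per nonzero coefficient follows it, with 1 meaning negative; the helper names are hypothetical and the codebook lookup itself is not shown.

```python
def to_tuples(coeffs, n):
    """Split quantized coefficients into consecutive n-tuples (n is 2 or 4)."""
    if n not in (2, 4):
        raise ValueError("n-tuple size must be 2 or 4")
    if len(coeffs) % n != 0:
        raise ValueError("number of coefficients must be a multiple of n")
    return [tuple(coeffs[i:i + n]) for i in range(0, len(coeffs), n)]


def split_signs(tup):
    """Separate magnitudes from signs for an unsigned codebook.

    Assumed convention for this sketch: one sign bit per nonzero
    coefficient, 1 for negative and 0 for positive.
    """
    magnitudes = tuple(abs(c) for c in tup)
    sign_bits = [1 if c < 0 else 0 for c in tup if c != 0]
    return magnitudes, sign_bits


pairs = to_tuples([0, -1, 2, 0], 2)
```

Storing only magnitudes means a codebook for values up to m needs (m + 1)^n entries instead of (2m + 1)^n, which is the storage saving mentioned above.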
Chapter 3 MPEG-4 High Efficiency Advanced Audio Coding
In this chapter, we introduce several basic concepts and major modules of the MPEG-4 High Efficiency AAC (HE-AAC) system and the Spectral Band Replication (SBR) tool. SBR is a bandwidth extension technique developed by Coding Technologies. It enables an audio codec to operate at a lower bit-rate without sacrificing sound quality. Details can be found in [2] and [4], respectively.
3.1 MPEG-4 High Efficiency Advanced Audio Coding
MPEG-4 High Efficiency Advanced Audio Coding (HE-AAC) is a combination of MPEG AAC and the spectral band replication (SBR) tool. In December 2001, SBR was submitted to MPEG and became the reference model of the MPEG-4 version 3 audio standardization process. SBR was finalized during the March 2003 MPEG meeting (14496-3:2001/Amd.1:2003). SBR is the bandwidth extension technology developed by Coding Technologies in Germany. It relies on the observation that the human ear is more sensitive to low frequencies than to high frequencies. At the encoder side, the low frequency part of the audio signal is encoded with the regular method, and the high frequency part is represented by a small amount of side information. At the decoder side, the side information is used to reconstruct the high frequency component of the audio signal. HE-AAC is also called aacPlus. It is able to deliver a high quality audio signal at a 30% lower bit-rate, at the cost of increased complexity. It delivers good audio quality at 24 kbps for mono and 48 kbps for stereo signals. SBR is not a self-contained audio coder. It has also been integrated with MPEG Layer III (mp3); mp3PRO is the result of combining mp3 with SBR. Our audio codec, HE-AAC, is the MPEG-2 AAC LC profile with SBR because of the memory consideration.
Figure 3.1 shows the block diagram of the SBR module and the audio coder [4]. SBR acts as a pre-process to the audio encoder and as a post-process to the core decoder. We will describe the SBR tool in Section 3.3 and demonstrate how this tool achieves good coding efficiency.
Figure 3.1 The block diagram of SBR module and audio coder [4].
3.2 Spectral Band Replication
3.2.1 Why SBR Improves Audio Coding
Research on perceptual audio coding started about twenty years ago. As a result, MP3 and AAC were developed with high compression efficiency. As shown in Figure 3.2, the encoder estimates the masking threshold and shapes the quantization noise in the frequency domain so that it stays below the masking threshold. This achieves fine audio quality at low bitrates.
Although today’s perceptual waveform codecs already achieve good compression, their efficiency is not high enough to meet the bandwidth limitations of broadcasting and wireless systems. If the bitrate of the aforementioned audio codecs is lowered significantly, the maximum distortion exceeds the masking threshold. One way to solve this problem is to limit the audio bandwidth to achieve a lower bitrate. In this case, the high frequency signals are generated from a small amount of side information. Since no high frequency signals need to be encoded, more bits are available for encoding the remainder of the spectrum (the lowband signals). HE-AAC encodes the lowband signals on the encoder side and reconstructs the full-bandwidth audio signal on the decoder side with the help of the SBR technique. We thus obtain good audio quality at a lower bitrate.
The SBR technology can be combined with any perceptual audio codec in a backward compatible way, as shown in Figure 3.1. It is based on the fact that there is usually a high correlation between the lower and higher frequency parts of an audio signal. Hence, we can use the lowband signals to reconstruct the highband signals. Only a small amount of side information needs to be carried in the bitstream in order to reconstruct the highband signals.
On the decoder side, the highband signals are reconstructed by a high quality transposition algorithm. Figure 3.3(a) shows the transposition from the lowband signal to the highband signal. Transposition alone, however, is insufficient for reconstructing the highband signals. The decoder also uses the side information sent from the encoder to adjust the highband signals: the energy envelope, inverse filtering to cancel tones, and the addition of noise and sines to maintain the tonal-to-noise ratio, as shown in Figure 3.3(b). Figure 3.3(c) shows the high frequency reconstruction through SBR.
[Figure 3.3 panels, each plotting energy against frequency: (a) Transposition; (b) Energy shaping, adding tone and noise; (c) Reconstruction by SBR.]
Figure 3.3 (a) Creation of highband by transposition. (b) Envelope adjustment of highband. (c) High frequency reconstruction through SBR.
In summary, SBR enhanced codecs perform better because:
(1) SBR allows the reconstruction of the high frequency part of signals using a small amount of side information. The high frequency signals are not encoded anymore. It results in a significant coding gain.
(2) Traditional audio codecs, such as AAC, encode only the low frequency signals and can therefore operate at their optimum sampling rate. However, the optimum sampling rate is usually different from the desired output sampling rate. The SBR decoder can convert the codec sampling rate to the desired output sampling rate.
3.2.2 How SBR Works
The SBR system is used as a dual-rate system. The SBR encoder operates at the original sampling rate, and the AAC encoder operates at half the original sampling rate. The AAC encoder processes only the low frequency part of the audio signal, which is obtained with a downsampling filter. The computational load of the AAC encoder is lower because it processes half of the input data. The SBR encoder, however, is complex because it uses many modules to extract information about the high frequency signals. The following section briefly explains the SBR encoder system.
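The downsampling step of the dual-rate system can be sketched as a lowpass filter followed by decimation by two. The filter below is a generic Hann-windowed sinc with 31 taps, chosen only for illustration; it is an assumption of this sketch and not the downsampling filter of the 3GPP encoder.

```python
import math


def halfband_lowpass(num_taps=31):
    """Windowed-sinc lowpass with cutoff at a quarter of the input rate."""
    mid = (num_taps - 1) / 2
    taps = []
    for n in range(num_taps):
        t = n - mid
        ideal = 0.5 if t == 0 else math.sin(math.pi * t / 2) / (math.pi * t)
        hann = 0.5 - 0.5 * math.cos(2 * math.pi * n / (num_taps - 1))
        taps.append(ideal * hann)
    gain = sum(taps)
    return [h / gain for h in taps]  # normalize to unity gain at DC


def downsample_by_two(signal, taps):
    """Lowpass filter the signal, then keep every second sample."""
    filtered = []
    for n in range(len(signal)):
        acc = 0.0
        for k, h in enumerate(taps):
            if n - k >= 0:
                acc += h * signal[n - k]
        filtered.append(acc)
    return filtered[::2]


taps = halfband_lowpass()
core_input = downsample_by_two([1.0] * 200, taps)
```

The core encoder then sees half as many samples per frame, while the SBR encoder keeps working on the full-rate input.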
3.3 SBR Encoder
Figure 3.4 HE-AAC Encoder Overview [6].
Figure 3.4 shows the block diagram of the 3GPP HE-AAC encoder system. The SBR encoder works in parallel with the AAC encoder. The SBR encoder extracts the important parameters needed to ensure an accurate high frequency reconstruction at the decoder. The input signal is fed to a 64-channel Analysis Quadrature Mirror Filter (AQMF) bank, which is described in Section 3.3.2. The output from the filter
The spectral envelopes of the current frame are estimated over the time segments and with the frequency resolution given by the time/frequency grid. In order to achieve optimal quality, given the high frequency generator used in the decoder, several additional parameters apart from the spectral envelope are extracted. When the lowband signals are transposed to the highband, it may happen that the lowband contains a strong harmonic series while the original highband is noise-like, or that strong tonal components are present in the original highband but not in the lowband. To handle this inconsistency between the tonal-to-noise ratio of the original spectral bands and that of the replicated spectral bands, noise or sinusoids with suitable energy are added. The SBR data and other parameters are then entropy coded (Huffman coding). Information is exchanged between the SBR encoder and the AAC encoder in order to determine the optimal cutoff frequency between the AAC band and the SBR band. Finally, the HE-AAC encoder multiplexes the SBR bitstream into the AAC bitstream. Figure 3.5 shows the block diagram of the SBR encoder. Details can be found in [14].
Figure 3.5 SBR encoder block diagram [14].
3.3.2 Analysis Quadrature Mirror Filter (AQMF) Bank
On the SBR encoder side, subband filtering of the input signal is done by a 64-subband QMF bank. The outputs from the filterbank are complex-valued. The filtering comprises the following steps, in which an array x consisting of 640 time domain input samples is assumed. Higher indices into the array correspond to older samples. Figure 3.6 shows the QMF analysis window.
Figure 3.6 HE-AAC QMF analysis windowing [2]. Indices 0 to 31 represent different windows.
The QMF process is as follows:
1. Shift the samples in the array x by 64 positions. The oldest 64 samples are discarded and 64 new samples are stored in positions 0 to 63.
2. Multiply the samples of array x by the window c to produce the array Z: $Z[n] = x[n] \cdot c[n]$, for n = 0 to 639. The 640 window coefficients c are shown in Figure 3.7.
3. Sum the samples according to the formula
\[ u[n] = \sum_{j=0}^{4} Z[n + 128j], \quad n = 0, \ldots, 127, \]
to create the 128-element array u.
4. Calculate 64 new subband samples by the matrix operation X = Mu, where
\[ M[k][n] = \exp\!\left( \frac{i\pi}{128}\,(k + 0.5)(2n + 1) \right), \quad 0 \le k < 64, \; 0 \le n < 128. \]
X[k][j] corresponds to the jth subband sample of QMF subband k.
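The four steps can be sketched in a few lines of Python. Two points are assumptions of this sketch: the window is passed in as a parameter, because the 640 coefficients of the real table are defined in the standard and not reproduced here, and the modulation matrix is taken as exp(iπ/128 · (k + 0.5)(2n + 1)), which should be checked against [2]. The function name is hypothetical.

```python
import cmath
import math

NUM_BANDS = 64
BUFFER_LEN = 640


def qmf_analysis_step(x, new_samples, window):
    """Run steps 1 to 4 once; x is updated in place.

    x:           list of 640 samples, higher index = older sample
    new_samples: the 64 newest time domain samples, oldest first
    window:      640 window coefficients c[n] (placeholder values in
                 this sketch, not the table of the standard)
    Returns one complex sample for each of the 64 subbands.
    """
    # Step 1: shift by 64 positions and drop the 64 oldest samples.
    x[NUM_BANDS:] = x[:BUFFER_LEN - NUM_BANDS]
    x[:NUM_BANDS] = new_samples[::-1]  # newest sample ends up at index 0

    # Step 2: apply the window.
    z = [x[n] * window[n] for n in range(BUFFER_LEN)]

    # Step 3: fold the 640 windowed samples into 128 values.
    u = [sum(z[n + 128 * j] for j in range(5)) for n in range(128)]

    # Step 4: complex modulation, X = M u (assumed form of M).
    return [
        sum(u[n] * cmath.exp(1j * math.pi / 128 * (k + 0.5) * (2 * n + 1))
            for n in range(128))
        for k in range(NUM_BANDS)
    ]


state = [0.0] * BUFFER_LEN
flat_window = [1.0] * BUFFER_LEN  # placeholder window for illustration
subband_samples = qmf_analysis_step(state, list(range(64)), flat_window)
```

Calling the function 32 times with consecutive blocks of 64 input samples yields the 32 subband samples per subband that make up one SBR frame.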
Every loop produces 64 complex-valued subband samples, each representing the output from one filterbank subband. For every SBR frame the filterbank produces 32 subband samples from every filterbank subband, corresponding to a time domain signal of length 2048 samples.
Figure 3.7 Coefficients of the QMF bank window.
3.3.3 Frequency Band Tables
On the SBR encoder side, the following frequency band tables are used: a high frequency resolution table (fTableHigh), a low frequency resolution table (fTableLow), the noise floor frequency table (fTableNoise) and the master frequency band table (fMaster), which are defined according to subclause 4.6.18.3.2 in [2]. The parameters needed to define all frequency band tables are transmitted in the SBR bitstream header. The frequency band tables contain the frequency borders for each frequency band, represented as QMF subbands. Each frequency band is defined by a start frequency border and a stop frequency border. For the SBR header bitstream elements governed by either bs_header_extra_1 or bs_header_extra_2, there are default values, and these elements need to be transmitted only if they differ from the default values. Default values are defined in subclause 4.5.2.8.1 in [1]. The SBR header parameters are regarded as tuning parameters since they are strongly bitrate and sampling frequency dependent.
3.3.4 Time-Frequency Grid Generation
Information obtained from the analysis QMF bank is used to choose the appropriate time/frequency resolution of the current SBR frame. On the encoder side, the T/F grid generation algorithm calculates the start and stop time borders of the SBR envelopes and the noise floors in the current SBR frame. The algorithm assigns the current SBR frame to one of four classes: FIXFIX, FIXVAR, VARFIX and VARVAR. The class determines how the time borders of the SBR frame are set.
On the SBR decoder side, the T/F grid part of the bitstream payload describes the number of SBR envelopes and noise floors as well as the time segment associated with each SBR envelope and noise floor. Furthermore, it describes which frequency band tables to use for each SBR envelope. Four different SBR frame classes, FIXFIX, FIXVAR, VARFIX and VARVAR, are used, each of which has different capabilities with respect to time-frequency grid selection. Figure 3.8 shows an example of the time-frequency grid. Details can be found in [2], subclause 4.B.18.3.
On the encoder side, the SBR encoder of 3GPP HE-AAC employs three tools for the T/F grid generation: the transient detector, the frame splitter, and the frame generator, which are described in the following.
Figure 3.8 Example of the time-frequency grid
(A) Transient Detector
The transient detection is performed on subband samples of one frame length. The outputs from the transient detector are the variables tranFlag and tranPos. The first is a boolean indicating whether there is a transient in the processed frame, and the second specifies the position (in time slots) of the onset of the transient. The time/frequency grid generation module uses the output from the transient detector and the stored transient detection output from the previous frame to perform its operations. Figure 3.9 shows the flow chart of the transient detector.
[Figure 3.9 flow chart:
Begin.
Calculate the energy of each T/F grid cell in a frame (64 frequency bands and 32 time segments).
Test: (previous time segment energy > 203.125) and (0.9 × previous time segment energy > current time segment energy)?
No: Transient Flag = 0, Transient Position = 0.
Yes: Transient Flag = 1, Transient Position = current time slot position.
Return.]
Figure 3.9 The flow chart of transient detector.
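The decision in Figure 3.9 can be written as a short sketch. The two constants and the comparison are taken from the flow chart as printed; summing the energy over all subbands of a time segment and returning at the first segment that satisfies the test are assumptions of this sketch, and the function names are hypothetical.

```python
ENERGY_THRESHOLD = 203.125  # constant printed in Figure 3.9
ENERGY_RATIO = 0.9          # constant printed in Figure 3.9


def segment_energies(X):
    """Energy per time segment; X[k][t] is the complex sample of subband k.

    Summing over all subbands is an assumption of this sketch.
    """
    num_segments = len(X[0])
    return [sum(abs(X[k][t]) ** 2 for k in range(len(X)))
            for t in range(num_segments)]


def detect_transient(energies):
    """Return (tranFlag, tranPos) using the test shown in the flow chart."""
    for t in range(1, len(energies)):
        previous, current = energies[t - 1], energies[t]
        if previous > ENERGY_THRESHOLD and ENERGY_RATIO * previous > current:
            return 1, t  # flag set, position = current time segment
    return 0, 0


result = detect_transient([300.0, 300.0, 100.0])
```

The absolute threshold keeps very quiet passages from triggering the detector, whatever the ratio between neighbouring segments is.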
(B) Frame Splitter
The frame splitting is only active when the transient detector has detected the absence of a transient in the current frame (i.e. when transient Flag = 0). It operates on subband samples
transients) should be divided into two envelopes of equal size. Figure 3.10 shows the flow chart of frame splitter.
[Figure 3.10 flow chart:
Begin.
Calculate the split threshold.
Calculate the total lowband energy.
Calculate the total highband energy.
Calculate the d_ratio value, which depends on the lowband and highband energies.
Test: d_ratio > split threshold?
No: split Flag = 0.
Yes: split Flag = 1.]
Figure 3.10 The flow chart of the frame splitter. The split threshold depends on the sampling rate and bitrate.
(C) Frame Generator
The frame generator creates the time/frequency grid for one SBR frame. Input signals are provided by the transient detector and the frame splitter. The frame generator produces two outputs: The sbr_grid() portion of the bitstream, and an internal representation of the time/frequency grid to be used by the envelope and noise floor estimators.
When no transients are present (tranFlag = 0), FIXFIX class frames are used. The frame splitter decides whether to use one or two envelopes in the FIXFIX frames (splitFlag = 0 or splitFlag = 1 respectively). "Sparse" transients (separated by one or more frames with
frames. We do not show the block diagram of the frame generator; details can be found in [14], subclause 5.4.3.
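For the transient-free case, the behaviour of the frame generator can be sketched as follows. The sketch covers only FIXFIX frames with one envelope or two envelopes of equal size; the default of 16 time slots per frame and the function names are assumptions, and the classes used for frames with transients are deliberately left out.

```python
def fixfix_envelope_borders(num_time_slots, split_flag):
    """Time borders of the envelopes in a FIXFIX frame."""
    if split_flag:
        half = num_time_slots // 2
        return [0, half, num_time_slots]  # two envelopes of equal size
    return [0, num_time_slots]            # a single envelope


def choose_grid(tran_flag, split_flag, num_time_slots=16):
    """Return (frame class, envelope time borders) for one SBR frame.

    num_time_slots = 16 is an assumed frame length for this sketch.
    """
    if tran_flag == 0:
        borders = fixfix_envelope_borders(num_time_slots, split_flag)
        return "FIXFIX", borders
    # Frames with transients use the variable-border classes,
    # which this sketch does not model.
    raise NotImplementedError("only the FIXFIX case is sketched here")


grid = choose_grid(tran_flag=0, split_flag=1)
```

The returned borders are the kind of internal grid representation that the envelope and noise floor estimators consume.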
3.3.5 Envelope Estimator
On the SBR encoder side, the spectral envelopes of the current SBR frame are estimated over the time segment and with the frequency resolution given by the time/frequency grid represented by tE and r. The SBR envelope is estimated by averaging the squared complex subband samples over the given time/frequency regions.
n= = Number of frequency bands for low and high frequency resolution
In the case of stereo and coupling the energy is calculated according to:
( )
( )( )( ( ) ( ) )
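The averaging described at the start of this section can be sketched as below. The sketch treats the time borders and the frequency band borders as direct indices into the subband sample matrix, which is an assumption; the stereo coupling case is not covered, and the function name is hypothetical.

```python
def estimate_envelope(X, time_borders, freq_borders):
    """Average |X[k][t]|^2 over every time/frequency region.

    X[k][t]:       complex sample of QMF subband k at time index t
    time_borders:  for example [0, 8, 16], one envelope per interval
    freq_borders:  frequency band borders expressed as QMF subbands
    Returns envelope[e][b], the mean energy of band b in envelope e.
    """
    envelope = []
    for t0, t1 in zip(time_borders, time_borders[1:]):
        row = []
        for k0, k1 in zip(freq_borders, freq_borders[1:]):
            total = sum(abs(X[k][t]) ** 2
                        for k in range(k0, k1)
                        for t in range(t0, t1))
            row.append(total / ((k1 - k0) * (t1 - t0)))
        envelope.append(row)
    return envelope


# Four subbands and four time indices, every sample equal to 3 + 4j.
X = [[3 + 4j] * 4 for _ in range(4)]
env = estimate_envelope(X, [0, 2, 4], [0, 2, 4])
```

A coarser grid (fewer borders) gives fewer envelope values to transmit, at the price of a less detailed description of the highband energy.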
3.3.6 Additional Control Parameters
In order to achieve optimal results in the decoder, several additional parameters apart from the spectral envelope are needed. The noise floor is estimated for the current SBR frame. It is defined as the ratio between the energy of the noise and the energy of the High Frequency (HF) generator signal, where the noise energy is the energy that should be added to a particular frequency band in order to obtain a similar tonal-to-noise ratio.
The noise floor is estimated once or twice per SBR frame, depending on the number of spectral envelopes estimated for the SBR frame (indicated by tQ). The frequency resolution for the noise floor scalefactor is calculated according to the same algorithm subsequently used in the SBR decoder and described in [2] subclause 4.6.18.3. The start and stop time borders of the different noise floors are given by the time grid of the SBR encoder.
The level of the inverse filtering applied in the decoder is estimated for different frequency ranges. The inverse filtering estimation algorithm compares the original tonality and the tonality which will be produced by the High Frequency (HF) generator in the decoder.
The ratio between the two is mapped to four different inverse filtering levels, off, low, mid and high. These levels correspond to different chirp factors in the HF generator as outlined in [2] subclause 4.6.18.5.
On the SBR encoder side, the additional control parameters cover three factors: noise, inverse filtering and sine signals. The estimated noise is added in the decoder to obtain the same tonal-to-noise ratio as in the original. Inverse filtering is used on the decoder side to flatten tonal components of the reconstructed signal when the original signal is not tonal. The estimated sine signals are also added in the decoder: if the reconstructed high frequency signal lacks a sinusoidal component that is present in the original, a strong sinusoidal component is added to the corresponding frequency band. These three factors are all calculated from the output of the tonality estimation. Therefore, we only describe the tonality estimation in the following.
3.3.7 Tonality Estimation
The following detection modules produce their output based on a tonality estimate calculated in the tonality estimation module: noise floor estimation, inverse filtering estimation, and additional sines estimation. We only describe the tonality estimation module in this section, because its complexity is higher than that of the other three modules. The noise floor estimation, inverse filtering estimation, and additional sines estimation modules are described in [14], subclause 5.6.3.
The tonality is derived from the prediction gain of a second order linear prediction performed in every QMF subband. The linear prediction coefficients are calculated using the covariance method, and for every frame two tonality estimates are calculated for every subband. In equation 2.10, X is the matrix holding the most recently available complex QMF subband samples. The tonality values are calculated and stored in the T and Tsbr matrices. The Tsbr values are obtained from the T values by patching the tonality values similarly to the patching of the subband channels in the high frequency reconstruction modules in the decoder.
Since the subband signals are complex valued, this results in complex filter coefficients.
The prediction filter coefficients are obtained from the covariance method.
0
where k is the subband index.
(3.4)
Based on the covariance elements, the coefficients $\alpha_0^{l}(k)$ and $\alpha_1^{l}(k)$ used to calculate the tonality estimates for the subbands are calculated as:
The tonality values are calculated based on the above coefficients according to:
{ }
where l = 0 (lower half frame) or 1 (upper half frame).
(3.7)
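The idea of deriving tonality from a second order covariance-method predictor can be illustrated with the following sketch for a single subband. It is not the encoder's formula: the ratio of predicted energy to residual energy is an assumed stand-in for the prediction gain, the small diagonal loading is added only to keep the 2x2 system solvable for a pure tone, and the function names are hypothetical.

```python
import cmath
import random


def covariance(x, i, j):
    """phi(i, j) = sum over n >= 2 of x[n - i] * conj(x[n - j])."""
    return sum(x[n - i] * x[n - j].conjugate() for n in range(2, len(x)))


def tonality(x):
    """Predicted-to-residual energy ratio of one complex subband signal.

    The ratio is an assumption of this sketch, not the standard's
    tonality definition.
    """
    p11 = covariance(x, 1, 1).real
    p22 = covariance(x, 2, 2).real
    p12, p21 = covariance(x, 1, 2), covariance(x, 2, 1)
    p01, p02 = covariance(x, 0, 1), covariance(x, 0, 2)

    # Diagonal loading: keeps the system well conditioned for pure tones.
    load = 1e-9 * (p11 + p22) + 1e-30
    p11 += load
    p22 += load

    # Normal equations: a1*phi(1,j) + a2*phi(2,j) = -phi(0,j), j = 1, 2.
    det = p11 * p22 - p21 * p12
    a1 = (p21 * p02 - p01 * p22) / det
    a2 = (p12 * p01 - p11 * p02) / det

    total = covariance(x, 0, 0).real
    residual = sum(abs(x[n] + a1 * x[n - 1] + a2 * x[n - 2]) ** 2
                   for n in range(2, len(x)))
    return (total - residual) / (residual + 1e-12)


tone = [cmath.exp(1j * 0.3 * n) for n in range(64)]
random.seed(1)
noise = [complex(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(64)]
tone_value, noise_value = tonality(tone), tonality(noise)
```

A sinusoid is almost perfectly predictable, so its residual is tiny and the ratio is very large, while for noise the predictor removes little energy and the ratio stays small.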
The tonality values are patched similarly to the patching of the QMF subbands in the decoder during high frequency reconstruction. Hence, it is possible to compare tonality of a simulated