The design of a hybrid filter bank for the psychoacoustic model in ISO/MPEG phases 1,2 audio encoder


586 IEEE Transactions on Consumer Electronics, Vol. 43, No. 3, AUGUST 1997

THE DESIGN OF A HYBRID FILTER BANK FOR THE PSYCHOACOUSTIC MODEL IN ISO/MPEG PHASES 1, 2 AUDIO ENCODER

Chi-Min Liu, Member, IEEE, and Wen-Chieh Lee

Department and Institute of Computer Science and Information Engineering National Chiao Tung University, Hsinchu, 30050, Taiwan

E-Mail: cmliu@csie.nctu.edu.tw

Abstract- The ISO/MPEG phase 1 and phase 2 audio compression standards are finding a wide range of applications. In the MPEG encoding process, the psychoacoustic model exploits audio irrelevancy, which plays the key role in achieving a high compression ratio without losing audio quality. However, the Fourier transform (FT) used by the two psychoacoustic models suggested in the standard draft has high computational complexity and hence leads to high hardware and software cost for real-time applications. This paper presents a new design, named the hybrid filter bank, to replace the FT. The hybrid filter bank can be integrated with the psychoacoustic models and has a much lower complexity than the FT. This paper also shows that the hybrid filter bank is better suited to stereo coding and hence can provide better quality for intensity stereo coding, which is the key technology for MPEG 1 to achieve near transparent quality at bit rates below 96x2 kbits for the two stereo channels.

I. Introduction

Like most perceptual audio coders [1]-[3], the MPEG audio encoder can be viewed as four parts: the time-frequency mapper, the psychoacoustic model, quantization, and frame packing, as shown in Fig. 1. The psychoacoustic model exploits audio irrelevancy, which is usually defined in the frequency domain. The time-frequency mapper maps the time-domain signals into a frequency representation to reduce data redundancy and to ease integration with the psychoacoustic model. The quantizer quantizes the audio signals from the time-frequency mapper based on the information from the psychoacoustic model. The frame packing packs the quantized signals together with synchronization information, such as the sampling frequency, for identification by MPEG decoders.

Fig. 1 The structure of the FFT-based MPEG encoder

In the encoding process of MPEG, a 1024-point Fourier transform (FT) is used by the psychoacoustic models to analyze the frequency components in the 1152 samples of one frame. If the conventional real-data fast FT (FFT) [4] is adopted to implement the FT, the complexity is on the order of 4*256*log(512) multiplications. Such a complexity leads to high implementation cost for real-time applications.

This paper presents a new design, named the hybrid filter bank, to replace the FT. The hybrid filter bank can be integrated with the psychoacoustic models and has a much lower complexity than the FT. This paper also shows that the hybrid filter bank is better suited to stereo coding and hence can provide better quality for intensity stereo coding, which is the key technology for MPEG 1 to achieve near transparent quality at bit rates below 96x2 kbits for the two stereo channels.

The rest of this paper is organized as follows. Section II illustrates the design of hybrid filter banks. The hybrid filter bank has problems with the phase shift and the aliasing components arising from the decimation in the 1st level filter bank. Section III provides the method to solve these two problems. Section IV considers the complexity and the integration of the hybrid filter banks with the psychoacoustic models in MPEG. Section V evaluates the design through spectrum analysis, subjective measure, and objective measure to show the feasibility of the hybrid filter bank. Section VI gives a brief conclusion.

This work was supported in part by Acer Laboratories Inc. under contract C85098.

II. Filter Response in Hybrid Filter Banks

The motivation for the hybrid filter banks comes from the two frequency analyzers in the time-frequency mapper and the psychoacoustic model. MPEG has adopted a 32-band polyphase filter bank, which provides a frequency resolution of $\pi/32$ with a sidelobe attenuation of 96 dB, while the FT with a Hann window provides a resolution of $\pi/512$ with an attenuation of 32 dB. The approach of the hybrid filter bank is to cascade another filter bank, named the second (2nd) level filter bank, to the output of the original polyphase filter bank, named the first (1st) level filter bank, to achieve a high frequency resolution. The block diagram of the hybrid filter bank is shown in Fig. 2.

Fig. 2 Structure of MPEG encoder based on the hybrid filter banks

Fig. 3 shows the detailed structure of the hybrid filter bank. The structure adopts a 16-band filter bank based on the time domain aliasing cancellation (TDAC) filter bank [6] for each band of the 1st level filter bank to achieve a frequency resolution as high as that of the FT. The input-output relation of the TDAC filter bank is

$$X_i(k) = \sum_{n=0}^{N-1} h(n)\, x_i(n) \cos\left(\frac{\pi}{2N}\left(2n+1+\frac{N}{2}\right)(2k+1)\right), \quad k = 0, \ldots, \frac{N}{2}-1 \tag{1}$$

where $x_i(n)$ is the nth output of band i from the 1st level polyphase filter bank, $X_i(k)$ is the corresponding output of the 2nd level filter bank, and $h(n)$ is the window function deciding the band selectivity in the 2nd level filter bank. To achieve a frequency resolution of $\pi/512$, the same as the FT, the value of N is set to 32 (16 bands in each of the 32 subbands). Also, to have a frequency selectivity the same as the FT, we select the window function

$$h(n) = \sin\left(\frac{\pi}{N}\left(n+\frac{1}{2}\right)\right), \quad n = 0, \ldots, N-1 \tag{2}$$

which has a sidelobe attenuation of 24 dB, as shown in Fig. 4. The function has the property

$$h^2(n) + h^2\left(n+\frac{N}{2}\right) = 1, \quad n = 0, \ldots, \frac{N}{2}-1 \tag{3}$$

which is a necessary condition for perfect reconstruction filter banks [5]. Substituting (2) into (1) yields

$$X_i(k) = \sum_{n=0}^{N-1} \sin\left(\frac{\pi}{N}\left(n+\frac{1}{2}\right)\right) x_i(n) \cos\left(\frac{\pi}{2N}\left(2n+1+\frac{N}{2}\right)(2k+1)\right), \quad k = 0, \ldots, \frac{N}{2}-1 \tag{4}$$

Fig. 3 Detailed structure of the hybrid filter bank
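The 2nd level analysis described above can be illustrated with a minimal sketch. It assumes the TDAC relation has the standard windowed-cosine form, with the cosine argument $\frac{\pi}{2N}(2n+1+\frac{N}{2})(2k+1)$ that appears in the phase-shifted expression of Section III. The function names `sine_window` and `tdac_analysis`, the test tone and the direct O(N^2) evaluation are illustrative choices, not the fast algorithm of [10].

```python
import math

def sine_window(N):
    """Window h(n) = sin(pi/N * (n + 1/2)) for n = 0..N-1."""
    return [math.sin(math.pi / N * (n + 0.5)) for n in range(N)]

def tdac_analysis(x, h):
    """Map one block of N subband samples to N/2 coefficients.

    Assumed kernel: cos(pi/(2N) * (2n + 1 + N/2) * (2k + 1)).
    Direct evaluation, written for clarity rather than speed.
    """
    N = len(h)
    return [
        sum(
            h[n] * x[n]
            * math.cos(math.pi / (2 * N) * (2 * n + 1 + N / 2) * (2 * k + 1))
            for n in range(N)
        )
        for k in range(N // 2)
    ]

N = 32                      # block length used in the paper (16 bands per subband)
h = sine_window(N)

# Largest deviation of h(n)^2 + h(n + N/2)^2 from 1 over the half block.
pr_error = max(abs(h[n] ** 2 + h[n + N // 2] ** 2 - 1.0) for n in range(N // 2))

# Hypothetical test input: a cosine matched to the basis function of band k0.
k0 = 5
x = [math.cos(math.pi / N * (2 * k0 + 1) * (n + 0.5 + N / 4)) for n in range(N)]
X = tdac_analysis(x, h)
peak_bin = max(range(N // 2), key=lambda k: abs(X[k]))
print(pr_error, peak_bin)
```

With N = 32 each 1st level subband is split into 16 bands, so the 32 subbands together give 512 bands, which matches the $\pi/512$ resolution quoted for the FT.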

Fig. 4 Power spectrum of the 2nd level filter bank (horizontal axis: normalized frequency, 0 to 511)

III. Phase Shifter & Alias Reduction

As mentioned in [7], [8], the hybrid filter bank has problems with the phase shift and the aliasing components arising from the 1st level filter bank. We follow a concept similar to that in [7], [8] to design a phase shifter and an alias reduction butterfly to solve these two problems.

A. Design of the phase shifter

Due to the decimation operation implied in the 1st level filter bank, the 1st level filter bank has a phase shift of $\pi$ in the odd-indexed subbands. The phase shift causes a reversed spectrum for the subband. If further spectral analysis is needed to achieve higher frequency resolution, this shift should be corrected. The phase shift can be corrected by multiplying the subband signal in the odd-indexed subbands by $(-1)^n$; that is

$$X_i(k) = \begin{cases} \displaystyle\sum_{n=0}^{N-1} (-1)^n \sin\left(\frac{\pi}{N}\left(n+\frac{1}{2}\right)\right) x_i(n) \cos\left(\frac{\pi}{2N}\left(2n+1+\frac{N}{2}\right)(2k+1)\right) & \text{for odd } i \\[2ex] \displaystyle\sum_{n=0}^{N-1} \sin\left(\frac{\pi}{N}\left(n+\frac{1}{2}\right)\right) x_i(n) \cos\left(\frac{\pi}{2N}\left(2n+1+\frac{N}{2}\right)(2k+1)\right) & \text{for even } i \end{cases} \quad k = 0, \ldots, \frac{N}{2}-1 \tag{5}$$

where odd/even stands for the odd/even indexed subbands of the 1st level filter bank. The phase shifter can be combined into the window function to avoid extra computation.
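The remark that the phase shifter can be folded into the window can be sketched as follows. The helper name `second_level_window` is hypothetical, and the sine window is the one given in Section II. The final check illustrates why the factor $(-1)^n$ undoes the spectrum reversal: it moves a component at frequency $\omega$ to $\pi - \omega$.

```python
import math

def second_level_window(N, subband_index):
    """Sine window with the (-1)^n correction folded in for odd subbands.

    For even subbands this is the plain window h(n); for odd subbands every
    second tap changes sign, so no separate multiplication pass is needed.
    """
    sign = -1 if subband_index % 2 else 1
    return [(sign ** n) * math.sin(math.pi / N * (n + 0.5)) for n in range(N)]

N = 32
h_even = second_level_window(N, 4)   # even-indexed subband: unchanged window
h_odd = second_level_window(N, 5)    # odd-indexed subband: alternating signs

# Multiplying cos(w*n) by (-1)^n gives cos((pi - w)*n): the spectrum is mirrored.
w = 0.3  # arbitrary illustrative frequency in radians per sample
mirror_error = max(
    abs((-1) ** n * math.cos(w * n) - math.cos((math.pi - w) * n))
    for n in range(N)
)
print(mirror_error)
```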

B. Design of the aliasing reduction butterfly

It is well known that the decimation operation leads to aliasing, and there is decimation in the hybrid filter banks. The aliasing effects indicate a many-to-one merging between the input frequencies and the output frequency of the filter banks, and hence make it difficult to distinguish the "many" frequency components from the "one" frequency component. The merged frequencies and the corresponding merging weights are decided by the filter bandwidth and the magnitude response of the filters in the filter banks. For the filter bank designed in the last section, since the sidelobe attenuation is around 24 dB, the aliasing term of the frequency in a filter band can be reasonably approximated by the frequency components from the nearest neighboring band. For the hybrid filter bank design in Fig. 3, aliasing arises from both the 1st level and the 2nd level filter banks. The aliasing terms in the 1st level filter bank lead to the merging of frequencies with a distance as far as $\pi/32$, while those in the 2nd level filter bank merge frequencies with a distance of $\pi/512$. Since the psychoacoustic models in MPEG need a frequency resolution of $\pi/512$, the aliasing terms from the 1st level filter bank should be suitably corrected to increase the frequency resolution.

Fig. 5 Alias in neighboring subbands (horizontal axis: normalized frequency)

Fig. 5 shows the frequency responses of two neighboring filters in the 1st level filter bank before decimation. The lattice lines in Fig. 5 show the resolution boundaries of the 2nd level filter bands. The cross lines in Fig. 5 show the bands merged by the decimation in the 1st level filter bank.

Edler [7] has designed the butterfly structure in Fig. 6 to reduce the aliasing errors in hybrid filter banks. The hybrid structure in Fig. 3 includes the butterfly structure to compensate for the aliasing terms. The butterfly operation is

$$\begin{aligned} u_i &= d_m\,(r_i - c_m r_j) \quad \text{with } i = 16 \cdot k - 1 - m \\ u_j &= d_m\,(r_j + c_m r_i) \quad \text{with } j = 16 \cdot k + m \end{aligned} \qquad \text{with } d_m = \frac{1}{\sqrt{1+c_m^2}}, \quad -\frac{N}{2} \le m \le -1 \tag{6}$$

where $r_i$ and $r_j$ are the band signals from the 2nd level filter bank indicated by the cross lines in Fig. 5, and $u_i$ and $u_j$ are the resulting signals after the correction. The $c_m$ and $d_m$ are the two weighting factors designed to compensate for the aliasing errors in each band. The values of these two factors vary with the bands labeled as m in Fig. 5. In the following we show the method to obtain the values of these weighting factors.
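The butterfly operation can be sketched in a few lines, assuming the cross-coupled form $u_i = d_m(r_i - c_m r_j)$, $u_j = d_m(r_j + c_m r_i)$ with $d_m = 1/\sqrt{1+c_m^2}$. The function name `alias_butterfly`, the sample band values and the value 0.25 used for $c_m$ are illustrative assumptions and are not taken from Table 1.

```python
import math

def alias_butterfly(r_i, r_j, c_m):
    """Apply one alias reduction butterfly to a pair of mirrored bands.

    d_m = 1/sqrt(1 + c_m^2) normalizes the pair, which makes the operation
    a plane rotation: it costs 4 multiplications and 2 additions per pair
    once c_m*d_m and d_m are precomputed.
    """
    d_m = 1.0 / math.sqrt(1.0 + c_m * c_m)
    u_i = d_m * (r_i - c_m * r_j)
    u_j = d_m * (r_j + c_m * r_i)
    return u_i, u_j

# Hypothetical band values and weighting factor, for illustration only.
r_i, r_j, c_m = 0.8, -0.3, 0.25
u_i, u_j = alias_butterfly(r_i, r_j, c_m)

# Because of the normalization by d_m, the energy of the pair is preserved.
energy_in = r_i ** 2 + r_j ** 2
energy_out = u_i ** 2 + u_j ** 2
print(u_i, u_j, energy_in, energy_out)
```

Under this form a weighting factor of $c_m = 0$ gives $d_m = 1$ and leaves the band pair unchanged.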

C. Design of the weighting factors in the butterfly

For the bands other than those labeled as m=-1 and 1, the weighting factors are calculated using the ratio between the filter response energy in the signal band and that of the aliasing band:

$$c_m = \frac{\text{Energy of alias band } m}{\text{Energy of signal band } m} = \frac{\displaystyle\int_{\omega \in \text{aliasing band}} |H(\omega)|^2 \, d\omega}{\displaystyle\int_{\omega \in \text{signal band}} |H(\omega)|^2 \, d\omega} \tag{7}$$

where $H(\omega)$ is the frequency response of one filter in the 1st level filter bank.

Fig. 6 Structure of the alias reduction butterfly

Table 1 Weighting factors of the alias reduction butterfly (values legible in the source: 0.00429, 0.99824, 0.00049, 1.00000)

However, the compensation should be modified for the bands labeled as m=-1 and 1. As described above, there is also aliasing from the 2nd level filter bank. For example, the band labeled as m=2 has aliasing terms from the bands labeled as m=1 and m=3. However, the aliasing terms for m=-1 and m=1 come only from the bands m=-2 and m=2, respectively. To take this special effect into account in the butterfly, the weighting factors for m=-1, 1 are calculated as

$$c_1 \text{ or } -c_{-1} = \frac{\text{Energy of alias band}\,(1-\gamma)}{\text{Energy of signal band } m} \tag{8}$$

where $\gamma$ is the ratio between the filter response energy of the signal and the aliasing terms in the 2nd level filter bank. Table 1 summarizes the values of the weighting factors.

Fig. 7 Hybrid filter bank resolution vs. critical band

Table 2 Complexity of the hybrid structure compared with the FFT

| Algorithms of frequency mapping in psychoacoustic model | # of multiplications per 1152 samples | # of additions per 1152 samples |
|---|---|---|
| 1024 pt. FFT (real FFT) + Hann window | 4*256*log(512)+512=9728 | 2*256*log(512)+2*512*log(512)=9216 |
| 32 (32 pt. TDAC filter bank + window) | 32*16*log(32)+32*32=3584 | 32*32*log(32)=5120 |
| 32 (32 pt. TDAC filter bank + window + Alias cancellation) | 3584+32*6*2*2=4352 | 5120+32*6*2=5504 |
| 12 (TDAC + window + Alias cancellation) + critical bands | 12/32*(4352)=1632 | 12/32*(5504)=2064 |

Table 2 shows the complexity of the hybrid structure compared with the FFT. The 1024-point real-data FFT requires 256*log(512) complex multiplications and 512*log(512) complex additions, plus 512 multiplications for the Hann window, while the 32 2nd level TDAC filter banks with the 6 aliasing cancellation butterfly structures require only on the order of 32(16*log(32)+32+6*2*2) multiplications when the fast algorithm of the TDAC filter bank [10] is applied. Exploiting the perceptual resolution can reduce the complexity further, as indicated in row 4 of Table 2.
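The multiplication counts quoted above, together with the addition counts of the hybrid rows of Table 2, can be reproduced with a short calculation. This is a sketch under the assumption that log denotes the base-2 logarithm and that one complex multiplication costs four real multiplications; the variable names are illustrative.

```python
import math

LOG_512 = int(math.log2(512))   # 9
LOG_32 = int(math.log2(32))     # 5

# 1024-point real-data FFT: 256*log(512) complex multiplications, each
# assumed to cost 4 real multiplications, plus 512 for the Hann window.
fft_mults = 4 * 256 * LOG_512 + 512

# 32 TDAC filter banks of length 32, including the window.
tdac_mults = 32 * 16 * LOG_32 + 32 * 32
tdac_adds = 32 * 32 * LOG_32

# Add 6 alias cancellation butterflies per subband
# (4 multiplications and 2 additions each).
hybrid_mults = tdac_mults + 32 * 6 * 2 * 2
hybrid_adds = tdac_adds + 32 * 6 * 2

# Keep only 12 of the 32 subbands, following row 4 of Table 2.
reduced_mults = 12 * hybrid_mults // 32
reduced_adds = 12 * hybrid_adds // 32

print(fft_mults, tdac_mults, hybrid_mults, reduced_mults)
print(tdac_adds, hybrid_adds, reduced_adds)
print(round(hybrid_mults / fft_mults, 3))
```

Under these assumptions the full hybrid structure needs somewhat less than half of the multiplications of the FFT, and the 12-subband variant needs about one sixth.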

B. Cooperating with the intensity mode

The other advantage of substituting the hybrid structure for the FT in the psychoacoustic models of MPEG lies in stereo encoding. As mentioned in our previous paper [9], intensity stereo coding is the key technology for layer 2 in MPEG 1 to achieve near transparent quality at a bit rate as low as 96x2 kbits for the two stereo channels. However, the original FT analysis has problems in maintaining a frequency analysis that is consistent with the stereo signals. When the high frequency parts of the two stereo channels are combined into one channel in intensity stereo coding or in the scheme mentioned in [9], as shown in Fig. 8, the original FT analysis result is not representative of the frequency content of the combined channels.

One way to overcome this inconsistency is to recalculate the FT analysis and the psychoacoustic model for the two channels based on the combined channels. This recalculation leads to a heavy computing load. On the other hand, when these stereo coding schemes are applied, the hybrid structure can easily be tuned to give a consistent analysis. The frequency analysis and the corresponding psychoacoustic model need to be modified only on the part of the frequency range that belongs to the combined channels. The hybrid filter bank cooperating with the intensity stereo coding scheme is shown in Fig. 9.

C. Tonality measure

The determination of the tonality of a spectrum line or a band is important in the psychoacoustic model for calculating the sensitivity of human hearing to the lines or bands. The psychoacoustic model 2 in the MPEG draft considers the tonality through a simple prediction calculated in polar coordinates in the complex plane [2]. This tonality detection was originally designed for the complex numbers at the output of the Fourier transform. Since the output of the hybrid filter bank presented in this paper is real data, the detection mechanism should be suitably modified. The predicted magnitude for a spectrum line is denoted as $\hat{r}(t,f)$, which is calculated from the two preceding magnitudes $r(t-1,f)$ and $r(t-2,f)$:

$$\hat{r}(t,f) = r(t-1,f) + \left(r(t-1,f) - r(t-2,f)\right) \tag{9}$$

where t and f represent the index of time and frequency, respectively. The tonality factor $c(t,f)$ used in psychoacoustic model 2 can now be obtained as

$$c(t,f) = \frac{\sqrt{r(t,f)^2 - \hat{r}(t,f)^2}}{r(t,f) + \operatorname{abs}\left(\hat{r}(t,f)\right)} \tag{10}$$

For tone signals, the prediction turns out to be very good, and $c(t,f)$ will have a value near zero. On the other hand, for very unpredictable signals such as noise, $c(t,f)$ will have a value near 1.
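The behaviour described above can be illustrated with a small sketch of the magnitude predictor, which extrapolates linearly from the two preceding magnitudes. The normalized error used here, $|r - \hat{r}| / (r + |\hat{r}|)$, is an assumption chosen only to show the two limiting cases; it is not claimed to be the paper's exact tonality factor. The function names and the sample magnitudes are hypothetical.

```python
def predict_magnitude(r_prev1, r_prev2):
    """Linear extrapolation from the two preceding magnitudes:
    r_hat(t) = r(t-1) + (r(t-1) - r(t-2))."""
    return r_prev1 + (r_prev1 - r_prev2)

def normalized_prediction_error(r_now, r_hat):
    """Assumed stand-in for the tonality factor: 0 for a perfect
    prediction, 1 when the prediction fails completely."""
    denom = r_now + abs(r_hat)
    return abs(r_now - r_hat) / denom if denom > 0 else 0.0

# Steady tone: the magnitude of the line does not change from frame to frame.
tone = [1.0, 1.0, 1.0]          # r(t-2), r(t-1), r(t)
tone_hat = predict_magnitude(tone[1], tone[0])
tone_c = normalized_prediction_error(tone[2], tone_hat)

# Noise-like line: the magnitude fluctuates unpredictably.
noise = [0.9, 0.1, 0.8]         # r(t-2), r(t-1), r(t)
noise_hat = predict_magnitude(noise[1], noise[0])
noise_c = normalized_prediction_error(noise[2], noise_hat)

print(tone_c, noise_c)
```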

V. Quality Measure

The effects of the hybrid filter bank and the corresponding modifications can be illustrated by comparing the spectrum from the FT with that from the hybrid filter bank. The spectrum analysis for a signal with five components at frequencies 400Hz, 800Hz, 1600Hz, 3200Hz and 6400Hz is shown in Fig. 10 for the FT (dotted line), the hybrid filter bank without alias reduction (dashed line, shifted up by 100 dB) and the hybrid filter bank with alias reduction (solid line, shifted up by 200 dB). The location of each frequency component in the hybrid filter bank is almost the same as in the FT, and the alias reduction effectively reduces the aliasing terms of the hybrid filter bank.

Fig. 10 Signal with frequency components located at 400Hz, 800Hz, 1600Hz, 3200Hz and 6400Hz analyzed by the 1024 pt. FT (dotted line), the hybrid filter bank (dashed line) and the hybrid filter bank with alias reduction butterfly (solid line); horizontal axis: frequency (bin)

Several audio segments have been used to measure the signal-to-masking ratio [9] from the FT and the various hybrid filter banks. Two of the results are shown in Fig. 11 and Fig. 12, where the FT is denoted by the solid line, the hybrid filter bank with alias reduction by the dotted line, and the hybrid filter bank with only 12 bands in the 2nd level by the dashed line. The results show that the hybrid filter bank with low complexity can provide a result similar to the FT.

Fig. 11 Average signal-to-masking ratio of each subband for female vocal sound (horizontal axis: subband)

Fig. 12 Average signal-to-masking ratio of each subband for classical symphony orchestra (horizontal axis: subband)

Also, informal listening tests show that the differences between the audio segments coded with the FT-based psychoacoustic model and those coded with the hybrid filter bank are almost imperceptible.

VI. Concluding Remarks

This paper has presented a new design, named the hybrid filter bank, to replace the FT adopted in the psychoacoustic models suggested in the draft on MPEG phases 1 and 2 audio coding. This paper has given the means to solve the phase shift and aliasing problems in the hybrid structure. The hybrid filter bank can be well integrated with the psychoacoustic model and provides a much lower complexity than the FT. We have also shown that the hybrid filter bank can cooperate with the intensity stereo coding scheme to obtain higher audio quality. Due to the flexibility of the hybrid filter bank, a psychoacoustic model consistent with the intensity stereo coding channel can be obtained with little additional computation. The hybrid filter bank has been tested through spectrum analysis, subjective measure, and objective measure to show its feasibility.

References

[1] J. D. Johnston, “Transform coding of audio signals using perceptual noise criteria,” IEEE Journal on Selected Areas in Communications, vol. 6, no. 2, pp. 314-323, Feb. 1988.

[2] K. Brandenburg, J. D. Johnston, “Second level perceptual audio coding: the hybrid coder,” The 88th Convention of AES, March 13-16, 1990.

[3] R. N. J. Veldhuis, “Bit rates in audio source coding,” IEEE Journal on Selected Areas in Communications, vol. 10, no. 1, pp. 86-96, Jan. 1992.

[4] E. O. Brigham, “The fast Fourier transform and its application,” Prentice Hall Inc., 1988.

[5] P. P. Vaidyanathan, “Multirate digital filters,” Prentice Hall Inc., 1993.

[6] J. Princen, A. Johnson, A. Bradley, “Subband/transform coding using filter bank designs based on time domain aliasing cancellation,” Proc. of the ICASSP 1987, pp. 2161-2164.

[7] B. Edler, “Aliasing reduction in sub-bands of cascaded filter banks with decimation,” Electronics Letters, vol. 28, no. 12, pp. 1104-1106, Jun. 1992.

[8] K. Brandenburg, E. Eberlein, J. Herre, B. Edler, “Comparison of filterbanks for high quality audio coding,” IEEE International Symposium on Circuits and Systems, vol. 3, pp. 1336-1339, 1992.

[9] C. M. Liu and J. C. Liu, “A new intensity stereo coding scheme for MPEG audio encoder- layer I and II,” IEEE Trans. on Consumer Electronics, vol. 42, pp. 535-539, Aug. 1996.

[10] T. Sporer, K. Brandenburg, B. Edler, “The use of multirate filter banks for coding of high quality digital audio,” The 6th European Signal Processing Conference, vol. 1, pp. 211-214, Jun. 1992.

Chi-Min Liu received the B.S. degree in electrical engineering from Tatung Institute of Technology, Taiwan, R.O.C. in 1985, and the M.S. degree and Ph.D. degree in electronics from National Chiao Tung University, Hsinchu, Taiwan, in 1987 and 1991, respectively. He is currently an Associate Professor of the Department of Computer Science and Information Engineering, National Chiao Tung University, Hsinchu, Taiwan. His research interests include video/audio compression, speech recognition, radar processing, and application-specific VLSI architecture design.

Wen-Chieh Lee received the B.S. degree from the Department of Computer Science and Information Engineering, National Chiao Tung University, Hsinchu, Taiwan in 1995. He is currently a Ph.D. student of the Department of Computer Science and Information Engineering, National Chiao Tung University, Hsinchu, Taiwan. His research interest is in the area of audio compression.