Development and implementation of cross-talk cancellation system in spatial audio reproduction based on subband filtering

JOURNAL OF SOUND AND VIBRATION

Journal of Sound and Vibration 290 (2006) 1269–1289

Mingsian R. Bai, Chih-Chung Lee

Department of Mechanical Engineering, National Chiao-Tung University, 1001 Ta-Hsueh Road, Hsin-Chu 300, Taiwan

Received 2 April 2004; received in revised form 25 April 2005; accepted 16 May 2005

Available online 8 August 2005

Abstract

An integrated spatial audio system based on subband filtering and a panel speaker array is developed in this paper. The system is intended for a personal computer and is capable of rendering sound images positioned arbitrarily around a listener in synchronization with the video image. It integrates spatial audio technologies including the head-related transfer function (HRTF), the reverberator, and the cross-talk cancellation system (CCS). Of particular importance is that inverse filtering with Tikhonov regularization is employed in designing the multi-channel filters that cancel the cross-talk. To reduce the computational load, the CCS is implemented only for frequencies below 5.5 kHz by using a subband filtering approach. The system is implemented on a 5 × 1 panel speaker array. Experimental investigation indicated that the proposed system is effective in creating an immersive sound field that complements the video rendering.

1. Introduction

A spatial audio system enables sound images to be positioned in arbitrary directions and at arbitrary distances. An immersive sensation in a three-dimensional (3D) sound field is created with the aid of computers and digital signal processing. Spatial audio finds applications in virtual reality, computer games, home theater, PC multimedia, flight and driving simulators, teleconferencing, and so forth. The directional cues for human hearing are embedded in the transformation of sound pressure from the free field to the ears of a listener. A head-related transfer function (HRTF) [1,2] is a measurement of this transformation for a specific sound location relative to the head, and describes the diffraction of sound by the torso, head, and external ear. A synthetic binaural signal can be created by convolving a sound with the appropriate HRTFs. Spatial audio effects are generally rendered with headphones or loudspeakers. While headphones are often used for binaural audio reproduction, they often suffer from in-head localization and front-back reversals [3], not to mention the inconvenience of wearing them. An alternative that avoids the problems of headphones is to use stereo loudspeakers for audio reproduction. Despite the benefits of loudspeaker rendering, cross-talk arises as a major problem that can adversely affect the 3D audio quality due to the Haas effect [4]. This motivates the development of cross-talk cancellation systems (CCS) that seek to minimize the influence of cross-talk in loudspeaker reproduction.

Corresponding author: M.R. Bai. Fax: +886 3 5720634. E-mail address: [email protected]

Several methods have been suggested to address the cross-talk cancellation problem. The first CCS was perhaps proposed by Schroeder and Atal [5], and later by Damaske and Mellert [6]. Their systems were limited to a zone allowing 75–100 mm of head movement, beyond which the spatial sound effect would vanish. Cooper and Bauck suggested a method that modeled the head as a sphere and calculated the ipsilateral and contralateral terms [7]. A similar method by Gardner approximates the effect of the head with geometric delays and low-pass filters that account for head shadowing [8]. Cooper and Bauck [9] and Bauck and Cooper [10] suggested a simplified CCS, or the "shuffler filter," by assuming left-right symmetry of the system functions. One issue that frequently arises in practical applications is the head movement of the listener. To cope with this issue, head-tracking CCS have been reported [11,12]. Alternatively, as suggested by Takeuchi and Nelson, the robustness of a CCS against head movement can be enhanced by placing two speakers close together to form the so-called stereo dipole [13]. An extension of this concept is the optimal source distribution (OSD) system [14].

In this work, a spatial audio system featuring HRTF, CCS, and a reverberator is integrated with panel speaker and array signal processing technology. This system is intended for a personal computer with a single user, and is capable of rendering sound images positioned arbitrarily around the listener in synchronization with the video image. The block diagram of the integrated system is shown in Fig. 1, wherein the HRTFs position the sound sources and down-mix them into binaural signals, the reverberator simulates room effects, and the CCS cancels cross-talk. Due to space limitations, this paper focuses primarily on the development of the CCS. Our CCS filters are designed using a method that parallels the inverse filtering technique suggested by Kirkeby et al. [15]. However, the present method differs from Kirkeby's method in that a more sophisticated frequency-dependent regularization scheme is employed in the filter synthesis stage [16]. Another distinctive feature of the proposed system is the band-limited implementation using subband filtering. In consideration of the computational load and of robustness against uncertainties in the HRTFs and head movement, the proposed CCS is band-limited to frequencies below 5.5 kHz [17]. To accomplish the band-limited implementation, we adopted a subband filtering technique based on an M-channel quadrature mirror filter (QMF) bank [18]. In this design, the perfect reconstruction (PR) condition is fulfilled.

Instead of using conventional stereo loudspeakers, the proposed CCS is implemented on a panel speaker array. The panel speakers are light, thin, and small, making them well suited for computer multimedia applications. A detailed investigation of panel speakers can be found in Ref. [19]. There are three reasons for using such an array configuration. First, the closely spaced panel speakers provide robustness against head misalignment, and their compactness enables direct placement on a computer monitor. Second, deficiencies associated with panel speakers, such as non-flat frequency response and low bass efficiency, can be corrected by using an array configuration [19]. Third, a wide range of array signal processing techniques can be exploited for beamforming and steering purposes [20]. Array speakers give us more latitude in controlling the sound field in the design of a CCS. A multimedia PC sound card generally has six output channels: one is used for the subwoofer, and the other five can be connected to the 5 × 1 panel speaker array.

[Fig. 1. Block diagram of the integrated system. (A) Spatial audio filters: PCM wave stream (AC-3 or dts) with channels FL, FR, SL, SR, C, and LFE; HRTFs (FL, FR), HRTFs (SL, SR), and HRTFs (C); mixer; binaural signals; reverberator; CCS. (B) Sound card, power amplifier, panel speaker array (#1 to #5), and listener.]

The proposed CCS is applied to multimedia presentation on a personal computer. Numerical simulation and experimental investigation are carried out to justify the proposed spatial audio system. Design issues and technical considerations are discussed.

2. Inverse filtering with Tikhonov regularization

As mentioned previously, a CCS aims to cancel the cross-talk in stereo loudspeaker rendering so that the binaural signals are reproduced at the two ears as they would be by headphones. This can be regarded as the model-matching problem shown in Fig. 2. Here $\mathbf{x}(z)$ is a vector of $U$ program input signals, $\mathbf{u}(z)$ is a vector of $B$ binaural signals, $\mathbf{v}(z)$ is a vector of $S$ speaker input signals, $\mathbf{w}(z)$ is a vector of $R$ reproduced signals, $\mathbf{d}(z)$ is a vector of $R$ desired signals, and $\mathbf{e}(z)$ is a vector of $R$ error signals. $\mathbf{M}(z)$ is an $R \times B$ matrix of the matching model, $\mathbf{H}(z)$ is an $R \times S$ plant transfer matrix, and $\mathbf{C}(z)$ is an $S \times B$ matrix of CCS filters. The term $z^{-m}$ accounts for the modeling delay that ensures causality. For this system, it is straightforward to establish the following relationships:

$$\mathbf{v}(z) = \mathbf{C}(z)\,\mathbf{u}(z), \tag{1}$$

$$\mathbf{w}(z) = \mathbf{H}(z)\,\mathbf{v}(z), \tag{2}$$

$$\mathbf{d}(z) = z^{-m}\,\mathbf{M}(z)\,\mathbf{u}(z), \tag{3}$$

$$\mathbf{e}(z) = \mathbf{d}(z) - \mathbf{w}(z). \tag{4}$$

Fig. 2. The model-matching problem of the CCS. $\mathbf{x}(z)$ is a vector of program input signals, $\mathbf{u}(z)$ is a vector of binaural signals, $\mathbf{v}(z)$ is a vector of speaker input signals, $\mathbf{w}(z)$ is a vector of reproduced signals, $\mathbf{d}(z)$ is a vector of desired signals, and $\mathbf{e}(z)$ is a vector of error signals. $\mathbf{M}(z)$ is the matrix of the matching model, $\mathbf{H}(z)$ is the plant transfer matrix, and $\mathbf{C}(z)$ is the matrix of CCS filters.

Ideal model matching requires that $\mathbf{H}(z)\mathbf{C}(z) = \mathbf{M}(z)$. $\mathbf{H}(z)$ is generally non-invertible because it is usually ill-conditioned and may even be non-square. To overcome this difficulty, we employ Tikhonov regularization [16] in the matrix inversion process. In this method, a frequency-domain cost function $J$ is defined as the sum of the "performance error" $\mathbf{e}^H\mathbf{e}$ and the "input power" $\mathbf{v}^H\mathbf{v}$:

$$J(e^{j\omega}) = \mathbf{e}^H(e^{j\omega})\,\mathbf{e}(e^{j\omega}) + \beta^2(\omega)\,\mathbf{v}^H(e^{j\omega})\,\mathbf{v}(e^{j\omega}). \tag{5}$$

The regularization parameter $\beta(\omega)$ weighs the input power against the performance error. If $\beta$ is too small, there will be sharp peaks in the frequency responses of the CCS filters, whereas if $\beta$ is too large, the cancellation performance will be rather poor. The optimal input $\mathbf{v}_{\mathrm{opt}}(e^{j\omega})$ can be obtained by minimizing $J$:

$$\mathbf{v}_{\mathrm{opt}}(e^{j\omega}) = [\mathbf{H}^H(e^{j\omega})\mathbf{H}(e^{j\omega}) + \beta^2(\omega)\mathbf{I}]^{-1}\,\mathbf{H}^H(e^{j\omega})\,\mathbf{M}(e^{j\omega})\,\mathbf{u}(e^{j\omega}). \tag{6}$$

This solution always exists for $\beta \neq 0$, irrespective of the dimensions and rank of $\mathbf{H}(e^{j\omega})$. Consequently, the CCS matrix can be readily identified as

$$\mathbf{C}(e^{j\omega}) = [\mathbf{H}^H(e^{j\omega})\mathbf{H}(e^{j\omega}) + \beta^2(\omega)\mathbf{I}]^{-1}\,\mathbf{H}^H(e^{j\omega})\,\mathbf{M}(e^{j\omega}). \tag{7}$$

In the case when the desired signals $\mathbf{d}(z)$ are identical to the binaural signals $\mathbf{u}(z)$, the matrix $\mathbf{M}(z)$ is an identity matrix of order $R = B$ and the corresponding optimal filters are given by

$$\mathbf{C}(e^{j\omega}) = [\mathbf{H}^H(e^{j\omega})\mathbf{H}(e^{j\omega}) + \beta^2(\omega)\mathbf{I}]^{-1}\,\mathbf{H}^H(e^{j\omega}). \tag{8}$$

While the expression in Eq. (8) may look similar to that in Ref. [15], there is a distinction in the choice of $\beta(\omega)$. In our approach, the parameter $\beta(\omega)$ is frequency dependent and constrained by a gain threshold applied to $\mathbf{C}(e^{j\omega})$, e.g., 12 dB. This is in contrast to the approach in Ref. [15], where a constant $\beta$ is applied to all frequencies.

Next, the frequency response matrix $\mathbf{C}(e^{j\omega})$ is sampled at $N_c$ equally spaced frequencies:

$$\mathbf{C}(k) = [\mathbf{H}^H(k)\mathbf{H}(k) + \beta^2(k)\mathbf{I}]^{-1}\,\mathbf{H}^H(k), \quad k = 1, 2, \ldots, N_c. \tag{9}$$

The impulse responses of the inverse filters can be obtained by applying the inverse FFT to the frequency samples of Eq. (9), in conjunction with appropriate windowing. To guarantee the causality of the CCS filters, a cyclic shift of the impulse response matrix is generally needed, hence the modeling delay $z^{-m}$ in Fig. 2.
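As an illustration of Eqs. (8) and (9), the following Python sketch computes the regularized inverse filters on an FFT grid and converts them to impulse responses with a cyclic shift. It is a minimal sketch under stated assumptions, not the implementation used in this work: NumPy is assumed, and the function name `design_ccs_filters`, the random plant `h`, the value of `beta`, and the delay `m` are hypothetical. The windowing step is omitted for brevity.

```python
import numpy as np

def design_ccs_filters(H, beta, delay):
    """Regularized inverse filters, Eqs. (8)-(9).

    H     : (Nc, R, S) plant frequency responses on an Nc-point FFT grid
    beta  : scalar or (Nc,) regularization parameter (hypothetical values)
    delay : modeling delay m in samples, applied as a cyclic shift
    Returns real impulse responses of shape (Nc, S, R).
    """
    Nc, R, S = H.shape
    beta = np.broadcast_to(np.asarray(beta, dtype=float), (Nc,))
    Hh = np.conj(np.transpose(H, (0, 2, 1)))            # H^H, shape (Nc, S, R)
    A = Hh @ H + (beta ** 2)[:, None, None] * np.eye(S)  # H^H H + beta^2 I
    C = np.linalg.solve(A, Hh)                           # Eq. (9) for every bin k
    c = np.fft.ifft(C, axis=0).real                      # impulse responses
    return np.roll(c, delay, axis=0)                     # cyclic shift by m samples

# Hypothetical plant: R = 2 ears, S = 5 speakers, random 32-tap impulse responses.
rng = np.random.default_rng(0)
h = rng.standard_normal((32, 2, 5))
Nc, m = 256, 128
H = np.fft.fft(h, n=Nc, axis=0)
c = design_ccs_filters(H, beta=0.05, delay=m)
```

Taking the real part is valid here only because `H` comes from real impulse responses and `beta` is the same at mirrored frequency bins, so `C` is conjugate symmetric.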

3. Band-limited implementation using the multirate approach

Band-limited implementation is chosen in this work for several reasons. First, the computational load of a full-band (0–20 kHz) implementation is too high. For example, for the 5 × 1 panel speaker array considered herein, the CCS contains 10 filters. If each filter has 1024 taps, the convolution requires approximately $10^4$ multiplications and additions per sample interval. Except on a special-purpose DSP engine, real-time implementation of a full-band CCS is usually prohibitive at the sampling rates commonly used in audio processing, e.g., 44.1 or 48 kHz. Second, at high frequencies, the wavelength can be much smaller than the head width. Under this circumstance, the CCS would be extremely susceptible to misalignment of the listener's head and to uncertainties in HRTF modeling. Third, at high frequencies, a listener's head provides natural shadowing for the contralateral paths, which is more robust than direct application of a CCS. Fig. 3(a)–(c) shows the simulated sound pressure distribution obtained from a 6 × 1 speaker array at 5, 6, and 7 kHz, respectively, in the far field. The head position is indicated by a circle at the origin. The axes X and Y indicate the coordinates in cm; the bar represents sound pressure in dB. The null zone shrinks with increasing frequency. For these reasons, the CCS in this study is band-limited to 5.5 kHz (the wavelength at this frequency is approximately 6 cm). To accomplish this, a four-channel QMF bank [18] is employed to divide the audible frequency range into subbands for CCS and direct transmission, respectively.

3.1. Theoretical background

In this section, a brief review of some techniques in multirate signal processing is given in the context of the band-limited CCS design. We begin with two fundamental operations: decimation and interpolation. Fig. 4(a) shows an M-fold decimator that produces the output sequence

$$y_D(n) = y(Mn), \tag{10}$$

where $M$ is an integer. In the frequency domain, it can be shown that the output $Y_D(e^{j\omega})$ and the input $Y(e^{j\omega})$ of the decimator are related by

$$Y_D(e^{j\omega}) = \frac{1}{M}\sum_{k=0}^{M-1} Y\!\left(e^{j(\omega - 2\pi k)/M}\right). \tag{11}$$

On the other hand, Fig. 4(b) shows an L-fold expander that takes the input $x(n)$ and produces the output sequence

$$y_E(n) = \begin{cases} x(n/L) & \text{if } n \text{ is an integer multiple of } L, \\ 0 & \text{otherwise.} \end{cases} \tag{12}$$

In the frequency domain, it can be shown that the output $Y_E(e^{j\omega})$ and the input $X(e^{j\omega})$ of the expander are related by

$$Y_E(e^{j\omega}) = X(e^{j\omega L}). \tag{13}$$

In general, the decimator is preceded by a digital lowpass filter called the decimation filter, and the expander is followed by a digital lowpass filter called the interpolation filter. These play roles similar to those of the anti-aliasing filter and the reconstruction filter in analog signal processing.
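The two building blocks and their frequency-domain relations, Eqs. (10)–(13), can be checked numerically with a short Python sketch. The helper names (`decimate`, `expand`, `dtft`, `alias_sum`) and the test sequence are hypothetical and chosen only for illustration.

```python
import cmath

def decimate(y, M):
    """M-fold decimator of Eq. (10): keep every M-th sample."""
    return y[::M]

def expand(x, L):
    """L-fold expander of Eq. (12): insert L-1 zeros after each sample."""
    out = [0.0] * (len(x) * L)
    out[::L] = x
    return out

def dtft(x, w):
    """Discrete-time Fourier transform of a finite sequence at frequency w."""
    return sum(xn * cmath.exp(-1j * w * n) for n, xn in enumerate(x))

def alias_sum(y, M, w):
    """Right-hand side of Eq. (11): average of M stretched, shifted spectra."""
    return sum(dtft(y, (w - 2 * cmath.pi * k) / M) for k in range(M)) / M

x = [1.0, -2.0, 3.0, 0.5]
w = 0.7
lhs = dtft(expand(x, 3), w)   # Y_E(e^{jw})
rhs = dtft(x, 3 * w)          # X(e^{jwL}), Eq. (13)
```

For a finite sequence both relations hold exactly, so `lhs` and `rhs` agree to rounding error, and `dtft(decimate(y, M), w)` matches `alias_sum(y, M, w)` for any test sequence `y`.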

It was not until the discovery of two concepts, the noble identities and the polyphase representation [18], that multirate signal processing became efficient enough for practical implementation. Fig. 5(a) and (b) depict the idea of the noble identities, wherein the two systems in each case are equivalent. In the polyphase representation, on the other hand, a transfer function $G(z)$ can be decomposed as

$$G(z) = \sum_{n=-\infty}^{\infty} g(nM)\,z^{-nM} + z^{-1}\sum_{n=-\infty}^{\infty} g(nM+1)\,z^{-nM} + \cdots + z^{-(M-1)}\sum_{n=-\infty}^{\infty} g(nM+M-1)\,z^{-nM}. \tag{14}$$

This can be compactly written as

$$G(z) = \sum_{\ell=0}^{M-1} z^{-\ell}E_\ell(z^M), \tag{15}$$

where

$$E_\ell(z) = \sum_{n=-\infty}^{\infty} e_\ell(n)\,z^{-n} \tag{16}$$

with

$$e_\ell(n) \triangleq g(Mn+\ell), \quad 0 \le \ell \le M-1. \tag{17}$$

Eq. (15) is called the Type 1 polyphase representation and $E_\ell(z)$ are the polyphase components of $G(z)$. An alternative way of writing Eq. (14) is the Type 2 polyphase representation

$$G(z) = \sum_{\ell=0}^{M-1} z^{-(M-1-\ell)}R_\ell(z^M). \tag{18}$$

In fact, the Type 2 polyphase components $R_\ell(z)$ are permutations of $E_\ell(z)$, i.e., $R_\ell(z) = E_{M-1-\ell}(z)$.

Fig. 3. The simulation results of sound pressure distribution obtained from a 6 × 1 speaker array in the far field. (a) 5 kHz, (b) 6 kHz, (c) 7 kHz.

Fig. 4. Two building blocks in multirate signal processing. (a) The decimation module, (b) the interpolation module.

[Fig. 5. The noble identities. (a) M-fold decimator with G(z) and G(z^M), (b) L-fold expander with G(z) and G(z^L).]

For example, consider the decimation filter shown in Fig. 4(a) with $M = 2$. Representing $G(z)$ by Eq. (15) leads to the block diagram of Fig. 6(a). By invoking the first noble identity, this can be redrawn as Fig. 6(b). The modified implementation is more efficient than the direct implementation of the filter $G(z)$ because the polyphase components operate at the decimated rate, so no output samples are computed only to be discarded. An alternative structure can be obtained by using the Type 2 polyphase representation, as shown in Fig. 6(c). These representations greatly simplify the computation and permit efficient implementation of decimation and interpolation filters, as will be discussed next in the CCS filter bank design.
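The equivalence between direct and polyphase decimation can be sketched in a few lines of Python. This is an illustrative sketch only: the function names and the test filter `g` are hypothetical, and plain lists are used instead of an optimized DSP library.

```python
def convolve(a, b):
    """Full linear convolution of two finite sequences."""
    if not a or not b:
        return []
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

def decimate_direct(g, x, M):
    """Filter at the full rate, then discard M-1 of every M outputs."""
    return convolve(g, x)[::M]

def decimate_polyphase(g, x, M):
    """Type 1 polyphase form: each branch filters at the decimated rate."""
    n_out = (len(g) + len(x) - 1 + M - 1) // M   # length of the direct result
    y = [0.0] * n_out
    for l in range(M):
        e_l = g[l::M]                            # e_l(n) = g(Mn + l), Eq. (17)
        x_l = ([0.0] * l + list(x))[::M]         # x(Mn - l): delay by l, decimate
        for n, v in enumerate(convolve(e_l, x_l)):
            y[n] += v
    return y

g = [1.0, 2.0, 3.0, 4.0, 5.0]                    # hypothetical FIR decimation filter
x = [1.0, -1.0, 2.0, 0.0, 3.0, 1.0, -2.0]
y_direct = decimate_direct(g, x, 2)
y_poly = decimate_polyphase(g, x, 2)
```

Both routines return the same sequence, but the polyphase version performs roughly 1/M of the multiplications because every product it computes contributes to a retained output sample.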

3.2. M-channel QMF bank

Fig. 7(a) shows the two-channel version of a QMF bank, wherein $G_0(z)$ and $G_1(z)$ are the lowpass and highpass filters shown in Fig. 7(b). In practice, the analysis filters have non-zero transition bandwidth and stop-band magnitude. Consequently, the signals $x_k(n)$ are not perfectly band-limited, and decimating them results in aliasing. The reconstructed signal $\hat{x}(n)$ generally differs from $x(n)$ due to three factors: aliasing, amplitude distortion, and phase distortion. It is possible to design the filters in such a way that these distortions are eliminated.

Fig. 6. The polyphase implementation. (a) Polyphase implementation of a decimation filter, (b) an equivalent implementation of the decimation filter in (a), (c) polyphase implementation of an interpolation filter.

[Fig. 7. The two-channel QMF bank. (a) Analysis bank, decimators, expanders, and synthesis bank, (b) magnitude responses of G0(z) and G1(z).]

From Eqs. (11) and (13), $X(z)$ and $\hat{X}(z)$ are related by

$$\hat{X}(z) = T(z)X(z) + A(z)X(-z), \tag{19}$$

where

$$T(z) = \tfrac{1}{2}\,[G_0(z)F_0(z) + G_1(z)F_1(z)], \tag{20}$$

$$A(z) = \tfrac{1}{2}\,[G_0(-z)F_0(z) + G_1(-z)F_1(z)]. \tag{21}$$

$T(z)$ and $A(z)$ are called the distortion function and the aliasing function, respectively. From Eq. (21), it is clear that aliasing can be cancelled by choosing the synthesis filters such that $A(z)$ is zero:

$$F_0(z) = G_1(-z), \quad F_1(z) = -G_0(-z). \tag{22}$$

With this choice, Eq. (19) becomes

$$\hat{X}(z) = T(z)X(z). \tag{23}$$

Assume that $G_0(z)$ is power symmetric, satisfying

$$\tilde{G}_0(z)G_0(z) + \tilde{G}_0(-z)G_0(-z) = 1, \tag{24}$$

where $\tilde{G}_0(z) = G_0(1/z)$. By "power symmetric," we mean that the zero-phase filter $G(z) = \tilde{G}_0(z)G_0(z)$ is a half-band filter, with $G(e^{j\omega})$ being non-negative. Based on Eqs. (20), (22), and (24), Eq. (23) reduces to $\hat{X}(z) = 0.5\,z^{-N}X(z)$, provided the filter $G_1(z)$ is chosen as

$$G_1(z) = -z^{-N}\tilde{G}_0(-z) \tag{25}$$

for some odd integer $N$. Therefore, a PR system results, and only one filter, $G_0(z)$, needs to be designed.
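The perfect-reconstruction property of the two-channel bank can be verified numerically. The Python sketch below derives $G_1$, $F_0$, and $F_1$ from a power-symmetric lowpass filter according to Eqs. (22) and (25) and runs a signal through the analysis and synthesis banks. The two-tap filter `haar` is an assumption made for illustration (it satisfies Eq. (24) with $N = 1$); it is not the filter designed in this work, and the function names are hypothetical.

```python
def convolve(a, b):
    """Full linear convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

def qmf_from_lowpass(g0):
    """Build G1, F0, F1 from a real, power-symmetric G0 of odd order N."""
    N = len(g0) - 1
    assert N % 2 == 1, "Eq. (25) requires an odd integer N"
    g1 = [(-1) ** k * g0[N - k] for k in range(N + 1)]   # G1(z) = -z^-N G0~(-z)
    f0 = [(-1) ** k * g1[k] for k in range(N + 1)]       # F0(z) = G1(-z)
    f1 = [-((-1) ** k) * g0[k] for k in range(N + 1)]    # F1(z) = -G0(-z)
    return g1, f0, f1

def analysis_synthesis(x, g0):
    """Two-channel QMF bank: filter, decimate by 2, expand by 2, filter, sum."""
    g1, f0, f1 = qmf_from_lowpass(g0)
    total = None
    for g, f in ((g0, f0), (g1, f1)):
        v = convolve(g, x)[::2]          # analysis filter and decimator
        y = [0.0] * (2 * len(v))
        y[::2] = v                       # expander
        branch = convolve(f, y)          # synthesis filter
        total = branch if total is None else [p + q for p, q in zip(total, branch)]
    return total

haar = [0.5, 0.5]                        # assumed G0; power symmetric with N = 1
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x_hat = analysis_synthesis(x, haar)      # equals 0.5 * x delayed by N = 1 sample
```

The output is the input scaled by 0.5 and delayed by $N$ samples, with no aliasing or amplitude distortion, as predicted by $\hat{X}(z) = 0.5\,z^{-N}X(z)$.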

Fig. 8. The block diagram of a four-channel QMF bank. (a) A two-level maximally decimated tree-structured filter bank, (b) the four-channel maximally decimated filter bank, (c) polyphase representation of a four-channel QMF bank.

Consider the structure shown in Fig. 8(a), where a signal is split into two subbands and, after decimation, each subband is again split into two and decimated. The subbands are then combined, two at a time, by using two-channel synthesis filter banks. This system is said to be a maximally decimated tree-structured filter bank [18]. The complete system can be redrawn in the equivalent form shown in Fig. 8(b) by using the noble identities. The resulting filters $G_m(z)$ and $F_m(z)$ can be expressed in terms of the filters $G_m^{(k)}(z)$ and $F_m^{(k)}(z)$ as follows:

$$\begin{aligned}
G_0(z) &= G_0^{(1)}(z)\,G_0^{(2)}(z^2), & G_1(z) &= G_0^{(1)}(z)\,G_1^{(2)}(z^2), \\
G_2(z) &= G_1^{(1)}(z)\,G_0^{(2)}(z^2), & G_3(z) &= G_1^{(1)}(z)\,G_1^{(2)}(z^2), \\
F_0(z) &= F_0^{(1)}(z)\,F_0^{(2)}(z^2), & F_1(z) &= F_0^{(1)}(z)\,F_1^{(2)}(z^2), \\
F_2(z) &= F_1^{(1)}(z)\,F_0^{(2)}(z^2), & F_3(z) &= F_1^{(1)}(z)\,F_1^{(2)}(z^2).
\end{aligned} \tag{26}$$

If the two-channel system is PR, then so is the complete system.
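The construction of the four equivalent filters can be sketched as follows. Each product of the form $G_a^{(1)}(z)\,G_b^{(2)}(z^2)$ is a convolution of the first-level filter with the second-level filter upsampled by two. The two-tap prototype filters are assumptions made for illustration (the filters used in this work are much longer), and the function names are hypothetical.

```python
def convolve(a, b):
    """Full linear convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

def upsample(h, L):
    """Coefficients of H(z^L): insert L-1 zeros between the taps of H(z)."""
    out = [0.0] * ((len(h) - 1) * L + 1)
    out[::L] = h
    return out

# Assumed two-channel PR prototype (N = 1), reused at both tree levels.
g0, g1 = [0.5, 0.5], [0.5, -0.5]       # analysis lowpass / highpass
f0, f1 = [0.5, 0.5], [-0.5, 0.5]       # synthesis filters from Eq. (22)

# Equivalent four-channel filters: index = 2 * (level-1 branch) + (level-2 branch).
G = [convolve(a, upsample(b, 2)) for a in (g0, g1) for b in (g0, g1)]
F = [convolve(a, upsample(b, 2)) for a in (f0, f1) for b in (f0, f1)]

def four_channel_bank(x):
    """Maximally decimated four-channel bank in the form of Fig. 8(b)."""
    total = None
    for g, f in zip(G, F):
        v = convolve(g, x)[::4]          # analysis filter, decimate by 4
        y = [0.0] * (4 * len(v))
        y[::4] = v                       # expand by 4
        branch = convolve(f, y)          # synthesis filter
        total = branch if total is None else [p + q for p, q in zip(total, branch)]
    return total

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
x_hat = four_channel_bank(x)             # equals 0.25 * x delayed by 3 samples
```

With this prototype, each tree level contributes a gain of 0.5, and the delays combine to $3N = 3$ samples, so the output is a scaled and delayed copy of the input, consistent with the statement that a PR two-channel system yields a PR tree.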

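To enhance computational efficiency, the Type 1 polyphase representation is used to express the transfer function $G_k(z)$ in the form

$$G_k(z) = \sum_{\ell=0}^{3} z^{-\ell}E_{k\ell}(z^4), \quad k = 0, 1, 2, 3. \tag{27}$$

In matrix form,

$$\begin{bmatrix} G_0(z) \\ G_1(z) \\ G_2(z) \\ G_3(z) \end{bmatrix} =
\begin{bmatrix}
E_{00}(z^4) & E_{01}(z^4) & E_{02}(z^4) & E_{03}(z^4) \\
E_{10}(z^4) & E_{11}(z^4) & E_{12}(z^4) & E_{13}(z^4) \\
E_{20}(z^4) & E_{21}(z^4) & E_{22}(z^4) & E_{23}(z^4) \\
E_{30}(z^4) & E_{31}(z^4) & E_{32}(z^4) & E_{33}(z^4)
\end{bmatrix}
\begin{bmatrix} 1 \\ z^{-1} \\ z^{-2} \\ z^{-3} \end{bmatrix} \tag{28}$$

or

$$\mathbf{g}(z) = \mathbf{E}(z^4)\,\mathbf{e}(z), \tag{29}$$

where $\mathbf{e}(z) = [1 \;\; z^{-1} \;\; z^{-2} \;\; z^{-3}]^T$.

The synthesis filters can be expressed in a similar manner using the Type 2 polyphase representation

$$F_k(z) = \sum_{\ell=0}^{3} z^{-(3-\ell)}R_{\ell k}(z^4), \quad k = 0, 1, 2, 3. \tag{30}$$

Using matrix notation,

$$\begin{bmatrix} F_0(z) & F_1(z) & F_2(z) & F_3(z) \end{bmatrix} =
\begin{bmatrix} z^{-3} & z^{-2} & z^{-1} & 1 \end{bmatrix}
\begin{bmatrix}
R_{00}(z^4) & R_{01}(z^4) & R_{02}(z^4) & R_{03}(z^4) \\
R_{10}(z^4) & R_{11}(z^4) & R_{12}(z^4) & R_{13}(z^4) \\
R_{20}(z^4) & R_{21}(z^4) & R_{22}(z^4) & R_{23}(z^4) \\
R_{30}(z^4) & R_{31}(z^4) & R_{32}(z^4) & R_{33}(z^4)
\end{bmatrix} \tag{31}$$

or

$$\mathbf{f}^T(z) = z^{-(M-1)}\,\tilde{\mathbf{e}}(z)\,\mathbf{R}(z^4), \tag{32}$$

with $M = 4$ and $\tilde{\mathbf{e}}(z) = [1 \;\; z \;\; z^{2} \;\; z^{3}]$.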

4. Experimental investigations

To validate the proposed integrated spatial audio system, experimental investigations were carried out. The system comprises a 5 × 1 panel speaker array, a power amplifier, a personal computer with an Intel Pentium 4 2.2 GHz processor, and a multi-channel sound card. The system diagram is shown in Fig. 1. The audio signal processing part is based on the HRTF, reverberator, and CCS, as mentioned previously. Our evaluation focuses mainly on the performance of the CCS. The 5 × 1 panel speaker array mounted on the computer monitor serves as the means of audio reproduction. The arrangement of the panel speaker array is depicted in Fig. 9(a) and (b). The size of each rectangular panel is 7 cm × 6.7 cm and the spacing between adjacent speakers is d = 6.7 cm. The panels are made of PU foam with a thickness of 4 mm. Each panel is driven by an electromagnetic exciter affixed to an aluminum frame. The panel speaker array is elevated 10 cm above the manikin's ears. A measuring microphone is fitted inside the manikin's ear. To synchronize the audio and video data streams, Microsoft DirectShow [21] is employed for implementation of the spatial audio system. DirectShow is used here as a software platform for computer multimedia rendering under Microsoft Windows. Fig. 10 shows a photo of the complete experimental arrangement. The experiments were conducted inside an anechoic chamber, as shown in the same figure.

The CCS used in this experiment is based on the band-limited implementation. Frequency division is accomplished using a four-channel QMF bank, as detailed in Section 3. With a sampling rate of 44.1 kHz, the CCS is band-limited to 5.5 kHz. Fig. 11(a) shows the frequency responses of the four-channel QMF bank. Each FIR filter has 160 taps. Fig. 11(b) shows the measured frequency response $\hat{X}(z)/X(z)$ resulting from the implementation according to Fig. 8(c). As is evident from this result, the implemented filter bank is indeed a PR system, since the overall system exhibits constant magnitude and linear phase.

Fig. 9. The panel speaker array. (a) Arrangement of the 5 × 1 panel speaker array, (b) dimensions of the panel speakers and the location of the exciter.

[Fig. 10. Photo of the experimental arrangement (labels: panel speaker array, KEMAR, 80 cm, LCD monitor).]

Prior to the design of the CCS, the frequency responses of the plant were measured. Since the plant has five speaker inputs and two audio outputs at the ear positions, there were 10 frequency responses to be identified. Fig. 12 shows the magnitudes of the measured plant frequency responses in dB within the frequency range 150 Hz–5.5 kHz. In the figure, the first index of the frequency response indicates the number of the panel speaker in Fig. 1, while the second index indicates the left or right ear. Notable structural resonances can be seen in the frequency responses, which is a typical feature of panel speakers. In addition, the gains at frequencies below approximately 1 kHz tend to be somewhat low. For these plant functions, the inverse CCS filters are obtained by using Tikhonov regularization, as detailed in Section 2. Kirkeby et al. [15] applied a constant $\beta$ to all frequencies. This may not adequately address the fact that the condition number of the plant matrix varies drastically with frequency. Instead, we choose in this paper a frequency-dependent $\beta$ under the constraint that the magnitude responses of the inverse filters never exceed 12 dB, so as not to over-drive the loudspeakers. The value of $\beta$ calculated under this constraint is plotted as a function of frequency in Fig. 13. It can be observed from the result that more regularization is applied below 800 Hz than above. Beyond 800 Hz, $\beta$ settles to a constant. This is not surprising, since strong cross-talk exists at low frequencies, as reflected by the ill-conditioned matrix $\mathbf{H}$, and overly large gains can easily result if the inversion is not adequately regularized. With this setting, the resulting frequency responses of the CCS filters are shown in Fig. 14.
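The text does not specify how $\beta$ is computed from the 12 dB constraint. One possible procedure, sketched below in Python, is a bisection at each frequency for the smallest $\beta$ whose inverse filters stay within the gain limit, with a lower bound that plays the role of the constant value observed at higher frequencies. This is an assumption made for illustration only: NumPy is assumed, and the function names, the search range `beta_min` to `beta_max`, and the test matrices are hypothetical.

```python
import numpy as np

def max_gain_db(Hk, beta):
    """Largest element magnitude in dB of C = [H^H H + beta^2 I]^-1 H^H."""
    Hh = Hk.conj().T
    C = np.linalg.solve(Hh @ Hk + beta ** 2 * np.eye(Hk.shape[1]), Hh)
    return 20 * np.log10(np.abs(C).max())

def choose_beta(Hk, limit_db=12.0, beta_min=1e-3, beta_max=10.0, iters=40):
    """Bisection for a beta in [beta_min, beta_max] that meets the gain limit.

    Assumes the limit is met at beta_max and that the gain decreases as beta
    grows; both are assumptions of this sketch, not statements from the paper.
    """
    if max_gain_db(Hk, beta_min) <= limit_db:
        return beta_min                      # floor value is already sufficient
    lo, hi = beta_min, beta_max
    for _ in range(iters):
        mid = np.sqrt(lo * hi)               # bisect on a logarithmic scale
        if max_gain_db(Hk, mid) <= limit_db:
            hi = mid
        else:
            lo = mid
    return hi

# Hypothetical 2 x 5 plant matrices at a single frequency.
H_bad = 0.05 * np.array([[1, 1, 1, 1, 1], [1, 1, 1, 1, 1.01]], dtype=complex)
H_good = 2.0 * np.array([[1, 0, 0, 0, 0], [0, 1, 0, 0, 0]], dtype=complex)
beta_bad = choose_beta(H_bad)                # ill-conditioned: needs extra regularization
beta_good = choose_beta(H_good)              # well-conditioned: returns beta_min
```

Applying `choose_beta` bin by bin would give a frequency-dependent $\beta(k)$ that could be passed to Eq. (9).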

Fig. 11. The four-channel QMF bank. (a) The magnitude and phase of the frequency response of each subband filter, (b) the magnitude and phase of the overall frequency response of the QMF bank, $\hat{X}(z)/X(z)$.

[Fig. 12. Magnitudes of the measured plant frequency responses H1L, H1R, H2L, H2R, H3L, H3R, H4L, H4R, H5L, and H5R. Axes: frequency (Hz), magnitude (dB).]

[Fig. 13. The regularization parameter beta versus frequency. Axes: frequency (Hz), beta.]

Fig. 14. The frequency responses of the CCS filters. The x-axis is the frequency in Hz and the y-axis is the magnitude in dB.

To facilitate the evaluation of the CCS, we further define the channel separation as the ratio of the contralateral and ipsilateral frequency responses [17]:

$$\mathrm{Sep}(j\Omega) = H_{\mathrm{contra}}(j\Omega)/H_{\mathrm{ipsi}}(j\Omega), \tag{33}$$

where $H_{\mathrm{ipsi}}$ and $H_{\mathrm{contra}}$ denote the ipsilateral and contralateral frequency responses, respectively, between the loudspeakers and the ears. The smaller the value of the channel separation, the more effective the cross-talk cancellation. Fig. 15 shows the channel separations in dB for the two ears within the band 150 Hz–20 kHz. The top and bottom panels correspond to the measured separation at the left ear and right ear, respectively. The solid line represents the separation without CCS, i.e., the natural channel separation. The natural channel separation generally exhibits smaller gain at high frequencies than at low frequencies due to the head shadowing effect. The dotted lines represent the results obtained using the proposed band-limited CCS. This experimental result shows that the CCS improves the channel separation significantly in the band 1–5.5 kHz. The maximum channel separation attained is approximately 30 dB. The poor cancellation performance below 1 kHz may be attributed to the strong diffraction effect and the poor loudspeaker response in that frequency range. In addition, the overall performance of the CCS can be better illustrated by examining the matrix product $\mathbf{P} = \mathbf{HC}$, as shown in Fig. 16. The top and bottom panels correspond to the magnitude responses of the matrix $\mathbf{P}$ at the left ear and the right ear, respectively. The solid and dotted lines represent the numerical and experimental results, respectively. As can be seen from these plots, the matrix $\mathbf{P}$ is diagonally dominant with a relatively flat magnitude response throughout the band 1.5–5.5 kHz. Within this control bandwidth, the system attempts to approach the identity matching model used in the CCS design. As an additional benefit of the CCS, the imperfection of the panel speaker response is compensated, rendering better sound quality than the uncompensated system.

In practice, however, it is impossible to achieve perfect cancellation because the plant is non-invertible and the CCS is based on approximate inverse filters. The trend of the experimental results is in good agreement with the numerical simulation, except at high frequencies. The discrepancy between the numerical and experimental results could be due to the poor signal-to-noise ratio at those frequencies.
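The channel separation of Eq. (33) is usually reported in dB. A minimal Python sketch of that conversion is given below; the function name and the sample frequency responses are hypothetical and are not measured data.

```python
import math

def channel_separation_db(h_contra, h_ipsi):
    """Magnitude of Eq. (33) in dB for each frequency bin.

    h_contra, h_ipsi : sequences of complex frequency responses on the same grid.
    More negative values indicate better cross-talk cancellation.
    """
    return [20 * math.log10(abs(c) / abs(i)) for c, i in zip(h_contra, h_ipsi)]

# Hypothetical responses at two frequency bins.
h_ipsi = [1.0 + 0.0j, 0.5j]
h_contra = [0.1 + 0.0j, 0.05 + 0.0j]
sep_db = channel_separation_db(h_contra, h_ipsi)   # about -20 dB in both bins
```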

5. Conclusions

A spatial audio system based on a panel speaker array has been implemented on a personal computer. It is capable of rendering sound images positioned arbitrarily around a listener in synchronization with the video image, providing a useful solution for PC multimedia. Unlike previous systems using stereo loudspeakers, a panel speaker array is employed in this system for its compactness and robustness. The HRTF, reverberator, and CCS are all integrated in one unit. In particular, the CCS is realized by using inverse filtering in conjunction with Tikhonov regularization. A scheme that adjusts the regularization parameter $\beta$ with frequency under a speaker input constraint has been proposed in this paper. Such an approach better suits the frequency-dependent need for regularization. As indicated by the experimental results, the band-limited implementation of the CCS proved effective in cancelling the cross-talk within the control bandwidth, with a reduced computational load.

Two limitations of the present system should be pointed out. First, the computational load remains an issue for audio/video rendering. Although real-time implementation of this system is possible on a Pentium 4 2.2 GHz CPU, it takes up almost 50% of the computational power when all effects (HRTF + reverberator + CCS) are enabled. Second, head misalignment of the listener remains the primary factor that affects the performance as well as the robustness of the CCS, particularly at high frequencies. Methods of either fixed or adaptive type are currently being sought to address this problem. Future research will focus on these aspects to enhance the practicality of the spatial audio system.

Fig. 15. The channel separations in dB for the two ears within the band 150 Hz–20 kHz. The top and bottom panels correspond to the measured separation at the left ear and right ear, respectively (solid line: natural channel separation; dotted line: channel separation obtained using the band-limited CCS).

Acknowledgements

The work was supported by the National Science Council in Taiwan, ROC, under the project number NSC91-2212-E009-032.

Fig. 16. The magnitudes of the frequency responses of the matrix $\mathbf{P} = \mathbf{HC}$ (solid line: numerical result; dotted line: experimental result). (a) Between the left ear and the left binaural signal ($P_{11}$), (b) between the left ear and the right binaural signal ($P_{12}$), (c) between the right ear and the left binaural signal ($P_{21}$), (d) between the right ear and the right binaural signal ($P_{22}$).

References

[1] W.G. Gardner, K.D. Martin, HRTF measurements of a KEMAR, Journal of the Acoustical Society of America 97 (1995) 3907–3908.

[2] V.R. Algazi, R.O. Duda, D.M. Thompson, The CIPIC HRTF database, IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 2001, pp. 99–102.

[3] D.R. Begault, Challenges to the successful implementation of 3-D sound, Journal of the Audio Engineering Society 39 (1990) 864–870.

[4] A. Sibbald, Transaural acoustic crosstalk cancellation, Sensaura White Papers, 1999 (http://www.sensaura.co.uk).

[5] R. Schroeder, B.S. Atal, Computer simulation of sound transmission in rooms, IEEE International Convention Record 7 (1963) 150–155.

[6] P. Damaske, V. Mellert, A procedure for generating directionally accurate sound images in the upper-half space using two loudspeakers, Acustica 22 (1969) 154–162.

[7] D.H. Cooper, Calculator program for head-related transfer functions, Journal of the Audio Engineering Society 30 (1982) 34–38.

[8] W.G. Gardner, Transaural 3D audio, MIT Media Laboratory Technical Report 342, 1995.

[9] D.H. Cooper, J.L. Bauck, Prospects for transaural recording, Journal of the Audio Engineering Society 37 (1989) 3–19.

[10] J.L. Bauck, D.H. Cooper, Generalized transaural stereo and applications, Journal of the Audio Engineering Society 44 (1996) 683–705.

[11] C. Kyriakakis, T. Holman, J.S. Lim, H. Homg, H. Neven, Signal processing, acoustics, and psychoacoustics for high-quality desktop audio, Journal of Visual Communication and Image Representation 9 (1997) 51–61.

[12] C. Kyriakakis, Fundamental and technological limitations of immersive audio systems, Proceedings of the IEEE 86 (1998) 941–951.

[13] T. Takeuchi, P.A. Nelson, Robustness to head misalignment of virtual sound imaging systems, Journal of the Acoustical Society of America 109 (2001) 958–971.

[14] T. Takeuchi, P.A. Nelson, Optimal source distribution for binaural synthesis over loudspeakers, Journal of the Acoustical Society of America 112 (2002) 2786–2797.

[15] O. Kirkeby, P.A. Nelson, H. Hamada, Fast deconvolution of multichannel systems using regularization, IEEE Transactions on Speech and Audio Processing 6 (1998) 189–194.

[16] A. Schuhmacher, J. Hald, Sound source reconstruction using inverse boundary element calculations, Journal of the Acoustical Society of America 113 (2003) 114–127.

[17] W.G. Gardner, 3-D Audio Using Loudspeakers, Kluwer Academic, London, 1998.

[18] P.P. Vaidyanathan, Multirate Systems and Filter Banks, Prentice-Hall, Englewood Cliffs, NJ, 1993.

[19] M.R. Bai, T. Huang, Development of panel loudspeaker system: design evaluation and enhancement, Journal of the Acoustical Society of America 109 (2001) 2751–2761.

[20] D.H. Johnson, D.E. Dudgeon, Array Signal Processing: Concepts and Techniques, Prentice-Hall, Englewood Cliffs, NJ, 1993.