Combining HRTF and ATF - 3D Acoustic Signal Synthesis

Chapter 3 3D Acoustic Signal Synthesis

3.3 Combining HRTF and ATF

Fig. 3.5 Combining ATF and HRTF (a) ATF for Each Separated Signal (b) HRTF for Each Separated Signal (c) 3D Acoustic Signal Synthesis

There are many kinds of blind source separation methods, but completely separating the source signals is difficult in general because the information about the source signals and the mixing system is incomplete. Separation performance may degrade owing to channel noise, room reflections, and violations of the stochastic model assumed for the source signals, which usually differs between speech signals and instrument signals. However, the interference introduced by the other source signals matters less here, because our main purpose in separating the source signals is to synthesize them back together.

With the HRTF database and the ATF-pool, the listener can choose the arrangement of the source signals and the listening position arbitrarily. For example, the listener can place one source signal on the left side and another on the right side, regardless of the original positions of these sources in the room. The spatial impression is presented over headphones by using the HRTF database and the ATF-pool to simulate the user-customized listening scenario. Therefore, each listener can hear the synthesized 3D audio signals at his or her own sweet spot.

For a point without a measured ATF, the ATF is estimated by weighted linear interpolation of the nearby measured ATFs. The same weighted linear interpolation is used for the HRTF when the desired spatial position cannot be found in the HRTF database.
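The interpolation step can be sketched as follows. This is a minimal illustration, not the exact weighting used in this work: the weights are assumed to be inverse distances normalized to sum to one, and the function name `interpolate_atf` and the toy impulse responses are hypothetical.

```python
import numpy as np

def interpolate_atf(target, positions, atfs, eps=1e-9):
    """Estimate the ATF at `target` from ATFs measured at nearby points.

    positions: (K, 3) measurement points in metres.
    atfs:      (K, L) impulse responses (or complex spectra), one per point.
    Assumption: weights are inverse distances normalised to sum to one.
    """
    positions = np.asarray(positions, dtype=float)
    atfs = np.asarray(atfs)
    dist = np.linalg.norm(positions - np.asarray(target, dtype=float), axis=1)
    nearest = int(np.argmin(dist))
    if dist[nearest] < eps:
        # The target coincides with a measured point: use it directly.
        return atfs[nearest].copy()
    weights = 1.0 / dist
    weights /= weights.sum()
    # Weighted linear combination of the K measured responses.
    return weights @ atfs

# Two toy measurements, 1 m apart, with 3-tap responses.
positions = [[0.0, 0.0, 1.5], [1.0, 0.0, 1.5]]
atfs = np.array([[1.0, 0.0, 0.2], [0.0, 1.0, 0.4]])
atf_mid = interpolate_atf([0.5, 0.0, 1.5], positions, atfs)
atf_near = interpolate_atf([0.25, 0.0, 1.5], positions, atfs)
```

At the midpoint both measurements contribute equally, while at 0.25 m from the first point that measurement receives three times the weight of the other. The same routine applies unchanged to HRTF entries indexed by spatial position.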

Let $y_i(t)$ be the separated signal corresponding to the source signal $s_i(t)$,

Fig. 3.6 Zones of Possible Psychoacoustic Spatial Variation for the Separated Signals

Owing to the residual interference in the separated signals, the psychoacoustic spatial impression may be degraded through distorted interaural time difference (ITD) and interaural level difference (ILD) cues. The zone of possible psychoacoustic spatial variation for each source changes with the SIR of each separated signal. The remaining interference associated with the $i$-th separated signal affects the $j$-th separated signal for all $j \neq i$. The subjective performance degradation caused by such interference depends on the resolution of human hearing in azimuth angle, elevation angle, and distance. For a far-field virtual listening point, the distance resolution is less significant owing to human psychoacoustic characteristics, and the azimuth and elevation angles dominate the 3D acoustic impression.
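To make the two cues concrete, the following sketch estimates a broadband ITD and ILD from a binaural signal pair. It is only an illustration under simplifying assumptions: the ITD is taken as the lag of the cross-correlation peak and the ILD as a broadband power ratio, whereas perceptual models usually work per frequency band. The function name `estimate_itd_ild` and the synthetic test signal are hypothetical.

```python
import numpy as np

def estimate_itd_ild(left, right, fs):
    """Return (ITD in seconds, ILD in dB) for a binaural pair.

    Positive ITD means the right ear lags the left ear.
    Positive ILD means the left ear is louder.
    """
    left = np.asarray(left, dtype=float)
    right = np.asarray(right, dtype=float)
    # Full cross-correlation; index len(left) - 1 corresponds to zero lag.
    xcorr = np.correlate(right, left, mode="full")
    lag = int(np.argmax(xcorr)) - (len(left) - 1)
    itd = lag / fs
    ild_db = 10.0 * np.log10(np.sum(left**2) / np.sum(right**2))
    return itd, ild_db

fs = 44100
rng = np.random.default_rng(0)
left = rng.standard_normal(2000)
delay = 20  # samples by which the right ear lags
right = 0.5 * np.concatenate([np.zeros(delay), left[:-delay]])
itd, ild_db = estimate_itd_ild(left, right, fs)
```

For this synthetic pair the right channel is a delayed, half-amplitude copy of the left, so the estimate recovers a 20-sample lag and an ILD of roughly 6 dB. Residual interference in a separated signal perturbs both quantities, which is what widens the zones in Fig. 3.6.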

Chapter 4 Experiment Results

4.1 Descriptions of the Adopted BSS System

We adopt frequency-domain independent component analysis (FD-ICA) in this work, with principal component analysis (PCA) as a preprocessing step for dimension reduction. We choose the Infomax method combined with the natural gradient method because both are popular and simple. The signals are separated in the time-frequency domain and each frequency band is separated individually, so the permutation and scaling problems must be resolved after the ICA process. We solve the permutation problem by combining the DOA approach, the neighboring correlation approach, and the harmonic frequency approach. The scaling problem is solved by using the minimal distortion principle. For the convolutive BSS method, we adopt a least squares optimization technique based on the cross-power-spectrum approach with the gradient descent algorithm. The flow diagram of the overall BSS system is shown in Fig. 4.1 below.

Fig. 4.1 Flow Diagram of the Adopted BSS System
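The scaling fix can be illustrated for a single frequency bin. The sketch below assumes a square, invertible demixing matrix (for example the 2 x 2 matrix that remains after PCA reduces seven channels to two sources) and uses the standard form of the minimal distortion principle, W <- diag(W^-1) W. The function name `mdp_rescale` and the numerical values are hypothetical.

```python
import numpy as np

def mdp_rescale(W):
    """Rescale a square demixing matrix by the minimal distortion principle.

    Each row of W is multiplied by the matching diagonal entry of W^-1,
    which removes the arbitrary per-source scaling left by ICA.
    """
    A = np.linalg.inv(W)              # estimated mixing matrix for this bin
    return np.diag(np.diag(A)) @ W

# A hypothetical demixing matrix for one frequency bin.
W = np.array([[1.0 + 1.0j, 0.5 + 0.0j],
              [0.0 + 0.2j, 2.0 - 1.0j]])
W_mdp = mdp_rescale(W)

# ICA may return any row-scaled version of W; the result should not change.
arbitrary_scales = np.diag([3.0 - 2.0j, 0.1j])
W_mdp_rescaled = mdp_rescale(arbitrary_scales @ W)
```

After rescaling, the inverse of the demixing matrix has a unit diagonal, so each separated signal is expressed as it would be observed at its reference channel, and the result is independent of the scaling that ICA happened to converge to.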

Fig. 4.2 shows the arrangement of the source signals and the microphone array on the X-Y plane. The two source signals are located 3.00 m apart, and the interval between microphones in the array is 0.50 m. The midpoint between the two source signals is 3.00 m away from the center of the seven-microphone array.

Fig. 4.2 Arrangement of the Source Signals and the Microphone Array
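The geometry can be written down numerically. The sketch below is based on assumptions that the text does not state explicitly: the array is taken to be a uniform linear array along the X axis centered at the origin, the two sources are placed symmetrically about the array broadside, and the sound speed is taken as 343 m/s. All coordinates and variable names are hypothetical.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, assumed value

# Seven microphones, 0.50 m apart, centred at the origin on the X axis.
mic_x = (np.arange(7) - 3) * 0.50
mics = np.column_stack([mic_x, np.zeros(7)])

# Two sources 3.00 m apart whose midpoint is 3.00 m from the array centre
# (symmetric placement is an assumption).
sources = np.array([[-1.5, 3.0], [1.5, 3.0]])

# Direction of arrival measured from the array broadside (Y axis), in degrees.
doa_deg = np.degrees(np.arctan2(sources[:, 0], sources[:, 1]))

# Propagation delay from every source to every microphone, shape (2, 7).
dist = np.linalg.norm(sources[:, None, :] - mics[None, :, :], axis=2)
delays = dist / SPEED_OF_SOUND
```

Under these assumptions the sources arrive from about 26.6 degrees on either side of broadside, and the total array aperture is 3.0 m. The per-microphone delays are the quantities that the DOA-based permutation alignment relies on.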

The detailed parameter settings of the BSS system are shown in Table 4.1.

The thresholds th_ and th_U are assigned to ensure that the DOA calculation is reliable, and the threshold th_Ha is adjusted based on the number of sources and the size of the harmonic set. The range K affects the convergence speed of the convolutive BSS method: for a larger K value, it takes more computational time to search for a valid demixing matrix W.

Table 4.1 Settings of the BSS System Parameters

Parameter of the BSS System                  Value
Sampling Frequency                           44.1 kHz
Number of Microphones, M                     7
Number of Sources, N                         2
Length of STFT, T                            8196 pt
Frame Shift of STFT                          128 pt
Window Function                              Hamming
Thresholds of Confident DOA                  th_ 1.5, th_U 10 dB
Distance for Interfrequency Correlations     [garbled in source]
Set of Harmonic Frequencies                  {2f, ..., 3f, ...} [partly garbled in source]
Threshold of Harmonic Correlations, th_Ha    1.2
Learning Rate                                1.0
Number of Iterations                         1000
Nonlinear Function, g(u)                     tanh(G Re(u)) + j tanh(G Im(u))
Gain of Score Function, G                    100
Range of LS Optimization, K                  5

Fig. 4.3 SIR of the Demixing Matrix from No Reflection (NR) Microphone Recordings

Fig. 4.4 SIR of the Demixing Matrix from Perfect Reflector (PR) Microphone Recordings

Fig. 4.5 Averaged SIR of NR and PR

Table 4.2 Source Types in Sequence Numbers

Sequence Number   Sequence Abbreviation   Source 1                   Source 2
1                 f01m01                  Chinese speech, female     Chinese speech, male
2                 instru                  instrument, string 1       instrument, string 2
3                 speech                  Japanese speech, female    Japanese speech, male
4                 winter                  instrument, drums          instrument, piano
5                 wistru                  instrument, string 1       instrument, piano

Five sets of data are processed from start to finish: “f01m01”, “instru”, “speech”, “winter”, and “wistru”. The “f01m01” sequences are two Chinese speech signals of a man and a woman; the “instru” sequences are two string instrument signals; the “speech” sequences are two Japanese speech signals of a man and a woman; the “winter” sequences are instrument signals of drums and a piano; the “wistru” sequences are a string instrument from “instru” and the piano from “winter”. Each of these wave files is about 6.8 seconds long.

The effectiveness of the demixing matrix W can be measured by the SIR values of the microphone array signals. In Fig. 4.3, the SIR of the demixing matrix from the no reflection (NR) recordings shows good performance on average. The sequence numbers correspond to the test sequences listed in Table 4.2. When the wall material changes to perfect reflectors (PR) in Fig. 4.4, the SIR values drop to around 7 dB. In Fig. 4.5, the averaged SIRs of NR are higher than those of PR for all input sequences. The reason is that the reflections turn the purely time-delayed BSS problem into a convolutive one, which disturbs the independence of the source signals.
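As a reminder of how such SIR values are obtained, the sketch below computes the SIR of each separated output as the power ratio between its target component and its interference component. It assumes that both components are available separately, which is possible in simulation by passing each source through the mixing and demixing chain alone. The residual gains and the function name `sir_db` are hypothetical.

```python
import numpy as np

def sir_db(target, interference):
    """Signal-to-interference ratio in dB from the two output components."""
    p_target = np.sum(np.abs(target) ** 2)
    p_interf = np.sum(np.abs(interference) ** 2)
    return 10.0 * np.log10(p_target / p_interf)

rng = np.random.default_rng(0)
s1 = rng.standard_normal(4096)
s2 = rng.standard_normal(4096)

# Hypothetical residual gains of the combined mixing and demixing system:
# gains[i, j] is the gain from source j to separated output i.
gains = np.array([[1.0, 0.1],
                  [0.2, 1.0]])

sir_1 = sir_db(gains[0, 0] * s1, gains[0, 1] * s2)
sir_2 = sir_db(gains[1, 1] * s2, gains[1, 0] * s1)
sir_avg = 0.5 * (sir_1 + sir_2)
```

With equal-power sources, a leakage gain of 0.1 gives about 20 dB and a leakage gain of 0.2 gives about 14 dB. The averaged SIR reported in Fig. 4.5 corresponds to `sir_avg` in this sketch.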

For the fifth sequence “wistru”, the SIR difference between source 1 and source 2 is the largest among the five sequences in both the NR and PR conditions. The explanation comes from the waveforms in Fig. 4.22 (c), (d) and Fig. 4.23 (c), (d). Note that the waveforms and spectrograms were normalized to the interval [-1, 1] for display. Thus, the true amplitudes cannot be read from the waveforms of the source signals, but the mixture signals are clearly dominated by source 2 in the “wistru” sequence.

Owing to the larger true magnitude of source 2 (piano), the interference from source 2 in separated signal 1 is still significant. On the other hand, the interference from source 1 in separated signal 2 is insignificant in terms of the relative power ratio. However, in the two-source case, the relative power ratio cancels in the averaged SIR. Recall that each separated signal is modeled as its target source plus the residual interference from the other source.

Therefore, the averaged SIR of “wistru” falls back into the normal range of the other sequences.
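The cancellation can be made explicit with a short derivation. The following is a sketch under an assumed instantaneous two-source model; the residual gains $g_{ij}$ and the source powers $P_1$, $P_2$ are illustrative symbols, not the notation of the original equation.

```latex
\begin{align}
y_1(t) &= g_{11}\, s_1(t) + g_{12}\, s_2(t), &
y_2(t) &= g_{21}\, s_1(t) + g_{22}\, s_2(t), \\
\mathrm{SIR}_1 &= 10 \log_{10} \frac{|g_{11}|^2 P_1}{|g_{12}|^2 P_2}, &
\mathrm{SIR}_2 &= 10 \log_{10} \frac{|g_{22}|^2 P_2}{|g_{21}|^2 P_1}, \\
\frac{\mathrm{SIR}_1 + \mathrm{SIR}_2}{2}
  &= 5 \log_{10} \frac{|g_{11}|^2\, |g_{22}|^2}{|g_{12}|^2\, |g_{21}|^2}.
\end{align}
```

The ratio $P_1 / P_2$ lowers one SIR and raises the other by the same number of decibels, so it drops out of the average, which then depends only on the residual gains of the separation system.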

Fig. 4.6 Sequence “f01m01” Waveforms in Time Domain (a) Source 1 (b) Source 2

Fig. 4.7 Sequence “f01m01” Waveforms in Time Domain (a) Source 1 (b) Source 2

Fig. 4.8 Sequence “f01m01” Spectrograms in Time-Frequency Domain (a) Source 1 (b) Source 2

Fig. 4.9 Sequence “f01m01” Spectrograms in Time-Frequency Domain (a) Source 1 (b) Source 2

Fig. 4.10 Sequence “instru” Waveforms in Time Domain (a) Source 1 (b) Source 2

Fig. 4.11 Sequence “instru” Waveforms in Time Domain (a) Source 1 (b) Source 2

Fig. 4.12 Sequence “instru” Spectrograms in Time-Frequency Domain (a) Source 1 (b) Source 2

Fig. 4.13 Sequence “instru” Spectrograms in Time-Frequency Domain (a) Source 1 (b) Source 2

Fig. 4.14 Sequence “speech” Waveforms in Time Domain (a) Source 1 (b) Source 2

Fig. 4.15 Sequence “speech” Waveforms in Time Domain (a) Source 1 (b) Source 2

Fig. 4.16 Sequence “speech” Spectrograms in Time-Frequency Domain (a) Source 1 (b) Source 2

Fig. 4.17 Sequence “speech” Spectrograms in Time-Frequency Domain (a) Source 1 (b) Source 2

Fig. 4.18 Sequence “winter” Waveforms in Time Domain (a) Source 1 (b) Source 2

Fig. 4.19 Sequence “winter” Waveforms in Time Domain (a) Source 1 (b) Source 2

Fig. 4.20 Sequence “winter” Spectrograms in Time-Frequency Domain (a) Source 1 (b) Source 2

Fig. 4.21 Sequence “winter” Spectrograms in Time-Frequency Domain (a) Source 1 (b) Source 2

Fig. 4.22 Sequence “wistru” Waveforms in Time Domain (a) Source 1 (b) Source 2

Fig. 4.23 Sequence “wistru” Waveforms in Time Domain (a) Source 1 (b) Source 2

Fig. 4.24 Sequence “wistru” Spectrograms in Time-Frequency Domain (a) Source 1 (b) Source 2

Fig. 4.25 Sequence “wistru” Spectrograms in Time-Frequency Domain (a) Source 1 (b) Source 2

4.2 Virtual Acoustic Environment

4.2.1 Introduction to NASA Sound Lab (SLAB) Software

Fig. 4.26 Snapshot of the 3D Virtual Acoustic Room in SLAB

SLAB is a software-based real-time virtual acoustic environment rendering system developed by the NASA Ames Research Center. This software provides an offline acoustic environment for spatial hearing and psychoacoustic studies. The acoustic scenario parameters in SLAB fall into three main categories: the source, the environment, and the listener. The source parameters include the source locations, the source waveforms, and the radiation pattern and radius of each source. The environment parameters include the sound speed, the air absorption, the surface locations, the room dimensions, and the surface reflections. The listener parameters include the listener location, the HRTF model, and the interaural time difference (ITD). Further specifications of the SLAB software are listed in Tables 4.3 to 4.5.

Table 4.3 Scenario Specifications [25]

Material Filter       First-order IIR Filter

Table 4.4 System Dynamics Specifications [25]

System Dynamics
Sampling Rate         44.1 kHz
Update Rate           120 Hz
Internal Latency      24 msec
FIR Update            Every 64 Samples (1.45 msec)
Delay Line Update     Every Sample (22.7 μsec)

Table 4.5 Numerical Precision Specifications [25]

Numerical Precision
Sound Input / Output  16-bit Integer
Scenario              Double-precision Floating-point
Signal Processing     Single-precision Floating-point

4.3 Wall Material ATF Characteristics

The SLAB software provides seven kinds of wall materials. The ATF spectrum is estimated with a TSP signal, and it changes with the wall material. Zeros are appended to the tail of the time-domain TSP signal (N = 2048, M = 64) in order to observe the reflections from the six wall surfaces. As shown in Fig. 4.27(b), the zero padding introduces some tolerable amplitude distortion.

Fig. 4.27 TSP Signal with Padding Zeros (a) Time Domain (b) Frequency Domain Amplitude
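The measurement idea can be sketched as follows. The exact TSP formula is not given in this section, so the sketch assumes the commonly used quadratic-phase definition, H(k) = exp(-j 4 pi M k^2 / N^2) for 0 <= k <= N/2 with conjugate symmetry, and recovers a toy impulse response by spectral division. The function names, the FFT length of 4096, and the two-tap toy response are hypothetical.

```python
import numpy as np

def tsp(N=2048, M=64):
    """Real TSP signal of length N (assumed quadratic-phase definition)."""
    k = np.arange(N // 2 + 1)
    H = np.exp(-1j * 4.0 * np.pi * M * k**2 / N**2)
    # irfft enforces the conjugate symmetry needed for a real signal.
    return np.fft.irfft(H, n=N)

def estimate_ir(recorded, excitation, n_fft):
    """Estimate an impulse response by dividing zero-padded spectra."""
    X = np.fft.rfft(excitation, n=n_fft)
    Y = np.fft.rfft(recorded, n=n_fft)
    return np.fft.irfft(Y / X, n=n_fft)

x = tsp()
n_fft = 4096  # TSP plus padding zeros, long enough to hold the reflections

# Toy room response: a direct path and one weaker, later reflection.
h_true = np.zeros(512)
h_true[10] = 1.0
h_true[300] = 0.4

y = np.convolve(x, h_true)  # simulated recording, 2559 samples
h_est = estimate_ir(y, x, n_fft)
```

Because the padded FFT is longer than the recording, the linear convolution fits without wrap-around and the division recovers both taps. The spectrum of the zero-padded TSP is no longer exactly flat between the original frequency bins, which is the amplitude distortion visible in Fig. 4.27(b).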

The frequency spectrum characteristics for the seven materials and the no reflection scene are shown in Fig. 4.29 (a) through (h). All the data in Fig. 4.29 are ATFs measured from source 1 (red point) to the virtual listening point at (1.25, 0, 1.5) in the medium room of dimension 10 x 10 x 10 (m). The left column of Fig. 4.29 shows the log10 magnitude in the frequency domain and the right column shows the unwrapped phase. The eight wall conditions are no reflection (NR), perfect reflector (PR), heavy carpet (HC), concrete (Co), heavy glass (HG), gypsum board (GB), wood with airspace (WA), and plaster on metal (PM); the seven materials are shown in Fig. 4.28.

Fig. 4.28 Wall Materials (a) Perfect Reflector (b) Heavy Carpet (c) Concrete (d) Heavy Glass (e) Gypsum Board (f) Wood with Airspace (g) Plaster on Metal

Fig. 4.29 ATF Characteristic with Different Wall Materials, Left: Freq. log10 Magnitude, Right: Unwrapped Phase (a) No Reflection (b) Perfect Reflector (c) Heavy Carpet (d) Concrete (e) Heavy Glass (f) Gypsum Board (g) Wood with Airspace (h) Plaster on Metal

4.4 Demonstrations of 3D Acoustic Signal Synthesis Results

Fig. 4.30 shows the 3D acoustic signal synthesis flow. By dividing the separated signals into segments, we can build the 3D acoustic signal according to the designed HRTF scenario: each segment is filtered with its corresponding ATF and HRTF. The order of ATF filtering and HRTF filtering does not affect the output signal, but it does affect the computational complexity, since HRTF filtering produces a two-channel signal for each input signal.

Fig. 4.30 Flow Diagram of 3D Acoustic Signal Synthesis
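The remark about filter order can be checked numerically. The sketch below uses random placeholder filters rather than measured ATFs or HRIRs, and it treats both stages as time-invariant FIR filters within one segment; the variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
segment = rng.standard_normal(256)                        # one signal segment
atf = rng.standard_normal(64) * np.exp(-np.arange(64) / 16.0)
hrir_left = rng.standard_normal(32)                       # placeholder HRIRs
hrir_right = rng.standard_normal(32)

# Order A: ATF first (1 convolution), then both HRIRs (2 convolutions).
after_atf = np.convolve(segment, atf)
out_a = np.stack([np.convolve(after_atf, hrir_left),
                  np.convolve(after_atf, hrir_right)])

# Order B: both HRIRs first (2 convolutions), then the ATF on each
# of the two channels (2 more convolutions).
out_b = np.stack([np.convolve(np.convolve(segment, hrir_left), atf),
                  np.convolve(np.convolve(segment, hrir_right), atf)])

same_output = np.allclose(out_a, out_b)
```

Both orders give the same two-channel output because convolution is commutative and associative, but order A needs three convolutions per segment and order B needs four, so applying the ATF before the HRTF is the cheaper choice.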

For each sequence, we provide three kinds of waveforms: the SLAB synthesis waveform, the HRTF+ATF waveform from the original source signals, and the HRTF+ATF waveform from the separated signals.

The demonstrations show two kinds of HRTF scenarios. The first scenario, shown in Fig. 4.31, has 25 frames with a frame interval of about 0.5 second. The second scenario, shown in Fig. 4.32, has 27 frames, also with a frame interval of about 0.5 second. The red point represents source 1, the green point represents source 2, and the blue and red parts of the headphone represent the left and right ears of the HRTF, respectively.

Fig. 4.31 HRTF Scenario 1, 25 Frames, Frame Interval ≈ 0.5 sec, Red: Source 1, Green: Source 2 (a) Frame 1 (b) Frame 5 (c) Frame 10 (d) Frame 15 (e) Frame 20 (f) Frame 25

Fig. 4.32 HRTF Scenario 2, 27 Frames, Frame Interval ≈ 0.5 sec, Red: Source 1, Green: Source 2 (a) Frame 1 (b) Frame 5 (c) Frame 8 (d) Frame 13 (e) Frame 18 (f) Frame 21 (g) Frame 27

To make the effect of the ATF more noticeable, we demonstrate the 3D acoustic signals for three different room sizes: a large room of 20 x 20 x 20 (m), a medium room of 10 x 10 x 10 (m), and a small room of 4 x 4 x 4 (m), which are shown in Fig. 4.33.

Fig. 4.33 (a) Large Room (b) Medium Room (c) Small Room

From Fig. 4.34 to Fig. 4.41, we can observe the effects of the ATF on the waveforms and the spectrograms. Comparing panels (a) and (c) shows that the ATFs change the waveforms of the separated signals; the difference is subtle without reflection (NR), but it is visible with perfect reflectors (PR) as the wall material in the three room sizes (small, medium, large). The effect of room size on the ATFs can be observed in (f). The longer the reverberation time, the faster the spectrum changes between adjacent frequencies. This is because summing differently delayed copies of a signal in the time domain produces magnitude variation in the frequency domain.
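The relation between delays and spectral ripple can be written out as follows. This is a sketch under the assumption that the ATF is a sum of scaled, delayed impulses; the real gains $a_k$ are illustrative symbols, while $t_k$ and $t_m$ denote the arrival times of two paths.

```latex
\begin{align}
h(t) &= \sum_{k} a_k\, \delta(t - t_k), \\
|H(\omega)|^2 &= \Bigl|\sum_{k} a_k\, e^{-j\omega t_k}\Bigr|^2
  = \sum_{k} a_k^2 + 2 \sum_{k < m} a_k a_m \cos\bigl(\omega\,(t_k - t_m)\bigr).
\end{align}
```

Each cosine term repeats along the frequency axis with period $2\pi / |t_k - t_m|$, so a pair of paths with a larger arrival-time difference contributes a more rapidly oscillating ripple to the magnitude spectrum.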

Therefore, a larger room produces larger arrival-time differences $t_k - t_m$ between reflection paths, which cause faster oscillation of the spectrum. By comparing the spectrograms in (e) with those in (b), we are able to see some blue slices at the frequencies with lower spectrum magnitudes in (f). After the HRTF filtering, the interaural level difference (ILD) is noticeable in (d) and is related to the HRTF azimuth angle. For signals placed 45° toward the left, the left channel amplitude is much larger than the right one; on the other hand, for those placed 45° toward the right, the right channel amplitude is larger than the left one.

Fig. 4.34 “f01m01”, Separated Signal 1, NR, HRTF at 45° (a) Separated Signal in Time Domain (b) Separated Signal in Time-Frequency Domain (c) After ATF in Time Domain (d) After HRTF in Time Domain