NOVEL BINAURAL SPECTRO-TEMPORAL ALGORITHM FOR

SPEECH ENHANCEMENT IN LOW SNR ENVIRONMENTS

Po-Hsun Sung, Bo-Wei Chen, Ling-Sheng Jang and Jhing-Fa Wang

Department of Electrical Engineering, National Cheng Kung University, Tainan City, Taiwan, R.O.C.

ABSTRACT

A novel BInaural Spectro‐Temporal (BIST) algorithm is proposed in this paper to increase speech intelligibility in low or negative SNR noisy environments. The BIST algorithm consists of two modules: binaural auditory processing, which receives sound from a specific direction, and a spectro‐temporal modulation filter for noise reduction. Most speech enhancement algorithms are not applicable in harsh environments because the energy of the speech is overwhelmed by the noise. To increase speech intelligibility in such environments, a distinctive approach is proposed. First, the BIST algorithm uses binaural auditory processing as a spatial mask to separate the speech and noise according to their locations. Next, the modulation filter is applied to reduce the noise in the scale‐rate (spectro‐temporal modulation) domain according to the different acoustic features of speech and noise. The filter works like the spectro‐temporal receptive field (STRF), which is the perceptual response of the human auditory cortex. The experimental results demonstrate that the proposed BIST algorithm can improve speech intelligibility by 20%.

BINAURAL AUDITORY PROCESSING

Figure 2. Function block diagram of binaural auditory processing

EXPERIMENTAL RESULTS

In our experiment, the clean speech and noise are played via two loudspeakers in front of the microphone array (condition 1: a single microphone; condition 2: an artificial head). One loudspeaker is at 0°, and the other is at θ = 90°. The distance between the loudspeakers and the microphone array is 1 m.

[Cochleagram panels, time (ms) vs. frequency (Hz): (a) Noisy speech at SNR −10 dB; (b) Enhanced by spectral subtraction]

SPECTRO-TEMPORAL MODULATION FILTERING
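The scale‐rate (spectro‐temporal modulation) filtering idea can be illustrated with a minimal sketch. It is not the poster's implementation: it assumes a plain 2‐D Fourier transform of a log‐frequency spectrogram as a stand‐in for the cortical (STRF‐like) analysis, and the function name `scale_rate_filter`, the cutoff values, and the synthetic test pattern are all hypothetical. The sketch keeps only slow temporal modulations (low rate, in Hz) and coarse spectral modulations (low scale, in cycles/octave), which is where speech energy is usually assumed to concentrate.

```python
import numpy as np

def scale_rate_filter(spec, frame_rate_hz, chans_per_octave,
                      max_rate_hz, max_scale_cpo):
    """Low-pass a (channels x frames) log-frequency spectrogram in the
    scale-rate domain (assumption: 2-D FFT as the modulation analysis)."""
    n_chan, n_frames = spec.shape
    # 2-D FFT: axis 0 maps to scale (cycles/octave), axis 1 to rate (Hz).
    modulation = np.fft.fft2(spec)
    scales = np.fft.fftfreq(n_chan, d=1.0 / chans_per_octave)
    rates = np.fft.fftfreq(n_frames, d=1.0 / frame_rate_hz)
    # Binary modulation mask: keep low |scale| and low |rate| only.
    keep = (np.abs(scales)[:, None] <= max_scale_cpo) & \
           (np.abs(rates)[None, :] <= max_rate_hz)
    filtered = np.fft.ifft2(modulation * keep).real
    return filtered, np.abs(modulation), scales, rates

# Synthetic example (hypothetical values): 4 octaves, 1 s of frames.
frame_rate_hz = 200.0
chans_per_octave = 12
t = np.arange(200) / frame_rate_hz        # frame times in seconds
x = np.arange(48) / chans_per_octave      # channel positions in octaves

# "Speech-like" ripple: 4 Hz rate, 1 cycle/octave scale.
slow = np.cos(2 * np.pi * (4.0 * t[None, :] + 1.0 * x[:, None]))
# "Noise-like" flutter: 50 Hz rate, identical in every channel.
fast = 0.8 * np.cos(2 * np.pi * 50.0 * t)[None, :] * np.ones((48, 1))
spec = slow + fast

filtered, magnitude, scales, rates = scale_rate_filter(
    spec, frame_rate_hz, chans_per_octave,
    max_rate_hz=32.0, max_scale_cpo=4.0)

# Residual between the filtered pattern and the slow ripple alone.
print(np.max(np.abs(filtered - slow)))
```

In this toy case the 50 Hz flutter falls outside the rate cutoff and is removed, while the slow ripple passes unchanged. A real system would replace the synthetic pattern with an auditory spectrogram and would likely use a smoother, data‐dependent mask.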

Figure 3. Function block diagram of spectro‐temporal modulation filtering

Figure 4. Representation of clean speech [panel axes: time (ms) vs. frequency (Hz); rate (Hz) vs. scale (cyc/oct)]

Figure 5. Noise and enhanced signal filtered by low‐pass temporal modulation [plot title: Processing of Temporal Modulation Filter; traces: original, filtered; axes: sample points vs. normalized amplitude]

Figure 8. Cochleagram of different signals [panels, time (ms) vs. frequency (Hz): (c) Enhanced after spatial mask; (d) Enhanced by BIST filtering algorithm]

INTRODUCTION

• Cocktail Party Effect

• Speech with heavy noise remains a difficult issue to solve because it is hard to separate the mixture of speech and noise efficiently.

• Humans can pay attention to a particular target while filtering out other sound sources, even in adverse listening environments.

• The BInaural Spectro‐Temporal (BIST) filtering algorithm is proposed to solve this issue.

In this paper, the clean speech is extracted from noisy sound according to its location and acoustic characteristics, step by step. The binaural auditory processing functions as a spatial mask that separates the target source and the noises based on their directions. Regarding the pattern of the sound source, a transformation in the representation of sound is found in mammalian auditory systems, called spectro‐temporal receptive fields (STRFs). STRFs represent the transformation of neuron responses in the auditory cortex when the sound wave arrives. This transformation decomposes the content of the auditory spectrogram into the scale‐rate domain. Scale refers to spectral modulation, in units of cycles/octave (or cycles/kHz), and rate refers to temporal modulation, in units of cycles/second (Hz). This multi‐domain representation model of speech has proven useful in estimating speech intelligibility. Hence, speech and noise can be separated by their different modulation characteristics in the scale‐rate domain. The spectro‐temporal modulation filter can be used as a mask to remove the noise pattern from noisy speech to get cleaner speech.

Summary

We have demonstrated a novel approach to improve speech intelligibility in negative SNR environments. In this method, the integration of binaural auditory processing and auditory cortical processing mimics the mechanism of human auditory perception. The results show that using binaural hearing as a spatial mask at the first stage helps reduce the complexity of multiple sound sources. Furthermore, the speech can be extracted from the noisy signal based on their distinct spectro‐temporal modulation patterns. The objective tests demonstrated the effectiveness of this method in speech enhancement. However, the effect of reverberation and the test set of noise sources are still not comprehensive enough in this study. Besides, the design of an optimal modulation filter applicable to any situation is an important step for future study. In summary, the placement of the sensor ends and the implementation of the BIST algorithm in real‐time processing are important for increasing the robustness of human‐computer interfaces.

EXPERIMENTAL SETUP

[Diagram label: binaural sound input]
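For readers who want to reproduce a negative‐SNR test condition offline, the following is a minimal sketch of how a noise signal could be scaled so that the mixture reaches a chosen SNR, for example −10 dB as in the noisy‐speech cochleagram. This is an assumption for illustration: the poster describes acoustic playback through loudspeakers, not digital mixing, and the helper name `mix_at_snr` and the stand‐in signals are hypothetical.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that speech power / noise power equals snr_db,
    then add it to `speech`. Returns (mixture, scaled_noise)."""
    speech = np.asarray(speech, dtype=float)
    noise = np.asarray(noise, dtype=float)
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as speech")
    noise = noise[:len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    # SNR(dB) = 10*log10(p_speech / (gain^2 * p_noise))  ->  solve for gain.
    gain = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    scaled_noise = gain * noise
    return speech + scaled_noise, scaled_noise

# Stand-in signals (hypothetical): an amplitude-modulated tone and white noise.
fs = 16000
t = np.arange(fs) / fs
speech = np.sin(2 * np.pi * 220.0 * t) * (1.0 + 0.5 * np.sin(2 * np.pi * 4.0 * t))
noise = np.random.default_rng(0).standard_normal(fs)

noisy, scaled_noise = mix_at_snr(speech, noise, snr_db=-10.0)
measured_snr_db = 10.0 * np.log10(np.mean(speech ** 2) / np.mean(scaled_noise ** 2))
print(round(measured_snr_db, 2))
```

At −10 dB the noise carries ten times the power of the speech, which is the regime in which the speech energy is overwhelmed by the noise.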

[Setup diagram labels: noise source; signal source; D = 1 m; omni‐directional microphone]

Figure 6. Binaural recording in an anechoic room and a real environment [panels: (a) Anechoic room; (b) Real environment]

FUNCTION BLOCK DIAGRAM

• Binaural auditory processing is used for enhancing sound from a specific direction. It can be realized by spatial mask processing.

• Spectro‐temporal modulation filtering is used for removing background noise from mixed signals. It is based on the cortical representation in the rate‐scale domain.
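One way a direction‐selective spatial mask could be realized is sketched below. The poster does not specify how its mask is computed, so everything here is an assumption: a binary time‐frequency mask derived from the interaural phase difference, a target assumed to lie straight ahead (interaural time difference near zero), and hypothetical names and parameter values (`stft`, `spatial_mask`, `max_itd_s`, the tone frequencies). The inverse transform back to a waveform is omitted for brevity.

```python
import numpy as np

def stft(x, n_fft=512, hop=128):
    """Short-time Fourier transform with a Hann window; returns (frames, bins)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

def spatial_mask(left, right, fs, max_itd_s=1e-4, n_fft=512, hop=128):
    """Binary mask that keeps time-frequency bins whose estimated interaural
    time difference (ITD) is close to zero, i.e. sound from the front."""
    spec_l = stft(left, n_fft, hop)
    spec_r = stft(right, n_fft, hop)
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)
    # Interaural phase difference per bin, wrapped to (-pi, pi].
    ipd = np.angle(spec_l * np.conj(spec_r))
    # Convert phase to time; the DC bin has no defined ITD and is left at 0.
    itd = np.zeros_like(ipd)
    itd[:, 1:] = ipd[:, 1:] / (2.0 * np.pi * freqs[1:])
    mask = np.abs(itd) <= max_itd_s
    masked = 0.5 * (spec_l + spec_r) * mask
    return mask, masked

# Toy scene (hypothetical): a frontal 500 Hz target that is identical at both
# ears, plus a lateral 700 Hz interferer that reaches the right ear 0.5 ms late.
fs = 16000
t = np.arange(fs) / fs
target = np.sin(2 * np.pi * 500.0 * t)
interf_left = np.sin(2 * np.pi * 700.0 * t)
interf_right = np.sin(2 * np.pi * 700.0 * (t - 0.0005))

mask, masked = spatial_mask(target + interf_left, target + interf_right, fs)

target_bin = int(round(500.0 * 512 / fs))   # STFT bin nearest 500 Hz
interf_bin = int(round(700.0 * 512 / fs))   # STFT bin nearest 700 Hz
print(mask[:, target_bin].mean(), mask[:, interf_bin].mean())
```

The printed values are the fractions of frames kept in the target and interferer bins. Phase‐based ITD estimates become ambiguous at high frequencies because the phase wraps, so a practical mask would need to handle that range separately.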

[Setup diagram labels: uni‐directional microphone; D = 1 to 3 m; θ_R = 0° to 90°; θ_L]

Contact Information

Peter Po‐Hsun Sung

EECS, National Cheng Kung University