Chapter 3 Auditory Model and Features
3.3 Cortical Module and Spectro-temporal Modulation Filtering
The second, cortical module is inspired by the neural responses of the primary auditory cortex (A1) to different spectro-temporal variations. These variations are encoded by two parameters: rate and scale. The rate (or velocity) parameter ω, in Hz, describes how fast the signal’s energy varies along the temporal axis. The scale (or density) parameter Ω, in cycle/octave, characterizes how broadly the signal’s energy is distributed along the log-frequency axis. In addition, cortical neurons are selective to the FM sweeping direction (upward or downward), which this module represents by the sign of the rate parameter (positive for downward and negative for upward sweeps).
To derive the spectro-temporal impulse responses of neurons in A1, moving ripple stimuli, which are the basis functions of the two-dimensional spectro-temporal domain, are used to drive the cortex. Figure 3-6 shows an example of a moving ripple stimulus with rate = +4 Hz and scale = 0.5 cycle/octave. Each neuron in A1 has its own impulse response, which represents its preferred spectro-temporal pattern in the input spectrogram and is modeled by a 2D filter. To sum up, the first, cochlear module of the auditory model produces a two-dimensional auditory spectrogram full of spectro-temporal amplitude modulations. The second, cortical module then analyzes the auditory spectrogram with a bank of two-dimensional filters tuned to different spectro-temporal modulation parameters. Figure 3-7 demonstrates the 2D cortical filtering of a sample spectrogram by eight model A1 neurons. The small top panel of each subplot is the impulse response of a typical neuron tuned to a slow or fast rate and a coarse or fine scale. The bottom panels are the envelopes (local energies) of the outputs of these 2D spectro-temporal filters.
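As a rough illustration of a moving ripple, the sketch below generates a ripple envelope on a time by log-frequency grid. The sinusoidal form 1 + depth·sin(2π(ωt + Ωx)), the function name `moving_ripple`, the 125 Hz frame rate, the 24 channels per octave and the modulation depth are all assumptions made for this example and are not taken from the text. With this form, a positive rate moves the ripple crests toward lower frequencies over time, which matches the sign convention stated above.

```python
import numpy as np

def moving_ripple(rate_hz, scale_cpo, duration_s=1.0, frame_rate=125.0,
                  n_octaves=5.0, chan_per_oct=24, depth=0.9):
    """Return (t, x, S): S[i, j] is the ripple envelope at time t[i] (s)
    and log-frequency x[j] (octaves above the lowest channel)."""
    t = np.arange(int(round(duration_s * frame_rate))) / frame_rate
    x = np.arange(int(round(n_octaves * chan_per_oct))) / chan_per_oct
    # Phase grows with time and with log-frequency; the sign of rate_hz
    # sets the drift direction (positive: downward, negative: upward).
    phase = 2.0 * np.pi * (rate_hz * t[:, None] + scale_cpo * x[None, :])
    return t, x, 1.0 + depth * np.sin(phase)

# Example matching the parameters quoted for Figure 3-6 (assumed grid).
t, x, S = moving_ripple(rate_hz=4.0, scale_cpo=0.5)
```

At 0.5 cycle/octave the pattern repeats every 2 octaves along the log-frequency axis, and at 4 Hz it repeats every 250 ms along the time axis.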
The four-dimensional output r(t, f, ω, Ω) of this module can be formulated as:

$$ r(t, f, \omega, \Omega) = y_4(t, f) \ast_{tf} \mathrm{STIR}(t, f; \omega, \Omega) \qquad \text{(3-7)} $$

where STIR(t, f; ω, Ω) is the spectro-temporal impulse response of the two-dimensional filter tuned to ω and Ω, and ∗tf is the two-dimensional convolution along the time and log-frequency axes.

FIGURE 3-6
An example of a moving ripple stimulus. (Auditory Model Handout, AAIP)

FIGURE 3-7
The responses of 8 modeled neurons in the cortex. (Auditory Model Handout, AAIP)

The local energy of the four-dimensional output is then computed as:
$$ E(t, f, \omega, \Omega) = r(t, f, \omega, \Omega) + j\,H[r(t, f, \omega, \Omega)] \qquad \text{(3-8)} $$

where H[·] is the Hilbert transform along the log-frequency axis. Therefore, for any fixed t-f point in the auditory spectrogram, E(ω, Ω; t, f), which is referred to as the rate-scale representation, records the energies of local modulations at different combinations of rate, scale and directionality. In Figure 3-8, the left panel shows an auditory spectrogram and the right panels show the rate-scale representations of the two points marked by ‘x’ in the spectrogram. The local modulations at these two points are dominated by (8 Hz, 4 cycle/octave, upward) and (8~16 Hz, 2~4 cycle/octave, downward), respectively.

In summary, the early cochlear module estimates a two-dimensional auditory spectrogram from a one-dimensional acoustic signal. The second, cortical module analyzes the amplitude modulations of the 2D auditory spectrogram in the rate-scale-directionality parameter space. A much more detailed description, the mathematical formulation and output examples of these two modules can be found in [26].
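The filtering and local-energy steps of equations (3-7) and (3-8) can be sketched as follows. This is only a minimal illustration under several assumptions: the Gabor-shaped kernel `gabor_stir` is a hypothetical stand-in for the actual STIR (it is non-causal and its envelope widths are chosen arbitrarily), the grid uses an assumed 125 Hz frame rate and 24 channels per octave, and the magnitude of the analytic signal is taken as the envelope, following the description of local energies as envelopes. The array layout is time along axis 0 and log-frequency along axis 1.

```python
import numpy as np
from scipy.signal import fftconvolve, hilbert

def gabor_stir(rate_hz, scale_cpo, frame_rate=125.0, chan_per_oct=24):
    """Hypothetical Gabor-shaped STIR (not the model's actual kernel).
    Rows are time lags, columns are log-frequency lags."""
    sigma_t = 0.5 / abs(rate_hz)   # temporal envelope width in seconds (assumed)
    sigma_x = 0.5 / scale_cpo      # spectral envelope width in octaves (assumed)
    n_t = int(np.ceil(3 * sigma_t * frame_rate))
    n_x = int(np.ceil(3 * sigma_x * chan_per_oct))
    t = np.arange(-n_t, n_t + 1) / frame_rate
    x = np.arange(-n_x, n_x + 1) / chan_per_oct
    envelope = np.exp(-0.5 * (t[:, None] / sigma_t) ** 2
                      - 0.5 * (x[None, :] / sigma_x) ** 2)
    # The sign of rate_hz selects the direction: positive means downward.
    carrier = np.cos(2.0 * np.pi * (rate_hz * t[:, None] + scale_cpo * x[None, :]))
    stir = envelope * carrier
    stir -= stir.mean()                        # zero-mean kernel
    return stir / np.sqrt(np.sum(stir ** 2))   # unit-energy kernel

def cortical_response(y4, stir):
    """Equation (3-7): 2D convolution over time (axis 0) and log-frequency (axis 1)."""
    return fftconvolve(y4, stir, mode="same")

def local_energy(r):
    """Analytic signal r + j*H[r] along log-frequency, as in equation (3-8),
    followed by a magnitude to obtain the envelope (assumed reading)."""
    analytic = hilbert(r, axis=1)   # scipy.signal.hilbert returns r + j*H[r]
    return np.abs(analytic)
```

Evaluating `local_energy(cortical_response(y4, gabor_stir(rate, scale)))` over a grid of rates (both signs) and scales, and reading the result at one t-f point, gives a rate-scale pattern in the sense described above.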
FIGURE 3-8
Rate-scale representation from the A1 module. (Left panel: Auditory Spectrogram; axes: Time (ms), Frequency (Hz).)

It is known that human hearing analyzes not only the spectral content but also the temporal behavior of sound. In our auditory model, this ability is characterized by the joint spectro-temporal modulation analysis performed by the second, cortical module. In addition to the spectral content estimated in the first, cochlear module, certain high-level features, such as speaking rate and FM sweeping directions, are captured by the cortical module. It has been shown that joint spectro-temporal modulations below 16 Hz and 8 cycle/octave preserve the intelligibility of speech well [31]. Not surprisingly, as shown in [32], the long-term averaged rate-scale energy pattern of speech falls roughly within these ranges. On the other hand, the rate-scale patterns of noises differ from those of speech, indicating that speech and noises carry different high-level information. For example, Figure 3-9 shows the auditory spectrograms ((a), (b)) and rate-scale energy representations ((c), (d)) of clean speech and white noise. The figure demonstrates that most of the spectro-temporal modulations of speech lie within rate = 2-16 Hz and scale = 0.5-8 cycle/octave, while those of white noise are dominated by high rates and high scales.
FIGURE 3-9
Auditory spectrograms of (a) clean speech ("Come home right away.") and (b) white noise; axes: Time (ms), Frequency (Hz). Rate-scale representations (with rate and scale on the x- and y-axes) of (c) clean speech and (d) white noise.
Accordingly, a noise suppression algorithm based on joint spectro-temporal modulation filtering (STMF) is proposed in [29]. For an input noisy speech signal, only the spectro-temporal modulations within 2~32 Hz and 0.5~8 cycle/octave are kept in the STMF process, and a cleaner spectrogram is generated:
$$ y_5(t, f) = \sum_{\pm 2 \le \omega \le \pm 32,\; 0.5 \le \Omega \le 8} r(t, f, \omega, \Omega) \ast_{tf} \mathrm{STIR}_1(-t, -f; \omega, \Omega) \qquad \text{(3-9)} $$

where STIR_1(t, f; ω, Ω) is the normalized version of STIR(t, f; ω, Ω).
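Since equation (3-9) filters each band and then convolves with the time- and frequency-reversed normalized response, its net effect is a band-pass operation on the 2D modulation content of the spectrogram. The sketch below imitates that effect with an ideal (brick-wall) mask in the 2D Fourier domain. This swaps the filter bank of [29] for a much simpler technique and is not the method used there; the function name `stmf_bandpass`, the 125 Hz frame rate and the 24 channels per octave are assumptions for illustration. Time is along axis 0 and log-frequency along axis 1.

```python
import numpy as np

def stmf_bandpass(y4, frame_rate=125.0, chan_per_oct=24,
                  rate_band=(2.0, 32.0), scale_band=(0.5, 8.0)):
    """Keep only modulations with |rate| and |scale| inside the given bands,
    using an ideal mask on the 2D FFT (a simplification of equation (3-9))."""
    n_t, n_x = y4.shape
    rates = np.fft.fftfreq(n_t, d=1.0 / frame_rate)[:, None]     # Hz
    scales = np.fft.fftfreq(n_x, d=1.0 / chan_per_oct)[None, :]  # cycle/octave
    # Using absolute values keeps both upward and downward directions.
    keep = ((np.abs(rates) >= rate_band[0]) & (np.abs(rates) <= rate_band[1]) &
            (np.abs(scales) >= scale_band[0]) & (np.abs(scales) <= scale_band[1]))
    return np.fft.ifft2(np.fft.fft2(y4) * keep).real
```

Because the mask excludes the zero-rate and zero-scale components, the output of this sketch has no constant offset and can take negative values.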
Figure 3-10 illustrates the procedure of our STMF noise suppression algorithm. The noisy auditory spectrogram is first passed through the STMF process. Then, a simple threshold δ (a certain percentile of the maximum value of the cleaned spectrogram) is used to separate speech from non-speech regions in the cleaned spectrogram. The threshold δ controls the trade-off between speech distortion and noise suppression. Finally, an α-1 template (α for non-speech regions and 1 for speech regions) is generated and multiplied with the original noisy spectrogram to produce a noise-suppressed spectrogram. ACCs are then derived from the noise-suppressed spectrogram for our speaker recognition simulations.
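The thresholding and template steps can be sketched as below. The sketch reads δ as a fixed fraction of the maximum of the cleaned spectrogram, which is one interpretation of the description above; the function name `alpha_one_template` and the default values of `delta_fraction` and `alpha` are hypothetical and are not taken from the text.

```python
import numpy as np

def alpha_one_template(y5, noisy, delta_fraction=0.1, alpha=0.1):
    """Build the alpha-1 template from the cleaned spectrogram y5 and
    apply it to the original noisy spectrogram (same shape as y5)."""
    delta = delta_fraction * y5.max()            # threshold (assumed form)
    template = np.where(y5 >= delta, 1.0, alpha) # 1: speech, alpha: non-speech
    return template * noisy, template
```

A larger `delta_fraction` labels more of the spectrogram as non-speech, which suppresses more noise at the cost of more speech distortion, while `alpha` sets how strongly the non-speech regions are attenuated.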
Figure panels: "Noisy speech in car noise with 5 dB SNR" and "Auditory spectrogram after STMF (y5)"; axes: Time (ms), Frequency (Hz).