Chapter 2 Literature Review, Related Work and Background Knowledge
2.2 Auditory Model
The proposed auditory features were extracted from the stages of an auditory model that is based on physiological evidence and consists of an early cochlear (ear) module and a central cortical (A1) module.
Figure 2.2.1 Detailed block diagram of the auditory model (feature extractor); the stage outputs are labeled y1 to y5.
2.2.1 Cochlear Module
The cochlear module models the functions of the peripheral auditory system, in which the cochlea behaves like a frequency analyzer. As Fig. 2.2.1 shows, the cochlear module consists of a bank of 128 overlapping asymmetric constant-Q band-pass filters (Q of about 4) that mimic the frequency selectivity of the cochlea. These filters are distributed evenly along the logarithmic frequency axis over 5.3 octaves, with a frequency resolution of 24 filters/octave. The output of each filter is fed into a non-linear compression stage and a lateral inhibitory network (LIN), and is then processed by an envelope extractor (a half-wave rectifier followed by a low-pass filter). The non-linear high-gain compression models the saturation of the inner hair cells, which transduce the vibrations of the basilar membrane along the cochlea into intracellular hair cell potentials. The auditory nerve then transmits the hair cell potentials to the cochlear nucleus of the central auditory system. This transmission is simulated by the LIN, which generates a spectral profile by detecting discontinuities along the frequency axis, followed by integration over a few milliseconds. This study uses a simplified linear version of this module in which the hair cell stage is disabled. To avoid the non-linear high-gain compression of the hair cells, all speech signals are normalized in advance. As in Fig. 2.2.1, the outputs at the different stages of this module can be written as:
$$y_1(t, f) = s(t) *_t h(t, f) \tag{1}$$
$$y_2(t, f) = \partial_f \, y_1(t, f) \tag{2}$$
$$y_3(t, f) = \max\big(y_2(t, f),\, 0\big) \tag{3}$$
$$y_4(t, f) = y_3(t, f) *_t \mu(t, \tau) \tag{4}$$
where $s(t)$ is the input speech, $h(t, f)$ is the impulse response of the constant-Q cochlear filter with center frequency $f$, $*_t$ denotes convolution in time, $\partial_f$ is the partial derivative along the $f$ axis, the integration window $\mu(t, \tau) = e^{-t/\tau}\, u(t)$ with time constant $\tau$ models the current leakage along the neural pathway to the cochlear nucleus (midbrain), and $u(t)$ is the unit step function.
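The four stages in Eqs. (1) to (4) can be illustrated with a minimal Python sketch. It is not the implementation used in this study: symmetric Gaussian filters applied in the FFT domain stand in for the asymmetric cochlear filters $h(t, f)$, so the filtering is a circular convolution, and the function name `cochlear_sketch`, the lowest center frequency `f_low` = 180 Hz, and the time constant `tau` = 8 ms are assumptions chosen only for illustration.

```python
import numpy as np


def cochlear_sketch(s, fs, n_filters=128, per_octave=24,
                    f_low=180.0, q=4.0, tau=0.008):
    """Simplified linear cochlear stages, Eqs. (1)-(4).

    Assumed stand-ins (not from the source): Gaussian band-pass filters,
    f_low = 180 Hz, tau = 8 ms. Returns (center_freqs, y4) with y4 of
    shape (n_filters, len(s)).
    """
    s = np.asarray(s, dtype=float)
    n = s.size

    # Center frequencies spaced evenly on a log axis (24 per octave).
    cf = f_low * 2.0 ** (np.arange(n_filters) / per_octave)

    # Eq. (1): constant-Q band-pass filtering, done in the FFT domain.
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    sigma = cf / (2.0 * q)  # bandwidth grows with cf (constant Q)
    H = np.exp(-0.5 * ((freqs[None, :] - cf[:, None]) / sigma[:, None]) ** 2)
    y1 = np.fft.irfft(np.fft.rfft(s)[None, :] * H, n=n, axis=1)

    # Eq. (2): first difference along the frequency (channel) axis.
    y2 = np.diff(y1, axis=0, prepend=y1[:1])

    # Eq. (3): half-wave rectification.
    y3 = np.maximum(y2, 0.0)

    # Eq. (4): leaky integration, i.e. convolution with exp(-t/tau) u(t),
    # implemented as a one-pole recursion over the time samples.
    a = np.exp(-1.0 / (fs * tau))
    y4 = np.empty_like(y3)
    acc = np.zeros(n_filters)
    for i in range(n):
        acc = a * acc + y3[:, i]
        y4[:, i] = acc
    return cf, y4
```

With the default arguments the 128 center frequencies span 127/24, or about 5.3, octaves, which matches the filter-bank layout described above. The sketch requires a sampling rate high enough to cover the top center frequency (16 kHz suffices for the assumed `f_low`).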
The output $y_4(t, f)$ is an auditory spectrogram that represents neural activity along the time ($t$) and log-frequency ($f$) axes. The auditory spectrogram produced by this simplified linear cochlear module is similar to the magnitude of a Mel-scaled FFT-based spectrogram: the constant-Q filter bank has an effect similar to that of the Mel scale, and the local envelope approximates the magnitude of an FFT-based spectrogram. Note that the LIN accounts for the spectral masking effect only when the hair cells behave non-linearly. Since this study omits the hair cell stage, the LIN here effectively only sharpens the constant-Q cochlear filters.
2.2.2 Cortical Module and Rate-Scale Representation
The second module models the spectro-temporal selectivity of neurons in the auditory cortex (A1). The auditory spectrogram $y_4(t, f)$ is further analyzed (filtered) by cortical neurons, which are modeled by two-dimensional filters tuned to different spectro-temporal modulation parameters (Chi et al. 2005). The rate (or velocity) parameter $\omega$ (in Hz) reflects how fast the local spectro-temporal envelope varies along the temporal axis. The scale (or density) parameter $\Omega$ (in cycles/octave) represents how densely the local spectro-temporal envelope is distributed along the log-frequency axis. In addition to the rate and the scale, cortical neurons are also sensitive to the sweeping direction of the frequency modulation (FM) of the sound. This module characterizes directional selectivity using the sign of the rate: negative for the upward sweeping direction and positive for the downward sweeping direction. Therefore, the four-dimensional output of this cortical module can be formulated as
$$r(t, f, \omega, \Omega) = y_4(t, f) *_{tf} \mathrm{STIR}(t, f, \omega, \Omega) \tag{5}$$
where $\mathrm{STIR}(t, f, \omega, \Omega)$ is the joint two-dimensional spectro-temporal impulse response (STIR) of the direction-selective filter tuned to $\omega$ and $\Omega$, and $*_{tf}$ denotes two-dimensional convolution in the time and log-frequency domains. More detailed formulations and derivations of $\mathrm{STIR}(t, f, \omega, \Omega)$ are available in Chi et al. (2005).
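The direction-selective filtering in Eq. (5) can be sketched in Python as a multiplication in the two-dimensional modulation (Fourier) domain. The exact STIR is defined in Chi et al. (2005) and is not reproduced here; the Gaussian transfer function below, its bandwidths (half of the tuned rate and scale), and the function name `rate_scale_filter` are assumptions made only for illustration. The sign convention follows the text: a positive rate selects downward-moving ripples and a negative rate selects upward-moving ones.

```python
import numpy as np


def rate_scale_filter(y, frame_rate, chans_per_octave, rate, scale):
    """Filter a spectrogram y (channels x frames) with one rate-scale filter.

    rate  : tuned temporal modulation in Hz (non-zero; sign sets direction)
    scale : tuned spectral modulation in cycles/octave (positive)
    The Gaussian pass-band is an assumed stand-in for the STIR.
    """
    n_chan, n_frames = y.shape
    wx = np.fft.fftfreq(n_chan, d=1.0 / chans_per_octave)  # cycles/octave
    wt = np.fft.fftfreq(n_frames, d=1.0 / frame_rate)      # Hz
    WX, WT = np.meshgrid(wx, wt, indexing="ij")

    bw_t = abs(rate) / 2.0  # assumed bandwidths, proportional to tuning
    bw_x = scale / 2.0

    def bump(r0, s0):
        return np.exp(-0.5 * (((WT - r0) / bw_t) ** 2
                              + ((WX - s0) / bw_x) ** 2))

    # Two mirrored pass-bands keep the output real. For rate > 0 they lie
    # in the quadrants where rate and scale share a sign (downward ripples).
    G = bump(rate, scale) + bump(-rate, -scale)

    # Multiplication in the modulation domain = circular 2-D convolution.
    return np.real(np.fft.ifft2(np.fft.fft2(y) * G))
```

Calling the function over a grid of rates (both signs) and scales and stacking the results gives a four-dimensional array that plays the role of $r(t, f, \omega, \Omega)$.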
The local energy of the four-dimensional output is then computed as
$$E(t, f, \omega, \Omega) = \big|\, r(t, f, \omega, \Omega) + j\, H\big(r(t, f, \omega, \Omega)\big) \big| \tag{6}$$
where $H(\cdot)$ is the Hilbert transform along the log-frequency ($f$) axis. From a functional point of view, cortical neurons perform a joint spectro-temporal multi-resolution analysis (due to various rate-scale combinations) on the input auditory spectrogram. The excitation pattern of cortical neurons associated with a single time-frequency (T-F) unit at $(t, f)$ of the input auditory spectrogram is referred to as the rate-scale (RS) representation of that particular T-F unit, and is expressed as $E(t, f, \omega, \Omega)$.
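The local energy in Eq. (6) is the magnitude of the analytic signal taken along the log-frequency axis. A minimal Python sketch is given below. It assumes that the filtered output for one rate-scale pair is stored with the frequency channels on the first axis, and the function name `local_energy` is hypothetical.

```python
import numpy as np


def local_energy(r):
    """Return |r + j H(r)| with the Hilbert transform taken along axis 0.

    r: real array whose first axis is the log-frequency (channel) axis.
    """
    n = r.shape[0]
    R = np.fft.fft(r, axis=0)

    # Analytic-signal weights: keep DC (and Nyquist if n is even),
    # double the positive frequencies, zero the negative ones.
    w = np.zeros(n)
    w[0] = 1.0
    if n % 2 == 0:
        w[n // 2] = 1.0
        w[1:n // 2] = 2.0
    else:
        w[1:(n + 1) // 2] = 2.0
    w = w.reshape((n,) + (1,) * (r.ndim - 1))

    analytic = np.fft.ifft(R * w, axis=0)  # equals r + j H(r)
    return np.abs(analytic)
```

For a pure spectral ripple the result is a flat envelope, which is why the energy $E$ is insensitive to the phase of the ripple along the frequency axis.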
The frame-based RS representation of an utterance can be obtained by averaging the RS representations of T-F units over the frequency axis as follows:
$$P(\omega, \Omega, t) = \sum_{f} E(t, f, \omega, \Omega) \tag{7}$$
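As an illustration of Eq. (7), the pooling over frequency reduces to a single array reduction. The sketch below assumes that the energies are stored in an array with axes ordered as ($t$, $f$, $\omega$, $\Omega$); this layout and the function names `frame_rs` and `long_term_average` are hypothetical.

```python
import numpy as np


def frame_rs(E):
    """Eq. (7): pool E over the frequency axis.

    E has axes (t, f, rate, scale); the result has axes (rate, scale, t).
    """
    P = E.sum(axis=1)             # sum over f, leaving (t, rate, scale)
    return np.moveaxis(P, 0, -1)  # reorder to (rate, scale, t)


def long_term_average(P):
    """Average the frame-based RS representation over all frames."""
    return P.mean(axis=-1)
```

Dividing the sum by the number of channels turns it into an average over frequency and changes $P$ only by a constant factor.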
The bottom panels of Fig. 2.2.2 show the time-varying RS representation $P(\omega, \Omega, t)$ of a sample speech signal around 200 and 550 ms. Each plot of the RS representation clearly shows two attributes: (1) spectro-temporal modulations of envelopes and (2) resolved pitch below 512 Hz. Consider the 550 ms frame as an example. The resolved pitch around 230 Hz produces a strong response in the high-rate, high-scale (pitch-related) region. On the other hand, the envelope of the almost flat harmonic structure at 230, 460, and 1150 Hz produces strong low-rate (no FM, because of the flatness), low-scale (2 cycles within 2.32 octaves) responses in the region below 8 Hz and below 1 cycle/octave. Since flat envelopes do not favor any sweeping direction, the {low rate, low scale} region exhibits symmetric rate responses. Figure 2.2.2 shows that the frame-based $P(\omega, \Omega, t)$ encodes information about spectro-temporal structures, including but not limited to pitch, harmonicity, formant spacing, and the amplitude modulation (AM) and FM of an input sound at each time instant. Some of these structures, such as pitch, AM, and FM, are associated with the prosody of the sound, while others are associated with its spectral characteristics. Variations of these two types of features (prosodic and spectral) commonly appear in speech emotion recognition research (Cowie et al. 2001; Mozziconacci 2002; Scherer 2003; New et al. 2003; Ververidis and Kotropoulos 2006; Schuller et al. 2007a; Busso et al. 2009). Therefore, the proposed time-varying RS representation could be a good candidate for speech emotion recognition.
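The low-scale figure quoted for the 550 ms frame follows from simple arithmetic, shown here as a short Python check. The variable names are illustrative only; the harmonic frequencies and the count of two envelope cycles are taken from the text above.

```python
import math

# Harmonic peaks of the 550 ms frame, as quoted in the text (Hz).
harmonics_hz = [230.0, 460.0, 1150.0]

# Span of the harmonic structure on the log-frequency axis, in octaves.
span_octaves = math.log2(harmonics_hz[-1] / harmonics_hz[0])

# Two envelope cycles across that span give the spectral modulation scale.
scale_cyc_per_oct = 2.0 / span_octaves

print(f"span  = {span_octaves:.2f} octaves")
print(f"scale = {scale_cyc_per_oct:.2f} cycles/octave")
```

The span is log2(5), about 2.32 octaves, so the scale is about 0.86 cycles/octave, which falls below the 1 cycle/octave bound stated for the low-scale region.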
The left and right panels of Fig. 2.2.2 show the long-term averaged $P(\omega, \Omega, t)$ of clean speech and white noise, respectively. The long-term averaged RS representation of clean speech in Fig. 2.2.2 was computed from 30 clean utterances taken from the NOIZEUS corpus (Loizou 2007). Clearly, the white noise primarily affects the pitch region (> 128 Hz) of speech. In addition to the pitch region, speech possesses high energy in the low-scale, low-rate region (< 4 cycles/octave, < 32 Hz), while white noise activates the high-scale, high-rate region (> 2 cycles/octave, > 32 Hz) because of differences in the structures of their spectro-temporal envelopes. This indicates that local spectro-temporal speech envelopes are mostly smoother than white noise envelopes along either the time or the frequency axis.
These spectro-temporal envelopes critically encode the amplitude modulation and the frequency modulation of the sound, which are vital cues for humans to segregate individual sound streams from a sound mixture (Grimault et al. 2002; Carlyon et al. 2000). This segregation process in human auditory perception is very important in people's daily lives and is referred to as auditory scene analysis (ASA) (Bregman 1990). Since speech envelope modulation is critical to hearing perception and differs greatly from white noise envelope modulation, this study uses the time-varying $P(\omega, \Omega, t)$, which decomposes the modulations of local envelopes in a multi-resolution fashion, to assess speech emotions under noisy conditions.
Figure 2.2.2 Rate-scale representation of a speech frame.