Chapter 1 Introduction
1.2 Motivation
Human-machine interfaces are expected to be the killer application of the next generation. Many people are able to speak but not to write. Also, many people, particularly the elderly, would rather listen to clear speech than read text.
Therefore, speech enhancement is becoming increasingly important to our society as the elderly population grows.
Auditory models have evolved from one-dimensional into multi-dimensional models. Auditory-model-based speech enhancement techniques should therefore be built on the multi-dimensional auditory representation. The preliminary work by Yung showed significant improvements in speech recognition rate [11]; hence, we propose a subspace decomposition coupled with Yung’s method to further explore the robustness of the multi-dimensional speech enhancement technique.
Chapter 2
Literature Review
In this chapter, we briefly describe the auditory model and the subspace decomposition algorithm utilized in this thesis. First, the auditory model developed by Shamma et al. is introduced [9, 10, 12, 13]. Our proposed approach operates on the representations produced by this auditory model. In section 2.2, we briefly review basic subspace algorithms for speech enhancement [5, 14, 15]. Finally, the supervector technique, which is used to express higher-dimensional representations in our subspace decomposition, is described concisely [16, 17].
2.1 Hearing Physiology
During the past decades, the idea of adopting properties of human hearing in speech-related applications has become increasingly popular among speech researchers. Here, we adopt a similar idea and study speech enhancement in an internal perceptual representation of an auditory model. Basic hearing physiology and the auditory model proposed by Shamma et al. are introduced step by step in this section.
2.1.1 Hearing Physiology
FIGURE 2-1 The anatomy of the ear.
(http://www.advcoch.com/I2_Hearing_Physiology.htm)
The ear can be divided into three parts: the outer ear, the middle ear and the inner ear. The anatomy of the ear is shown in figure 2-1.
The most important functions of the outer ear are localization, amplification and protection. Because we have two ears, we can use the phase delay and amplitude difference between them to judge the direction of a sound source. In addition, the ear canal can be regarded as a filter with its largest gain at about 3,300 Hz.
The middle ear is the portion of the ear internal to the eardrum and external to the oval window of the cochlea. When sound arrives at the eardrum, it is converted from a pressure wave into mechanical vibration. By passing through the three ossicles, known as the malleus, incus and stapes, the sound signal is conveyed to the oval window, the entrance of the inner ear.
The cochlea in the inner ear plays a significant role in the auditory system. It consists of three chambers filled with lymph, as shown in figure 2-2. When the mechanical vibration arrives at the oval window, a traveling wave is generated and propagates along the basilar membrane (BM) of the cochlea. Different locations of the BM reach their maximum responses for traveling waves of different frequencies. The basilar membrane is about 35 mm in length, with its width increasing and its stiffness decreasing progressively from base to apex. The left panel of figure 2-2 shows a diagram of the basilar membrane and the right panel shows the maximally responsive frequencies along it. The range of resonance frequencies is about 20-20,000 Hz, which is the audible frequency range of human beings.
FIGURE 2-2 The basilar membrane diagram (left) and the characteristic frequency at the basilar membrane (right). (Hearing Physiology Handout, AAIP)
For a complex sound consisting of several frequencies, the overall response pattern of the BM is determined by the resonances of all input frequency components. The mechanical inhibition between neighboring frequencies on the BM might be the main reason for the well-known “frequency masking” phenomenon of human audition.
The traveling wave generates displacement of the BM, and the hair cells distributed along the basilar membrane transform the displacement pattern into a corresponding pattern of sensory nerve action potentials. There are two types of hair cells: inner hair cells (IHCs) and outer hair cells (OHCs). Most of the transformation from mechanical vibrations to electrical potentials is done by IHCs, which act as sensors connected to the auditory nerve. OHCs, on the other hand, mainly serve to amplify or attenuate the action potentials passing through the auditory nerve in order to protect the auditory sensory system. Because a relaxation time is needed between consecutive firings of auditory neurons, firing rates cannot keep up with high-frequency components, as demonstrated in figure 2-3. Firing rates of IHCs are bounded by 4-5 kHz and rates in the midbrain are bounded by about 1 kHz.
FIGURE 2-3 The firing rate of the auditory nerve in response to a monotone audio input. (Hearing Physiology Handout, AAIP)
2.1.2 Spectrum Estimation of Auditory Perceptual Model
The first stage of the auditory perceptual model is to simulate the sound pathway from the cochlea, hair cells and auditory nerves to the midbrain. It is divided into three substages – analysis stage, transduction stage and reduction stage, as shown in figure 2-4.
FIGURE 2-4 Diagram of the first stage of the auditory model. (Auditory Model Handout, AAIP)
The cochlea is often regarded as a frequency analyzer and is therefore modeled by a bank of 128 constant-Q bandpass filters in the analysis stage. Figure 2-5 shows a filterbank consisting of 128 IIR filters uniformly distributed over 5.3 octaves with a frequency resolution of 24 filters/octave. The bandwidth and the center frequency of each filter satisfy the following equation:
$$Q = \frac{f_{center}}{\text{bandwidth}} \tag{2-1}$$
where $Q$ is a constant (= 4) in our implementation. Clearly, as the center frequency increases, the corresponding bandwidth increases proportionally.
This property reflects the general idea that the cochlea possesses higher frequency resolution (i.e., narrower bandwidth) in low-frequency regions than in high-frequency regions.
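The relation in (2-1) can be illustrated numerically. The sketch below lists center frequencies spaced 24 per octave and the bandwidths implied by Q = 4. The lowest center frequency `f_low` is a hypothetical value chosen only for illustration, since the text does not specify it, and the function name is likewise an assumption.

```python
# Sketch of constant-Q spacing: 128 channels, 24 per octave, Q = 4.
# f_low is a hypothetical starting frequency, not a value from the text.

def constant_q_bank(f_low=180.0, n_filters=128, per_octave=24, q=4.0):
    """Return a list of (center_frequency, bandwidth) pairs with bandwidth = fc / Q."""
    bank = []
    for k in range(n_filters):
        fc = f_low * 2.0 ** (k / per_octave)  # log-spaced center frequency
        bank.append((fc, fc / q))             # equation (2-1) solved for bandwidth
    return bank


bank = constant_q_bank()
for k in (0, 24, 127):
    fc, bw = bank[k]
    print(f"channel {k:3d}: fc = {fc:8.1f} Hz, bandwidth = {bw:7.1f} Hz")
```

Channels that are 24 indices apart differ by exactly one octave, and the bandwidth doubles along with the center frequency, which is the loss of resolution at high frequencies described above.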
FIGURE 2-5 The filterbank consists of 129 filters which conform to $Q = f_{center}/\text{bandwidth}$.
In the analysis stage, outputs of the cochlear filterbank can be represented by the following equation:
$$y_{coch}(t,x) = s(t) \otimes h(t,x) \tag{2-2}$$
where $s(t)$ is the input signal, $x$ encodes the location of a particular cochlear filter along the BM (i.e., the log-frequency axis from an engineering point of view) and $h(t,x)$ are the impulse responses of the filterbank.
The transduction stage then models the behaviors of inner hair cells, including (1) the transduction of the traveling pressure to the velocity in the lymph; (2) the neural saturation; and (3) current leakages. This stage can be formulated as
$$y_{AN}(t,x) = g\big(\partial_t\, y_{coch}(t,x)\big) \otimes \omega(t) \tag{2-3}$$
where $\partial_t$ models the transduction of the hydraulic pressure to velocity; the sigmoid function $g$ is used to simulate the neural saturation as follows:
$$g(u) = \frac{1}{1 + e^{-u}} \tag{2-4}$$
and the low-pass function $\omega(t)$ is used to account for current leakages of auditory neurons.
The last stage, the reduction stage, addresses two important observations in the auditory sensory system: (1) the lateral inhibition of auditory neurons, which might account for the frequency masking phenomenon shown in human hearing; and (2) the observed reduction of temporal dynamics from the cochlea to the midbrain. The following two equations are formulated in the auditory model we used:
$$y_{LIN}(t,x) = \max\big(\partial_x\, y_{AN}(t,x),\, 0\big) \tag{2-5}$$
$$y_{final}(t,x) = y_{LIN}(t,x) \otimes \mu(t;\tau) \tag{2-6}$$
where the first-order derivative $\partial_x\, y_{AN}(t,x)$ simply approximates the lateral inhibition between neighboring neurons, the half-wave rectifier puts the constraint on the negative potential, and the low-pass filter $\mu(t;\tau) = e^{-t/\tau} \cdot u(t)$ with a time constant $\tau$ models the reduction of temporal dynamics in the midbrain.
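Equations (2-3) to (2-6) can be sketched in discrete time, starting from an array of cochlear filter outputs of shape (time, channels). This is only a rough illustration under several assumptions: finite differences stand in for both derivatives, a one-pole recursive filter normalized to unit DC gain stands in for both low-pass functions, and the sampling rate and time constants are hypothetical values rather than those of the model in [12]. The random array used as input is a stand-in for real filterbank output.

```python
import numpy as np


def one_pole_lowpass(x, tau, fs):
    """Leaky integrator along time (axis 0), a discrete stand-in for exp(-t/tau) u(t)."""
    a = np.exp(-1.0 / (tau * fs))
    y = np.empty_like(x, dtype=float)
    acc = np.zeros(x.shape[1:])
    for n in range(x.shape[0]):
        acc = a * acc + (1.0 - a) * x[n]  # unit DC gain (an assumption)
        y[n] = acc
    return y


def sigmoid(u):
    """Saturating nonlinearity of equation (2-4)."""
    return 1.0 / (1.0 + np.exp(-u))


def auditory_stages(y_coch, fs=8000.0, tau_leak=0.0005, tau_mid=0.008):
    """Map cochlear outputs (time, channels) to an auditory-spectrogram-like array."""
    # (2-3): temporal derivative, saturation, then leakage low-pass
    dy_dt = np.diff(y_coch, axis=0, prepend=y_coch[:1])
    y_an = one_pole_lowpass(sigmoid(dy_dt), tau_leak, fs)
    # (2-5): difference across neighboring channels, then half-wave rectification
    y_lin = np.maximum(np.diff(y_an, axis=1), 0.0)
    # (2-6): temporal integration with time constant tau_mid
    return one_pole_lowpass(y_lin, tau_mid, fs)


# Random data standing in for the output of a 129-channel filterbank.
y_coch = np.random.default_rng(0).standard_normal((200, 129))
aud = auditory_stages(y_coch)
print(aud.shape)
```

Note that the difference across channels reduces the channel count by one, so 129 filter outputs yield 128 spectrogram channels in this sketch.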
The output of these three stages is a two-dimensional representation in the spectral (log-frequency) and temporal domain and is referred to as the auditory
spectrogram [12]. Yung’s study showed features extracted from auditory spectrograms are more robust in speech recognition tasks [11]. One example of the auditory
spectrogram is shown in figure 2-6.
FIGURE 2-6 An example of wav2aud using sentence “come home right away”.
2.1.3 Cortical Analysis
The process of generating the auditory spectrogram, an estimate of the spectrum by the inner ear, was introduced in the previous section. Furthermore, neurophysiological evidence reveals that neurons in the higher auditory cortex (AI) respond to different frequencies as well as to the temporal structures of the patterns generated by the inner ear. In other words, AI neurons exhibit different spectro-temporal tunings and can be characterized by Spectro-Temporal Receptive Fields (STRFs), which can be considered as two-dimensional spectro-temporal impulse responses from an engineering perspective. To measure the 2D impulse responses of neurons in AI, one has to use orthogonal basis signals in the spectro-temporal domain to drive the cortex.
Such spectro-temporal basis signals are the so-called moving ripple stimuli. Figure 2-7 shows one example of a moving ripple stimulus with rate = +4 (Hz, the temporal velocity) and scale = 0.5 (cycle/octave, the density along the log-frequency axis). In addition to the magnitudes of the rate and scale parameters, the directional selectivity of the FM sweep is encoded by the sign of the rate parameter: a positive rate represents the downward direction, i.e., frequency decreasing with time, and a negative rate represents the upward direction.
FIGURE 2-7 An example of moving ripple stimulus.
( Auditory Model Handout, AAIP)
By measuring the impulse responses of many neurons, researchers have concluded that different AI neurons are roughly tuned to different combinations of rate, scale and direction. Therefore, the auditory cortex can be modeled as a bank of 2D bandpass filters that analyze the input 2D auditory spectrogram. The schematic plot in figure 2-8 demonstrates the 2D cortical filtering of AI on a sample spectrogram. The small top panels of each subplot are the impulse responses of different typical neurons tuned to slow/fast rates and coarse/fine scales. The bottom panels are the outputs of these 2D spectro-temporal filters.
The overall outputs of the 2D filtering form a four-dimensional representation (in rate, scale, log-frequency and time), which is hard to illustrate. Therefore, we integrate the 4D output along both the spectral and temporal axes to generate an energy pattern on the remaining rate-scale axes. Figure 2-9 shows auditory spectrograms ((a), (b)) and rate-scale energy representations ((c), (d)) of clean speech and white noise.
This figure demonstrates that most of the spectro-temporal modulations of speech are within the range of rate = 2-16 Hz and scale = 0.5-4 cyc/oct, while the modulations of white noise are distributed over high rates and all possible scales.
FIGURE 2-8 The response for 8 basic nerves in the cortex. (Auditory Model Handout, AAIP)
FIGURE 2-9 (a) clean speech. (b) white noise. (c) clean speech in rate-scale domain with rate and scale in x- and y- axis. (d) white noise in rate-scale domain.
2.2 Basic Subspace Algorithms in Speech Enhancement
There are many speech enhancement algorithms, such as spectral subtraction [1, 2], Wiener filtering [3] and statistical-model-based methods [4]. In this study, a subspace decomposition algorithm based on linear algebra is utilized, and it is introduced in this section. Subspace algorithms suppress noise by keeping signal components that fall in the “speech” space while excluding components in the “noise” space. In this section, we first introduce the time-domain linear optimal estimator, which minimizes the speech distortion under white noise subject to certain constraints.
Next, colored noise, which is closer to the real noise around us, is considered in our algorithm.
2.2.1 Time-Domain Constraints
Consider the noisy speech signal $y = x + d$, containing samples of clean speech $x$ and noise $d$. The correlation matrix of $y$ (of length $K$) is defined as:
$$R_y \equiv E[y \cdot y^T] \tag{2-7}$$
which is positive semi-definite, assuming $x$ and $d$ are wide-sense stationary signals. If we postulate that the signal and the noise vectors are uncorrelated and zero mean, the preceding equation can be reduced to:
$$R_y = R_x + R_d \tag{2-8}$$
where $R_x \equiv E[x \cdot x^T]$ and $R_d \equiv E[d \cdot d^T]$ are the auto-correlation matrices of the signal and noise, respectively. If we further assume that the noise is white with variance $\sigma_d^2$, the noise correlation matrix will be diagonal and equation (2-8) can be rewritten as:
$$R_y = R_x + \sigma_d^2 \cdot I \tag{2-9}$$
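Let $\hat{x} = H \cdot y$ be a linear estimator of the clean speech $x$, where $H$ is a $K \times K$ matrix. The estimation error is
$$\varepsilon = \hat{x} - x = (H - I) \cdot x + H \cdot d = \varepsilon_x + \varepsilon_d \tag{2-10}$$
where $\varepsilon_x$ represents the speech distortion, and $\varepsilon_d$ represents the residual noise.
Next we define the energy of $\varepsilon_x$ and $\varepsilon_d$ as:
$$\bar{\varepsilon}_x^2 = E[\varepsilon_x^T \varepsilon_x] = \mathrm{tr}\big(E[\varepsilon_x \varepsilon_x^T]\big), \qquad \bar{\varepsilon}_d^2 = E[\varepsilon_d^T \varepsilon_d] = \mathrm{tr}\big(E[\varepsilon_d \varepsilon_d^T]\big) \tag{2-11}$$
Thus we can obtain the optimum linear estimator by solving the following time-domain constrained problem:
$$\min_H \bar{\varepsilon}_x^2 \quad \text{subject to} \quad \frac{1}{K}\,\bar{\varepsilon}_d^2 \le \zeta \tag{2-12}$$
where $\zeta$ is a positive constant. This constrained optimization problem can be solved as in [18]: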
$$H_{opt} = R_x \left(R_x + \mu \cdot R_d\right)^{-1} \tag{2-13}$$
where $\mu$ is the Lagrange multiplier. The formula of this optimal estimator $H_{opt}$ is similar to that of the Wiener filter when $\mu = 1$. The major difference is that $H_{opt}$ works in the time domain, whereas the Wiener filter operates in the frequency domain. In addition, the constant $\mu$ gives us considerable freedom in designing our estimator.
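For orientation, the form of (2-13) can be recovered with a short Lagrangian argument. The sketch below assumes a linear estimator $\hat{x} = H y$ with uncorrelated, zero-mean speech and noise, and assumes that the constraint bounds the residual noise energy by $K\zeta$. The symbols $\hat{x}$ and $\bar{\varepsilon}$ are notational assumptions and may differ from those used in [18].

```latex
\begin{align*}
\hat{x} - x &= (H - I)\,x + H\,d = \varepsilon_x + \varepsilon_d \\
\bar{\varepsilon}_x^2 &= \operatorname{tr}\!\big((H - I)\, R_x\, (H - I)^T\big), \qquad
\bar{\varepsilon}_d^2 = \operatorname{tr}\!\big(H R_d H^T\big) \\
\mathcal{L}(H, \mu) &= \bar{\varepsilon}_x^2 + \mu \big(\bar{\varepsilon}_d^2 - K\zeta\big) \\
\nabla_H \mathcal{L} &= 2\,(H - I)\,R_x + 2\,\mu\, H R_d = 0 \\
H_{opt} &= R_x \big(R_x + \mu R_d\big)^{-1}
\end{align*}
```

Setting $\mu = 1$ in the last line gives $R_x (R_x + R_d)^{-1}$, which is the Wiener-type solution mentioned above; larger values of $\mu$ trade more speech distortion for less residual noise.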
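Furthermore, equation (2-13) can be simplified by using the eigen-decomposition $R_x = U \Lambda_x U^T$ together with the white-noise assumption $R_d = \sigma_d^2 \cdot I$:
$$H_{opt} = U \Lambda_{opt} U^T \tag{2-14}$$
where $\Lambda_{opt}$ is a $K \times K$ diagonal matrix given by:
$$\Lambda_{opt} = \Lambda_x \left(\Lambda_x + \mu \cdot \sigma_d^2 \cdot I\right)^{-1} \tag{2-15}$$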
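The equivalence between (2-13) and the eigen-domain form (2-15) can be checked numerically. The following is a minimal sketch assuming white noise, $R_d = \sigma_d^2 I$, and a synthetic low-rank speech correlation matrix; the function names and test values are illustrative only.

```python
import numpy as np


def h_opt_direct(R_x, sigma2, mu):
    """Equation (2-13) with white noise, i.e. R_d = sigma2 * I."""
    K = R_x.shape[0]
    return R_x @ np.linalg.inv(R_x + mu * sigma2 * np.eye(K))


def h_opt_eigen(R_x, sigma2, mu):
    """Eigen-domain form: one scalar gain per eigen-direction of R_x, as in (2-15)."""
    lam, U = np.linalg.eigh(R_x)        # R_x = U diag(lam) U^T
    gains = lam / (lam + mu * sigma2)   # diagonal of Lambda_opt
    return (U * gains) @ U.T            # U diag(gains) U^T


rng = np.random.default_rng(0)
A = rng.standard_normal((6, 3))
R_x = A @ A.T                           # synthetic rank-3 speech correlation, K = 6

H1 = h_opt_direct(R_x, sigma2=0.5, mu=2.0)
H2 = h_opt_eigen(R_x, sigma2=0.5, mu=2.0)
print(np.allclose(H1, H2))
```

In the eigen-domain form, directions whose eigenvalue is close to zero receive a gain close to zero, which is how components lying in the noise-only subspace are suppressed.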
2.2.2 Pre-whitening for Colored Noise
Only white noise with a diagonal correlation matrix was considered in the previous section. In practice, however, background noises are seldom white; they are usually colored. A simple way to deal with colored noise is to transform it into white noise by a pre-whitening process, which is introduced in this section.
The correlation matrix $R_d$ of the noise, which can be estimated from the speech-absent segments, is factorized by the Cholesky factorization:
$$R_d = R^T R = L \cdot L^T \tag{2-16}$$
where $L$ is a unique lower triangular $K \times K$ matrix. Multiplying the pre-whitening matrix $L^{-1}$ to the equation (2-8) yields:
where $d'$ becomes white after the pre-whitening procedure. (See Appendix I for the proof.) Therefore, the correlation matrix $R_{y'}$ of the noisy speech can be rewritten as:
After deriving the linear estimator of $x'$ as described in the previous section, we multiply the estimate $\hat{x}'$ by $L$ to obtain the post-whitening estimate $\hat{x}$. These procedures can be formulated as:
$$\hat{x} = L \cdot H' \cdot L^{-1} \cdot y \tag{2-19}$$
where $H'$, the optimal estimator for the pre-whitened signals as in equation (2-18), has the same form as $H_{opt}$ in equation (2-13).
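The pre-whitening chain in (2-16) and (2-19) can be sketched as follows. The sketch assumes that the whitened signals are obtained as $L^{-1}$ times the original ones, so that the whitened noise has an identity correlation matrix and $H'$ takes the form of (2-13) with $R_d$ replaced by $I$. The matrices are synthetic and the function name is illustrative.

```python
import numpy as np


def prewhitened_estimator(R_x, R_d, mu):
    """Return L @ H' @ inv(L), where R_d = L L^T and H' is built for whitened data."""
    K = R_x.shape[0]
    L = np.linalg.cholesky(R_d)          # lower triangular factor of R_d
    L_inv = np.linalg.inv(L)
    R_x_w = L_inv @ R_x @ L_inv.T        # speech correlation after whitening
    # Whitened noise has correlation I (assumption stated in the lead-in).
    H_w = R_x_w @ np.linalg.inv(R_x_w + mu * np.eye(K))
    return L @ H_w @ L_inv


rng = np.random.default_rng(1)
A = rng.standard_normal((5, 5))
B = rng.standard_normal((5, 8))
R_x = A @ A.T                            # synthetic speech correlation
R_d = B @ B.T / 8 + 0.1 * np.eye(5)      # synthetic colored-noise correlation

H = prewhitened_estimator(R_x, R_d, mu=1.5)
H_ref = R_x @ np.linalg.inv(R_x + 1.5 * R_d)   # equation (2-13) applied directly
print(np.allclose(H, H_ref))
```

Under these assumptions, whitening, estimating and de-whitening give the same matrix as applying (2-13) directly with the colored-noise correlation matrix, so the pre-whitening step changes how the estimator is computed but not the estimator itself.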
For colored noise, the noise correlation matrix is not diagonalized by $U$, since $U$, the eigenvector matrix of $R_x$, diagonalizes $R_x$ but not $R_d$. It is shown in [19] that there exists a matrix $V$ which simultaneously diagonalizes $R_x$ and $R_d$:
$$V^T R_x V = \Delta_x, \qquad V^T R_d V = I \tag{2-20}$$
where $\Delta_x$ and $V$ are the eigenvalue matrix and eigenvector matrix, respectively, of $\Sigma = R_d^{-1} R_x$. Note that the eigenvector matrix $V$ is not orthogonal. Hence, we can rewrite the optimal linear estimator from equation (2-15) as:
$$H_{opt} = V^{-T} \Delta_x \left(\Delta_x + \mu \cdot I\right)^{-1} V^T \tag{2-21}$$
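The estimator in (2-21) can be illustrated with a small numerical sketch. It assumes that $V$ is normalized so that $V^T R_d V = I$ and $V^T R_x V$ is diagonal, and it obtains $V$ through a Cholesky factor of $R_d$. That route is chosen here for convenience and is not necessarily the construction used in [19]; the function names and matrices are illustrative.

```python
import numpy as np


def joint_diagonalizer(R_x, R_d):
    """Return (delta, V) with V.T @ R_d @ V = I and V.T @ R_x @ V = diag(delta)."""
    L = np.linalg.cholesky(R_d)
    L_inv = np.linalg.inv(L)
    delta, Q = np.linalg.eigh(L_inv @ R_x @ L_inv.T)  # symmetric eigenproblem
    V = L_inv.T @ Q                                   # eigenvectors of inv(R_d) @ R_x
    return delta, V


def h_opt_colored(R_x, R_d, mu):
    """Estimator of the form in (2-21), under the normalization assumed above."""
    delta, V = joint_diagonalizer(R_x, R_d)
    gains = delta / (delta + mu)
    return np.linalg.inv(V).T @ np.diag(gains) @ V.T


rng = np.random.default_rng(2)
A = rng.standard_normal((5, 3))
B = rng.standard_normal((5, 8))
R_x = A @ A.T                            # synthetic rank-3 speech correlation
R_d = B @ B.T / 8 + 0.1 * np.eye(5)      # synthetic colored-noise correlation

delta, V = joint_diagonalizer(R_x, R_d)
H = h_opt_colored(R_x, R_d, mu=1.5)
H_ref = R_x @ np.linalg.inv(R_x + 1.5 * R_d)
print(np.allclose(H, H_ref))
```

The columns of `V` are generally not orthonormal, which is why the estimator uses the inverse transpose of `V` on the left rather than `V` itself.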
2.3 Supervector: 2D Image Processing
Many perceptual properties in hearing and in vision share similar sensory
mechanisms [20]. For example, the principles to group sounds from a spectrogram are the same principles to group objects from an image. Therefore, in this study, we treat the speech enhancement in spectrograms as a 2D image enhancement problem. The most common technique in 2D image enhancement is using the supervector technique to transform the 2D task into a 1D task, as shown in some eigenface studies [16, 17].
In image processing applications, a pattern of $N \times N$ elements is usually rearranged into a vector of size $1 \times N^2$. The rearrangement only changes the layout, so the characteristics of the $N \times N$ matrix are preserved in the $1 \times N^2$ vector, as shown in figure 2-10.
FIGURE 2-10 The realignment diagram showing the transition of 2D to 1D.
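The rearrangement can be sketched in a few lines. The row-major stacking order used here is an assumption, since the text does not state which order is used; any fixed order works as long as the same order is used when mapping back. The function names are illustrative.

```python
import numpy as np


def to_supervector(patch):
    """Stack an M x N patch into a length M*N vector (row-major order, an assumption)."""
    return np.asarray(patch).reshape(-1)


def from_supervector(vec, shape):
    """Undo to_supervector for a patch of the given (M, N) shape."""
    return np.asarray(vec).reshape(shape)


patch = np.arange(12).reshape(3, 4)     # toy 3 x 4 "image"
vec = to_supervector(patch)
restored = from_supervector(vec, patch.shape)
print(vec.shape, np.array_equal(restored, patch))
```

Because the mapping is invertible, a subspace method can operate on the vector and the result can be reshaped back into a 2D pattern afterwards.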
Chapter 3
Subspace Decomposition of Perceptual Representations for Speech Enhancement
The auditory model and the basic subspace algorithm were described in Chapter 2.
The subspace decomposition of perceptual representations is presented in detail in this chapter.
3.1 Introduction
Most speech processing algorithms are developed either in the temporal domain (channel by channel) or in the spectral domain (frame by frame). However, neurophysiological evidence indicates that the human brain analyzes speech in a joint
spectro-temporal fashion, considering temporal dynamics and spectral content at the same time. Our approach of taking the joint spectro-temporal domain into
understand speech in noisy environments merely because of significant differences shown in spectro-temporal structures between speech and noise, as in figure 3-1.
Following this concept, we propose the subspace decomposition algorithm in the joint spectro-temporal domain to extract speech-related features.
FIGURE 3-1 The auditory spectrogram of the clean speech (left) and the speech with 0dB car noise (right).
The spectro-temporal auditory representation used in this study was proposed in [9]. As pointed out in [9], the four-dimensional cortical impulse response is given by:
$$STRF(x,t;\Omega,\omega) = RF(x;\Omega,\phi) \cdot h_{IR}(t;\omega,\theta) \tag{3-1}$$
where $RF(x)$ is the response field along the log-frequency (tonotopic) axis and $h_{IR}(t)$ is the temporal impulse response. It has been shown that most of the modulations of speech signals fall in the range of rate = 2~16 Hz and scale = 0.5~8 cyc/oct [11]. Thus, we use modulations within those ranges to extract spectro-temporal structures of speech in our enhancement application as:
The Spectro-Temporal Cortical Response $STCR_{\Omega,\omega}(x,t)$ within speech regions can then be written as:
Ω = cycle/octave. Next, we apply the subspace decomposition via the supervector technique to each STCR separately.
As shown in figure 2-10, we transfer each 2D STCR to a 1D vector, i.e.,
$$\text{matrix}\,(M \times N) \Rightarrow \text{vector}\,(M \cdot N \times 1)$$
Transferring a 2D matrix to a 1D vector is a conventional way to apply the subspace decomposition to the perceptual representation STCR.
In the proposed subspace decomposition approach, the quality of the noise estimate directly affects the enhancement result. In this study, we do not pursue this issue further and simply estimate the noise from the first few milliseconds of the input signal, as described in the next section.
Figure 3-2 illustrates the signal flow of our proposed algorithm. Panels (a), (b) and (c) show the original time-domain waveform, the original auditory spectrogram and the spectro-temporal modulation energies at different (rate, scale) combinations, respectively. Panel (d) shows the filtered spectro-temporal responses ST within speech regions, and panel (f) shows the responses enhanced by our proposed subspace decomposition algorithm. Panel (e) shows the enhanced spectrogram reconstructed from the responses in (d), i.e., from the modulations of speech only [11]. Finally, panel (g) shows the final enhanced spectrogram reconstructed from all enhanced responses in (f), which are derived from (d).
FIGURE 3-2 Flowchart of the proposed algorithm.
3.2 The 2D Neural Patterns in the Cortex
Equation (3-3) indicates the speech region in the cortical domain. Figure 3-3 shows STCRs for combinations of rate = 1, 2, 4 Hz and scale = 0.5, 1, 2, 4 cyc/oct. It is noteworthy that (1) the lower the rate, the more time delay the STCR shows; and (2) according to the sampling theorem, the upper bound of scale to avoid aliasing is 12 cyc/oct, given the sampling of 24 samples per octave along the log-frequency axis. In this section, we discuss several issues related to the proposed algorithm, including (1) reduction of the computation and (2) a simple estimation of noise.
FIGURE 3-3 The STCRs of clean speech from fig 3-1 (left). Top to bottom are rate=
-4, -2, -1, +1, +2, +4 and left to right are scale= 0.5, 1, 2, 4 respectively.
3.2.1 Dimension Redundancy Problem
Due to the high dimension of our spectrogram, our eigen-decomposition algorithm requires much heavier computation than other speech enhancement algorithms, such as spectral subtraction and Wiener filtering. To tackle this problem, we can (1) reduce the dimension of the spectrogram or (2) partition the whole spectrogram into smaller segments for eigen-decomposition.
According to the sampling theorem, a low-pass signal with no high-frequency components can be downsampled without loss of information. Theoretically, in the log-frequency dimension, we could downsample by a factor of 3 in the scale = 4 cyc/oct channel, since the upper bound of scale is 12 cyc/oct. In practice, however, we use less aggressive downsampling factors to avoid any possible aliasing. Table 3-1 shows the downsampling factors we use for channels at certain scales.
scale (cyc/oct)        0.5    1    2    4    8
downsampling factor      8    8    4    2    1
Table 3-1 Downsampling factors for scales.
For the same reason, in the temporal dimension, we could downsample by a factor of 25 in the rate = 2 Hz channel, since the upper bound of rate is 50 Hz. Table 3-2 shows the downsampling factors we use for various rates.
rate (Hz)                2    4    8   16
downsampling factor      4    4    2    1
Table 3-2 Downsampling factors for rates.
Figure 3-4 shows the original and downsampled versions of STCRs at various (rate, scale) combinations, with the downsampling factors given in Tables 3-1 and 3-2. In the extreme case of rate = 2 Hz and scale = 0.5 cyc/oct, the size of the downsampled ST is reduced to 1/32 of the original size. This downsampling dramatically decreases the overall computation.
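The arithmetic behind Tables 3-1 and 3-2 can be checked with a short script. The factors and the two upper bounds (12 cyc/oct and 50 Hz) are taken from the text; the helper names are illustrative, and the rule used for the theoretical maximum (the largest integer factor that keeps the modulation below the reduced bound) is an assumption about how the bounds were applied.

```python
# Downsampling factors from Tables 3-1 and 3-2, with the upper bounds quoted
# in the text (12 cyc/oct for scale, 50 Hz for rate).
SCALE_FACTORS = {0.5: 8, 1: 8, 2: 4, 4: 2, 8: 1}
RATE_FACTORS = {2: 4, 4: 4, 8: 2, 16: 1}
SCALE_LIMIT = 12.0   # cyc/oct
RATE_LIMIT = 50.0    # Hz


def max_factor(modulation, limit):
    """Largest integer factor such that modulation * factor does not exceed the limit."""
    return int(limit // modulation)


def size_ratio(scale, rate):
    """Size of the downsampled STCR relative to the original one."""
    return 1.0 / (SCALE_FACTORS[scale] * RATE_FACTORS[rate])


for scale, factor in SCALE_FACTORS.items():
    print(f"scale {scale:>4} cyc/oct: used {factor}, theoretical max {max_factor(scale, SCALE_LIMIT)}")
for rate, factor in RATE_FACTORS.items():
    print(f"rate  {rate:>4} Hz     : used {factor}, theoretical max {max_factor(rate, RATE_LIMIT)}")
print("size ratio at scale 0.5, rate 2:", size_ratio(0.5, 2))
```

Every factor in the tables is at or below the theoretical maximum computed this way, which matches the statement that less aggressive factors are used in practice.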
FIGURE 3-4 Examples of downsampled STCRs at various (rate, scale) combinations.
The left column shows the original STCRs and the right column shows the downsampled STCRs.
3.2.2 Frequency Band Division
In this work, we define four consecutive frames as a 40 ms “block”, which serves as our 2D processing unit. In addition to downsampling the STCRs as described in the previous section, we further divide the processing unit along the frequency axis into several smaller units to reduce the computation. Another motivation for doing this is to
match hearing perception regarding frequency weighting. Dividing our auditory spectrogram into frequency bands might give us the flexibility of adjusting parameters in each band to fit certain noise sources, for instance, car noise in specific bands. However, a more detailed study of frequency weighting is beyond the scope of this work. Here,