國立政治大學 ‧ National Chengchi University
research work relating to three different types of SE techniques, namely speech denoising, speech dereverberation, and channel compensation. Next, the chapter discusses the work related to SE models proposed in more recent years using multimodal learning strategies. In addition, model-based compression and quantization techniques to reduce the computational costs are discussed. Finally, the key research challenges involved in designing a robust SE system and the contribution of this dissertation are briefly discussed.
1.1 Background
1.1.1 Speech Denoising
In real-world applications, background noise may significantly diminish the quality and intelligibility of a speech signal acquired by a microphone, to the point that it becomes useless for subsequent processing [1]. Several single-channel SE methods have been proposed in the past to address noise reduction tasks. However, the performance of SE in real acoustic environments is not always satisfactory, because improving intelligibility and quality concurrently is a challenging problem. A class of SE methods, termed spectral restoration, aims to design a filter or transformation that attenuates the noise components to generate clean speech. Notable techniques include the Wiener filter and its extensions [12, 13, 14], the minimum mean square error (MMSE) spectral estimator [15, 16, 17], the maximum a posteriori spectral amplitude (MAPA) estimator [18] [19], the maximum likelihood spectral amplitude (MLSA) estimator [20] [21], and generalized MAPA [22]. Another popular class of SE methods adopts speech models for SE. Notable examples include the harmonic model [23], the linear prediction (LP) model [24] [25], and the hidden Markov model (HMM) [26]. A common limitation of most of these conventional methods is that they rely on either the additive nature of the background noise or the statistical properties of speech and noise signals. As a consequence, these methods fail to cope properly with the non-stationary noise of real-world scenarios in unexpected acoustic conditions.
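As a minimal sketch of the spectral restoration idea described above, the snippet below computes a Wiener-type gain for each frequency bin from a noisy power spectrum and a noise power estimate. The function name `wiener_gain`, the power-subtraction estimate of the a priori SNR, the gain floor, and the toy spectra are all illustrative assumptions and are not taken from the cited works.

```python
def wiener_gain(noisy_power, noise_power, floor=0.05):
    """Per-bin Wiener-type gain G = xi / (1 + xi).

    xi is a crude a priori SNR estimate obtained by power subtraction:
    xi = max(noisy_power / noise_power - 1, 0).
    `floor` (an assumed value) limits the maximum attenuation.
    Noise power values are assumed to be strictly positive.
    """
    gains = []
    for p_y, p_n in zip(noisy_power, noise_power):
        xi = max(p_y / p_n - 1.0, 0.0)
        gains.append(max(xi / (1.0 + xi), floor))
    return gains


# Toy power spectra (hypothetical values):
# bin 0 is noise-only, bin 2 is speech-dominated.
noisy_power = [1.0, 2.0, 10.0]
noise_power = [1.0, 1.0, 1.0]

gains = wiener_gain(noisy_power, noise_power)
# The gain is applied to the magnitude spectrum, so power scales by G^2.
enhanced_power = [g * g * p for g, p in zip(gains, noisy_power)]
```

The gain depends entirely on the noise power estimate. This makes concrete why such estimators tend to degrade when the noise statistics change faster than they can be tracked.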
Rather than assuming an explicit model, methods based on nonlinear mapping have also been adopted to address noise reduction tasks. In such approaches, stereo training data is generally needed to learn a nonlinear mapping function between noisy and clean speech. In the nonlinear mapping category, artificial neural networks (ANN) have been shown to be a viable solution to effectively address background noise issues [27] [28].
For example, in [29], a single-hidden-layer network with 160 neurons was employed to estimate the instantaneous signal-to-noise ratio (SNR) level of the amplitude modulation spectrogram (AMS), and the noise was then suppressed according to the estimated SNRs of the different channels. Alternatively, in [30, 31, 32], shallow ANNs were used to determine a mapping between the noisy and clean speech signals. Unfortunately, a lack of depth hindered comprehensive exploitation of the relationships between noisy and clean speech. By leveraging a greedy layer-wise unsupervised learning algorithm [33], often referred to as pre-training [34], deep neural networks (DNNs) can now be trained successfully, and the strong regression capabilities of deep models can be better explored.
For example, deep/stacked denoising autoencoders (DDAEs) were used to model the relationship between clean and noisy features in [35] [36]. Deep recurrent neural networks and long short-term memory (LSTM) networks have also been adopted in feature enhancement [37] [38]. In [39], a deep belief network (DBN) with a restricted Boltzmann machine (RBM) was used to design a facial expression recognition (FER) system. Akhtar et al. [40] further exploited the performance of neural networks by generating a K-support norm-based noise model to train neural networks. Meanwhile, convolutional neural networks, which have a better capability of modeling local temporal-spectral structures of speech signals, have been adopted as a fundamental model for the SE task in [41], and a deeper structure of the convolutional neural network (DCNN) was used for hand gesture recognition in [42]. A common issue with ANN-based speech enhancers is the degraded performance in the presence of unexpected noise. A simple yet effective solution to this problem is to cover many different types of noise in the training set, as proposed in [43].
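The idea of covering many noise conditions in the training set can be sketched as follows: each clean utterance is mixed with several noise types at several SNRs to form paired (noisy, clean) training examples. This is a hedged illustration only. The helper `mix_at_snr`, the synthetic signals, the noise names, and the SNR grid are assumptions made for the example, not the setup used in [43].

```python
import math


def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so the clean-to-noise power ratio equals `snr_db`, then add it."""
    p_clean = sum(s * s for s in clean) / len(clean)
    p_noise = sum(n * n for n in noise) / len(noise)
    scale = math.sqrt(p_clean / (p_noise * 10.0 ** (snr_db / 10.0)))
    return [s + scale * n for s, n in zip(clean, noise)]


# Synthetic stand-ins for a clean utterance and two noise types (hypothetical).
clean = [math.sin(0.1 * t) for t in range(200)]
noise_bank = {
    "hum": [math.sin(1.3 * t) for t in range(200)],
    "square": [1.0 if t % 7 < 3 else -1.0 for t in range(200)],
}

# One paired training example per (noise type, SNR) combination.
pairs = [
    (name, snr_db, mix_at_snr(clean, noise, snr_db), clean)
    for name, noise in noise_bank.items()
    for snr_db in (-5, 0, 5, 10)
]
```

A regression model would then be trained to map the third element of each tuple (noisy) to the fourth (clean).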
In addition to ANNs, a generalized single-hidden-layer feedforward network (GSLFN) [44] has been proposed for regression problems, in which the traditional single-layer feedforward network (SLFN) is extended by exploiting polynomial functions of the inputs as output weights. In [45], the universal enhancing capabilities of deep models were more thoroughly investigated. In particular, the authors proposed a regression DNN-based SE framework by training a deep and wide neural network architecture on a large collection of heterogeneous training data with four noise types.
1.1.2 Speech Dereverberation
Reverberation refers to the collection of reflected sounds from surfaces (e.g., walls and objects) in an acoustic enclosure. It has been shown to severely deteriorate the quality and intelligibility of speech signals for both human and machine listeners. Such deterioration can substantially affect the performance of speech-related applications, for instance, ASR [46, 47, 48] and speaker identification systems [49, 50, 51]. It can also severely hamper speech reception performance for both normal-hearing and hearing-impaired listeners [52] [53]. In the last few decades, numerous approaches have been proposed to solve the reverberation problem. The conventional speech dereverberation techniques can be categorized into three main groups [54]. The first group, referred to as source-model-based approaches, aims to separate the speech and reverberation based on prior information about clean speech structure and room reverberation effects. Notable algorithms belonging to this category include the linear prediction (LP) methods [55, 56, 57], harmonic filtering techniques [58], and probabilistic models [59] [60]. Another group of algorithms is based on homomorphic transformation, in which the reverberated speech signals are analyzed in the cepstral domain so that the reverberation can be subtracted from the signal. Notable techniques include cepstral-based processing [61] and spectral subtraction [62]. The third group of algorithms, channel inversion, employs inverse filtering to deconvolve the speech that has been convolved with the room impulse response (RIR) during reverberation. Notable techniques include the minimum mean square error (MMSE) [63], least squares, beamforming [64], and matched filtering [65]. Recently, nonlinear spectral mapping approaches have been developed to address the reverberation problem. For these approaches, ANNs are generally used to 'learn' the mapping function between the reverberated and anechoic speech [66].
More recently, the universal approximation capabilities of deeper structures have been extensively studied [67]. These studies indicate that deeper neural network structures enable strong learning capabilities, so that the reverberation problem can be handled successfully. For example, DDAEs were adopted to reconstruct the anechoic speech signal from the reverberated signal in [46] [68]. In [69] [70], LSTM- and deep recurrent neural network (DRNN)-based dereverberation systems were proposed to effectively reduce the reverberation effects by leveraging the current as well as past frames. In [71, 72, 73, 74, 75], DNN-based solutions were proposed to improve system performance by training a deeper framework to obtain a mapping from the reverberated speech signal to an anechoic one.
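The signal model behind the channel inversion group can be illustrated with a toy example: reverberant speech is modeled as the dry signal convolved with an RIR, and inverse filtering attempts to undo that convolution. The sketch below assumes a known, very short, minimum-phase RIR, so that a simple recursive deconvolution recovers the dry signal exactly. Real RIRs are long, usually unknown, and often non-minimum-phase, so this is not a practical dereverberation method. All names and values are hypothetical.

```python
def convolve(signal, rir):
    """Direct-form linear convolution of a signal with an impulse response."""
    out = [0.0] * (len(signal) + len(rir) - 1)
    for i, s in enumerate(signal):
        for k, h in enumerate(rir):
            out[i + k] += s * h
    return out


def inverse_filter(reverberant, rir, length):
    """Recursive deconvolution; assumes the RIR is known and rir[0] != 0."""
    dry = []
    for n in range(length):
        acc = reverberant[n]
        # Subtract the contribution of earlier dry samples through the RIR taps.
        for k in range(1, min(n, len(rir) - 1) + 1):
            acc -= rir[k] * dry[n - k]
        dry.append(acc / rir[0])
    return dry


rir = [1.0, 0.0, 0.5, 0.0, 0.25]  # toy RIR: direct path plus two decaying reflections
dry = [0.0, 1.0, -1.0, 0.5, 0.0, 0.0, 2.0, -0.5]

wet = convolve(dry, rir)                      # simulated reverberant signal
recovered = inverse_filter(wet, rir, len(dry))
```

The mapping-based approaches discussed above avoid the need for an explicit RIR estimate by learning the reverberant-to-anechoic relationship directly from paired data.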
1.1.3 Channel Compensation
Differences in the acoustic characteristics of recording sensors in mobile and Internet of Things (IoT) devices can cause a major channel mismatch, which is another common problem in speech-related applications. In this dissertation, we next focus on the channel mismatch problem by considering utterances recorded using two different microphones, i.e., an air-conducted microphone (ACM) and a bone-conducted microphone (BCM), as a representative channel mismatch condition. A number of filtering-based and probabilistic solutions have been proposed in the past to convert low-quality BCM utterances to high-quality ACM utterances. In [76], the BCM utterances were passed through a designed reconstruction filter to improve quality. In [77] and [78], BCM and ACM utterances were combined for SE and ASR in non-stationary noisy environments. In [79], a probabilistic optimum filter (POF)-based algorithm was used to estimate the clean features from the combination of standard and throat microphone signals. Thang et al. [80] restored bone-conducted speech in noisy environments based on a modulation transfer function (MTF) and a linear prediction (LP) model. Later, Tajiri et al. [81] proposed a noise suppression technique based on non-negative tensor factorization using a body-conducted microphone known as a non-audible murmur (NAM) microphone.
1.1.4 Multimodal Speech Enhancement
Recent studies have shown that the visual modality carries important information, such as lip motions and mouth articulations, that can help discriminate similar speech sounds in noisy conditions [82, 83, 84]. Recently, several SE methods that integrate audio and visual information have been proposed. For example, in [85] [86], fully-connected and convolutional neural network models were used to build audio-visual SE systems that improved noise reduction performance compared with audio-only frameworks. In [87], the authors proposed a deep learning-based framework to investigate the impact of the Lombard effect on the performance of the audio-visual SE system. In [88], a speech separation system was proposed that incorporated audio-visual information using a deep network-based model.
More recently, model compression, which aims to facilitate the use of deep models in real-world applications, has attracted considerable attention. Several model compression techniques have been proposed to reduce computational costs without significantly degrading the achievable performance. In addition to the state-of-the-art performance achieved by deep-learning-based techniques in different classification and regression tasks, a considerable amount of research has been done on quantization-based model compression strategies to improve the computational efficiency of deep-learning-based systems for efficient online learning without substantially degrading the system's overall performance [89, 90, 91].
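To make the quantization idea concrete, the following sketch applies symmetric uniform quantization to a small list of weights, storing each as a signed 8-bit integer code plus one shared scale factor. It is a generic illustration, not the specific scheme of [89, 90, 91]. The function names, the bit width, and the weight values are assumptions made for the example.

```python
def quantize_uniform(weights, num_bits=8):
    """Symmetric uniform quantization: w is approximated by scale * q, with integer q."""
    q_max = 2 ** (num_bits - 1) - 1  # 127 for 8 bits
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / q_max if max_abs > 0 else 1.0
    codes = [max(-q_max, min(q_max, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate floating-point weights."""
    return [q * scale for q in codes]


weights = [0.82, -0.41, 0.003, -1.27, 0.56]  # hypothetical layer weights

codes, scale = quantize_uniform(weights)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

With rounding to the nearest code, the per-weight reconstruction error is bounded by half the scale factor. This is the sense in which storage and arithmetic cost can be reduced at a controlled loss of precision.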