
National Chengchi University

research work relating to three different types of SE techniques, namely speech denoising, speech dereverberation, and channel compensation. Next, the chapter discusses work on SE models proposed in more recent years that use multimodal learning strategies. In addition, model-based compression and quantization techniques for reducing computational costs are discussed. Finally, the key research challenges involved in designing a robust SE system and the contributions of this dissertation are briefly discussed.

1.1 Background

1.1.1 Speech Denoising

In real-world applications, background noise may significantly diminish the quality and intelligibility of a speech signal acquired by a microphone, to the point that it becomes useless for subsequent processing [1]. Several single-channel SE methods have been proposed in the past to address noise reduction tasks. However, the performance of SE in real acoustic environments is not always satisfactory, because improving intelligibility and quality concurrently is a challenging problem. A class of SE methods, termed spectral restoration, aims to design a filter or transformation that attenuates the noise components to generate clean speech. Notable techniques include the Wiener filter and its extensions [12, 13, 14], the minimum mean square error (MMSE) spectral estimator [15, 16, 17], the maximum a posteriori spectral amplitude estimator (MAPA) [18, 19], the maximum likelihood spectral amplitude estimator (MLSA) [20, 21], and generalized MAPA [22]. Another popular class of SE methods adopts speech models for SE. Notable examples include the harmonic model [23], the linear prediction (LP) model [24, 25], and the hidden Markov model (HMM) [26]. A common limitation of most of these conventional methods is that they rely on either the additive nature of the background noise or the statistical properties of speech and noise signals. As a consequence, these methods fail to properly handle the non-stationary noise of real-world scenarios in unexpected acoustic conditions.
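To make the idea of spectral restoration concrete, the following minimal sketch computes a per-frequency-bin Wiener-type gain. It is an illustration under stated assumptions rather than the method of any cited work: the function name `wiener_gain`, the power-subtraction estimate of the a priori SNR, and the gain floor are all choices made here. Note that the noise power is assumed to be known and stationary, which is exactly the reliance on noise statistics identified above as a limitation.

```python
def wiener_gain(noisy_power, noise_power, floor=1e-3):
    """Return a Wiener-type gain G = xi / (1 + xi) for each frequency bin.

    noisy_power: power of the noisy spectrum per bin.
    noise_power: assumed-known, strictly positive noise power per bin.
    floor: hypothetical lower bound on the gain to avoid zeroing bins.
    """
    gains = []
    for p_noisy, p_noise in zip(noisy_power, noise_power):
        # A priori SNR estimated by power subtraction, clipped at zero.
        xi = max(p_noisy - p_noise, 0.0) / p_noise
        gains.append(max(xi / (1.0 + xi), floor))
    return gains


# Toy spectrum: a speech-dominated bin, a mixed bin, and a noise-only bin.
noisy_power = [10.0, 2.0, 1.0]
noise_power = [1.0, 1.0, 1.0]
gains = wiener_gain(noisy_power, noise_power)
```

Bins dominated by speech receive a gain close to one, while bins whose power is explained by the noise estimate are attenuated down to the floor. If the true noise changes over time, the fixed `noise_power` estimate becomes inaccurate and the gain is no longer appropriate.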

Rather than assuming an explicit model, methods based on non-linear mapping have also been adopted to address noise reduction tasks. In such approaches, stereo training data is generally needed to learn a non-linear mapping function between noisy and clean speech. In the non-linear mapping category, artificial neural networks (ANNs) have been shown to be a viable solution for effectively addressing background noise issues [27, 28]. For example, in [29], a single-hidden-layer network with 160 neurons was employed to estimate the instantaneous signal-to-noise ratio (SNR) level of the amplitude modulation spectrogram (AMS), and the noise was then suppressed according to the estimated SNRs of different channels. Alternatively, in [30, 31, 32], shallow ANNs were used to determine a mapping between the noisy and clean speech signals. Unfortunately, a lack of depth hindered comprehensive exploitation of the relationships between noisy and clean speech. By leveraging a greedy layer-wise unsupervised learning algorithm [33], often referred to as pre-training [34], deep neural networks (DNNs) can now be trained successfully, and the strong regression capabilities of deep models can be better explored.

For example, deep/stacked denoising autoencoders (DDAEs) were used to model the relationship between clean and noisy features in [35, 36]. Deep recurrent neural networks and long short-term memory (LSTM) networks have also been adopted in feature enhancement [37, 38]. In [39], a deep belief network (DBN) with a restricted Boltzmann machine (RBM) was used to design a facial expression recognition (FER) system. Akhtar et al. [40] further exploited the capabilities of neural networks by generating a K-support norm-based noise model with which to train them. Meanwhile, convolutional neural networks, which are better at modeling the local temporal-spectral structures of speech signals, have been adopted as a fundamental model for the SE task in [41], and a deeper convolutional neural network (DCNN) structure was used for hand gesture recognition in [42]. A common issue with ANN-based speech enhancers is degraded performance in the presence of unexpected noise. A simple yet effective solution to this problem is to cover many different types of noise in the training set, as proposed in [43].
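The stereo (noisy, clean) training pairs required by mapping-based methods, and the strategy of covering many noise conditions in the training set, can be sketched as follows. This is a minimal illustration under assumptions: the helper names `mean_power` and `mix_at_snr`, the synthetic sine and Gaussian signals, and the SNR levels are hypothetical and are not taken from [43].

```python
import math
import random


def mean_power(signal):
    """Average power of a sequence of samples."""
    return sum(x * x for x in signal) / len(signal)


def mix_at_snr(clean, noise, snr_db):
    """Add noise to clean speech, scaled so the mixture has the requested SNR (dB)."""
    if len(clean) != len(noise):
        raise ValueError("clean and noise must have the same length")
    target_ratio = 10.0 ** (snr_db / 10.0)
    # Scale the noise so that P_clean / P_scaled_noise equals target_ratio.
    scale = math.sqrt(mean_power(clean) / (mean_power(noise) * target_ratio))
    return [c + scale * n for c, n in zip(clean, noise)]


# Synthetic stand-ins for a clean utterance and a noise recording.
random.seed(0)
clean = [math.sin(2.0 * math.pi * 5.0 * t / 200.0) for t in range(200)]
noise = [random.gauss(0.0, 1.0) for _ in range(200)]

# One (noisy, clean) training pair per SNR condition.
snr_levels_db = (-5, 0, 5, 10)
training_pairs = [(mix_at_snr(clean, noise, snr), clean) for snr in snr_levels_db]
```

Repeating the same procedure over several noise recordings and SNR levels yields a multi-condition training set, in which each noisy input is paired with its clean target.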

In addition to ANNs, a generalized single-hidden-layer feed-forward network (GSLFN) [44] has been proposed for regression problems, in which the traditional single-layer feed-forward network (SLFN) is extended by exploiting polynomial functions of the inputs as output weights. In [45], the universal enhancing capabilities of deep models were more thoroughly investigated. In particular, the authors proposed a regression DNN-based SE framework by training a deep and wide neural network architecture using a large collection of heterogeneous training data with four noise types.

1.1.2 Speech Dereverberation

Reverberation refers to the collection of reflected sounds from surfaces (e.g., walls and objects) in an acoustic enclosure. It has been shown to severely deteriorate the quality and intelligibility of speech signals for both human and machine listeners. Such deterioration can substantially affect the performance of speech-related applications, such as ASR [46, 47, 48] and speaker identification systems [49, 50, 51]. It can also severely hamper speech reception performance for both normal-hearing and hearing-impaired listeners [52, 53]. In the last few decades, numerous approaches have been proposed to solve the reverberation problem. Conventional speech dereverberation techniques can be categorized into three main groups [54]. The first group, referred to as source-model-based approaches, aims to separate the speech and reverberation based on prior information about clean speech structures and room reverberation effects. Notable algorithms belonging to this category include linear prediction (LP) methods [55, 56, 57], harmonic filtering techniques [58], and probabilistic models [59, 60]. Another group of algorithms is based on homomorphic transformation, in which the reverberated speech signals are analyzed in the cepstral domain so that the reverberation can simply be subtracted from the signal. Notable techniques include cepstral-based processing [61] and spectral subtraction [62]. The third group of algorithms is based on channel inversion and employs inverse filtering to deconvolve the speech that was convolved with the room impulse response (RIR) during reverberation. Notable techniques include the minimum mean square error (MMSE) method [63], least squares, beamforming [64], and matched filtering [65]. Recently, nonlinear spectral mapping approaches have been developed to address the reverberation problem. In these approaches, ANNs are generally used to "learn" the mapping function between the reverberated and anechoic speech [66].
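The channel-inversion view described above can be illustrated with a small sketch in which reverberation is modeled as convolution with an RIR and then undone by inverse filtering. This is an idealized example under assumptions: the function names and the three-tap RIR are hypothetical, the RIR is assumed to be known exactly, and its first tap is assumed to be non-zero. In practice the RIR is unknown and typically not minimum-phase, so exact inversion of this kind is generally not available.

```python
def convolve(signal, rir):
    """Model reverberation as linear convolution of the signal with the RIR."""
    out = [0.0] * (len(signal) + len(rir) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(rir):
            out[i + j] += s * h
    return out


def inverse_filter(reverberant, rir, length):
    """Recover `length` samples of the source by recursive deconvolution.

    Uses y[n] = sum_k h[k] x[n-k], solved for x[n]; requires rir[0] != 0.
    """
    estimate = []
    for n in range(length):
        acc = reverberant[n]
        for k in range(1, min(n, len(rir) - 1) + 1):
            acc -= rir[k] * estimate[n - k]
        estimate.append(acc / rir[0])
    return estimate


# Toy anechoic signal and a short, decaying RIR (direct path plus two reflections).
anechoic = [1.0, -2.0, 3.0, 0.5]
rir = [1.0, 0.5, 0.25]

reverberant = convolve(anechoic, rir)
recovered = inverse_filter(reverberant, rir, len(anechoic))
```

With a known, well-behaved RIR the recovered samples match the anechoic signal. The learning-based mapping approaches discussed in this section avoid the need for an explicit RIR estimate.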

More recently, the universal approximation capabilities of deeper structures have been extensively studied [67]. These studies indicate that deeper neural network structures provide strong learning capabilities, allowing the reverberation problem to be handled successfully. For example, DDAEs were adopted to reconstruct the anechoic speech signal from the reverberated signal in [46, 68]. In [69, 70], LSTM- and deep recurrent neural network (DRNN)-based dereverberation systems were proposed to effectively reduce reverberation effects by leveraging the current as well as past frames. In [71, 72, 73, 74, 75], DNN-based solutions were proposed to improve system performance by training a deeper framework to obtain a mapping from the reverberated speech signal to an anechoic one.


1.1.3 Channel Compensation

Differences in the acoustic characteristics of recording sensors in mobile and Internet of Things (IoT) devices can cause a major channel mismatch, which is another common problem in speech-related applications. In this dissertation, we focus on the channel mismatch problem by considering utterances recorded using two different microphones, i.e., an air-conducted microphone (ACM) and a bone-conducted microphone (BCM), as a representative channel mismatch condition. A number of filtering-based and probabilistic solutions have been proposed in the past to convert low-quality BCM utterances to high-quality ACM utterances. In [76], the BCM utterances were passed through a designed reconstruction filter to improve quality. In [77] and [78], BCM and ACM utterances were combined for SE and ASR in non-stationary noisy environments. In [79], a probabilistic optimum filter (POF)-based algorithm was used to estimate the clean features from the combination of standard and throat microphone signals. Thang et al. [80] restored bone-conducted speech in noisy environments based on a modulation transfer function (MTF) and a linear prediction (LP) model. Later, Tajiri et al. [81] proposed a noise suppression technique based on non-negative tensor factorization using a body-conducted microphone known as a non-audible murmur (NAM) microphone.

1.1.4 Multimodal Speech Enhancement

Recent studies have shown that the visual modality carries important information, such as lip motions and mouth articulations, that can help discriminate similar speech sounds in noisy conditions [82, 83, 84]. Recently, several SE methods that integrate audio and visual information have been proposed. For example, in [85, 86], fully-connected and convolutional neural network models were used to build audio-visual SE systems, which improved noise reduction performance compared with audio-only frameworks. In [87], the authors proposed a deep learning-based framework to investigate the impact of the Lombard effect on the performance of the audio-visual SE system. In [88], a speech separation system was proposed that incorporated audio-visual information using a deep network-based model.

More recently, model compression, which aims to facilitate the use of deep models in real-world applications, has attracted considerable attention. Several model compression techniques have been proposed to reduce computational costs without significantly degrading the achievable performance. Alongside the state-of-the-art performance achieved by deep-learning-based techniques in various classification and regression tasks, a considerable amount of research has been devoted to quantization-based model compression strategies that reduce the computational burden of deep-learning-based systems, enabling efficient online learning without substantially degrading overall system performance [89, 90, 91].
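As a rough illustration of quantization-based compression, the sketch below applies uniform (affine) quantization to a list of model weights and then reconstructs them. It is a generic textbook scheme shown under assumptions, not necessarily the strategy used in [89, 90, 91]: the function names, the 8-bit setting, and the example weights are hypothetical.

```python
def quantize_uniform(weights, num_bits=8):
    """Map floating-point weights to integer codes in [0, 2**num_bits - 1]."""
    w_min, w_max = min(weights), max(weights)
    levels = 2 ** num_bits - 1
    # Step size between adjacent quantization levels (guard against constant weights).
    scale = (w_max - w_min) / levels if w_max > w_min else 1.0
    codes = [round((w - w_min) / scale) for w in weights]
    return codes, scale, w_min


def dequantize(codes, scale, w_min):
    """Reconstruct approximate weights from integer codes."""
    return [c * scale + w_min for c in codes]


# Hypothetical weights from one layer of a model.
weights = [-0.51, -0.1, 0.0, 0.27, 0.49]
codes, scale, w_min = quantize_uniform(weights, num_bits=8)
restored = dequantize(codes, scale, w_min)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

Each weight is stored as a small integer plus a shared scale and offset, so the storage cost per weight drops from a full floating-point value to `num_bits` bits, at the price of a reconstruction error bounded by half the step size.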