
SERIL: Noise Adaptive Speech Enhancement using Regularization-based Incremental Learning

Chi-Chang Lee^{1,2}, Yu-Chen Lin^{1,2}, Hsuan-Tien Lin^{1}, Hsin-Min Wang^{3}, Yu Tsao^{2}

^{1} Department of Computer Science and Information Engineering, National Taiwan University, Taiwan
^{2} Research Center for Information Technology Innovation, Academia Sinica, Taiwan
^{3} Institute of Information Science, Academia Sinica, Taiwan

r08922a27@csie.ntu.edu.tw, f04922077@csie.ntu.edu.tw, htlin@csie.ntu.edu.tw, whm@iis.sinica.edu.tw, yu.tsao@citi.sinica.edu.tw

Abstract

Numerous noise adaptation techniques have been proposed to fine-tune deep-learning models in speech enhancement (SE) for mismatched noise environments. Nevertheless, adaptation to a new environment may lead to catastrophic forgetting of the previously learned environments. The catastrophic forgetting issue degrades the performance of SE in real-world embedded devices, which often revisit previous noise environments. The nature of embedded devices does not allow solving the issue with additional storage of all pre-trained models or earlier training data. In this paper, we propose a regularization-based incremental learning SE (SERIL) strategy, complementing existing noise adaptation strategies without using additional storage. With a regularization constraint, the parameters are updated to the new noise environment while retaining the knowledge of the previous noise environments. The experimental results show that, when faced with a new noise domain, the SERIL model outperforms the unadapted SE model. Meanwhile, compared with the current adaptive technique based on fine-tuning, the SERIL model can reduce the forgetting of previous noise environments by 52%. The results verify that the SERIL model can effectively adjust itself to new noise environments while overcoming the catastrophic forgetting issue. These results make SERIL a favorable choice for real-world SE applications, where the noise environment changes frequently.

Index Terms: Speech enhancement, incremental learning, lifelong learning, noise adaptation, catastrophic forgetting

1. Introduction

The objective of speech enhancement (SE) is to transform low-quality speech signals into enhanced-quality speech signals [1].

In many speech-related applications such as automatic speech recognition (ASR) [2] and speech emotion recognition [3], SE is used as a preprocessor to remove noise components from speech signals. In many portable or assistive-hearing devices, such as mobile phones [4], hearing aids [5], and cochlear implants [6], SE is crucial for increasing speech intelligibility and quality in noise environments.

In the past few years, deep learning (DL)-based models have been widely used for SE [7–15]. Various deep neural networks such as convolutional neural networks (CNNs), recurrent neural networks (RNNs), and long short-term memory (LSTM) networks have been used as fundamental models in SE systems. In these systems, some metrics are defined to measure the distance between the enhanced output and the clean reference, and the DL models are trained to minimize the distance. The L1 and L2 (mean-square-error) losses are commonly used because of their ease of computation and differentiability. However, these two losses may not be optimal for specific tasks, and thus other metrics have been used as the loss to train the DL models [16, 17].

In addition to model types and loss functions, another important consideration for the success of an SE system is its ability to adapt to new environments, particularly when deployed in embedded devices. In real-world situations, the noise in the testing environment is unseen in the training set; moreover, the noise types often vary over time. The mismatch between training and testing environments can significantly degrade the performance of SE. Therefore, identifying an approach that can efficiently and effectively adapt DL models to new testing conditions and improve the performance of SE is necessary. Thus far, several domain adaptation approaches [18–21] have been proposed to address the training-testing acoustic mismatch issue, which is also known as the domain shift problem. Although noise-adapted models can provide improved SE results for these conventional approaches, they often suffer from a catastrophic forgetting effect [22, 23]. In other words, when DL models adapt to a new noise environment, they usually perform poorly when dealing with previously adapted noise environments.

In this paper, we propose a regularization-based incremental learning strategy for adapting DL-based SE models to new environments (speakers and noise types) while handling the catastrophic forgetting issue. The proposed method is termed SERIL. SERIL exploits the advantages of two well-known incremental learning algorithms: (1) whole past optimization path information [24] and (2) a curvature-based strategy [25]. We evaluated SERIL using two datasets: the Voice Bank corpus (VCB) [26] and the TIMIT corpus [27], which were used to form the training and testing sets, respectively. The overall SERIL scheme includes two phases: offline and online. In the offline phase, we first trained the DL model on the utterances from the VCB corpus with 13 different types of noise. In the online phase, SERIL first adapts the pre-trained model based on a small amount of adaptation data; then, the adapted model is used for SE. A direct fine-tuning model adaptation approach was implemented for comparison. Experimental results show that SERIL and the direct fine-tuning approach both effectively adapt the SE model to new environments and improve SE performance, compared with the pre-trained DL model without adaptation. Moreover, compared to the direct fine-tuning approach, SERIL maintained high SE performance on all previously learned types of noise, thus effectively addressing the catastrophic forgetting problem.

The remainder of this paper is organized as follows. Section 2 presents related work and explains the motivation for using an incremental learning strategy to address the noise adaptation issue in speech enhancement. In Section 3, we detail the philosophy behind the proposed SERIL system. The experimental setup and results are reported in Section 4. Finally, Section 5 presents some concluding remarks.


2. Related Work and Motivation

An intuitive SE method to overcome the mismatch problem is to collect as many types of noise as possible to increase the generalization ability [14]. However, it is impractical to cover the infinite variety of noise that may be encountered in real situations.

Several approaches [20, 21] have been proposed to directly fine-tune a pre-trained model to improve the performance in a target domain. When entering a new circumstance, these algorithms only focus on the current noise domain and ignore the memory of previously learned noise types. In many applications, such as edge devices, the type of noise changes frequently, and it is common to re-encounter learned types of noise. However, the adapted SE model cannot perform well on previously learned noise types. This effect is called catastrophic forgetting [22, 23]. Although the SE model can be fine-tuned every time the environment changes, the repeated model adaptation process results in high computation and time costs.

Figure 1: Relationship between fine-tuning and incremental learning from source noise domain to unseen target domain.

The above limitations of adaptation methods based on direct fine-tuning motivated us to apply incremental learning to SE. Incremental learning is also known as continuous learning or lifelong learning. Figure 1 illustrates the relationship between direct fine-tuning and incremental learning. Training trajectories are illustrated in a schematic parameter space, with parameter regions leading to good performance on the source (yellow region) and target (blue region), denoted as tasks S and T, respectively. After learning task S, the parameters are located at $\theta_S$. As shown by the dashed arrow in Figure 1, when the SE model is adapted by taking gradient steps to minimize the loss based on task T alone, the resulting $\theta_T$ is beyond the good-performance region of task S, i.e., what was already learned in task S is forgotten. In contrast, in incremental learning, the SE model weights are updated to the target domain while retaining the knowledge learned from the source domain. This is often realized by finding the overlapping region of the source and target domains, as illustrated by the solid-arrow learning trajectory in Figure 1. In this way, incremental learning helps the resulting model provide good SE results in the target domain while maintaining satisfactory performance in the source domain.

3. The SERIL System

3.1. Architecture and loss function of the SERIL system

The architecture of the SERIL system is depicted in Figure 2. The system performs SE in the spectral domain. Speech waveforms are first converted into time-frequency features using a 512-point short-time Fourier transform (STFT) with a Hamming window of 32 ms and a hop size of 16 ms. Each feature vector consists of 257 elements. The enhanced spectral features are converted back into the waveform domain by inverse STFT with the overlap-add method. In the SERIL system, the first 3 layers are LSTM layers (unidirectional LSTMs are used to enable real-time inference). The hidden dimension of each LSTM is 257. A fully connected layer is concatenated to the output of the last LSTM layer for scaling.
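To make the pipeline concrete, the following PyTorch sketch mirrors the settings described above (512-point STFT, 32 ms Hamming window, 16 ms hop, 257-dimensional magnitude features, three unidirectional LSTM layers with hidden size 257, and a fully connected output layer). The class and function names are ours, and the 16 kHz sampling rate is an assumption not stated in the text; this is an illustrative reconstruction, not the released implementation.

```python
import torch
import torch.nn as nn

class SERILNet(nn.Module):
    """Sketch of the SE model: three unidirectional LSTM layers + FC scaling layer."""
    def __init__(self, feat_dim=257, hidden_dim=257):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_dim, num_layers=3, batch_first=True)
        self.fc = nn.Linear(hidden_dim, feat_dim)  # scales LSTM output back to 257 bins

    def forward(self, noisy_mag):                  # (batch, frames, 257)
        h, _ = self.lstm(noisy_mag)
        return self.fc(h)                          # enhanced magnitude estimate

def extract_features(waveform, n_fft=512, hop=256):
    """512-point STFT; 32 ms Hamming window and 16 ms hop assume 16 kHz audio."""
    window = torch.hamming_window(n_fft)
    spec = torch.stft(waveform, n_fft=n_fft, hop_length=hop,
                      window=window, return_complex=True)
    return spec.abs().transpose(-1, -2)            # (batch, frames, 257) magnitudes
```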

Figure 2: Architecture of the SERIL system using the short-time spectral amplitude SDR ($\mathrm{SDR_{STSA}}$) as the loss function.

As mentioned earlier, the L1 and L2 norms are commonly used as the loss function to train DL-based SE models. In this study, we derived another loss function based on the short-time spectral amplitude SDR ($\mathrm{SDR_{STSA}}$), which provided better results than the L1 and L2 norms in our preliminary experiments. In a previous study, Kolbæk et al. [28] reported that using the time-domain SDR [29, 30] as the loss can help SE models achieve improved performance. Because the input and output of SERIL are both spectral features, we need to modify the original SDR loss to use it in the spectral domain. We note that SDR can be regarded as the energy ratio of the enhanced speech projected onto the clean-speech space over the enhanced speech projected onto the orthogonal space of the clean speech. By Parseval's theorem [31] and the linearity of the Fourier transform, the energy ratio in the time domain is equivalent to that in the time-frequency domain. Therefore, we define $\mathrm{SDR_{STSA}}$ as follows:

$$\mathrm{SDR_{STSA}}(\hat{X}, X) = 10\log_{10}\frac{\|\alpha X\|^2}{\|\alpha X - \hat{X}\|^2}. \qquad (1)$$

Given the noisy spectral features $Y$, the SE model aims to generate enhanced spectral features $\hat{X}$. The scalar $\alpha$ is computed as $(X \cdot \hat{X})/\|X\|^2$, where $X$ denotes the target clean spectral features. In addition, $f_\theta(\cdot)$ equals $\hat{X}$; thus, we denote our loss function $-\mathrm{SDR_{STSA}}(f_\theta(Y), X)$ as $l_\theta(Y)$.
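A minimal PyTorch sketch of this loss follows, treating each utterance's spectral magnitudes as one flattened vector; the function name and tensor layout are our own choices.

```python
import torch

def sdr_stsa_loss(enhanced, clean, eps=1e-8):
    """l_theta(Y) = -SDR_STSA(f_theta(Y), X), following Eq. (1)."""
    x = clean.flatten(1)                # X, flattened per utterance
    x_hat = enhanced.flatten(1)         # X_hat = f_theta(Y)
    # alpha = (X . X_hat) / ||X||^2: projection coefficient onto the clean space
    alpha = (x * x_hat).sum(-1, keepdim=True) / (x.pow(2).sum(-1, keepdim=True) + eps)
    target = alpha * x
    sdr = 10.0 * torch.log10(
        target.pow(2).sum(-1) / ((target - x_hat).pow(2).sum(-1) + eps))
    return -sdr.mean()                  # minimizing the negative maximizes SDR_STSA
```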

3.2. Curvature-based regularization strategy

Considering the losses in the previous and new acoustic environments, $L_{old}$ and $L_{new}$, respectively, the total loss can be formulated as:

$$L(\theta) = L_{new}(\theta) + L_{old}(\theta). \qquad (2)$$

Because the training data of the previous environment is usually not accessible online, we cannot calculate $L_{old}(\theta)$ directly. Instead, we assume that the loss of the previous environment can be revealed from the learned SE model, $\theta^*$. By approximating $L_{old}$ using the second-order Taylor expansion at $\theta = \theta^*$, we have

$$L_{old}(\theta) \approx L_{old}(\theta^*) + \delta\theta^T\,\nabla_\theta L_{old}(\theta^*) + \frac{1}{2}\,\delta\theta^T H(\theta^*)\,\delta\theta, \qquad (3)$$

where $\delta\theta$ is $\theta - \theta^*$; $H(\theta^*)$ is the Hessian matrix of $L_{old}$ at $\theta = \theta^*$; and $L_{old}(\theta^*)$ is a constant. Because the elements in $\nabla_\theta L_{old}(\theta^*)$ are generally small enough to be ignored, we obtain the approximate form $L_{old}(\theta) \approx \frac{1}{2}\,\delta\theta^T H(\theta^*)\,\delta\theta$.

Similar to elastic weight consolidation [25, 32], we ignore the cross terms in $H(\theta^*)$ to improve computational efficiency. The approximate form becomes

$$H(\theta^*) \approx \mathrm{diag}\Big(\mathbb{E}_{Y\sim D_{old}}\big[(\nabla_\theta l_\theta(Y))(\nabla_\theta l_\theta(Y))^T\big]\Big)\Big|_{\theta=\theta^*}, \qquad (4)$$

where $Y$ is a speech sample from the previous environment $D_{old}$. Finally, substituting (3) and (4) into (2), we have

$$L(\theta) \approx L_{new}(\theta) + \lambda\sum_i F_{\theta^*_i}(\theta_i - \theta^*_i)^2, \qquad (5)$$

where $\lambda$ is a hyperparameter; $i$ is the index of the parameters in the model; $\theta_i$ and $\theta^*_i$ are the $i$-th parameters in the current and previous environments, respectively; and $F_{\theta^*_i}$ is the $i$-th diagonal element of $H(\theta^*)$. The intuitive interpretation of $F_{\theta^*_i}$ is the local curvature, which indicates how sensitive the performance in the previous acoustic environment is to changes in $\theta_i$.
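Empirically, the diagonal of (4) can be estimated by averaging the squared per-parameter gradients of $l_\theta$ over samples from the previous environment. The sketch below assumes the `SERILNet` model and `sdr_stsa_loss` from the earlier sketches; the helper name is hypothetical.

```python
import torch

def estimate_fisher(model, dataloader, loss_fn):
    """Diagonal approximation of Eq. (4): average squared gradients over D_old."""
    fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    n_batches = 0
    for noisy, clean in dataloader:     # samples Y ~ D_old
        model.zero_grad()
        loss_fn(model(noisy), clean).backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                fisher[n] += p.grad.detach().pow(2)
        n_batches += 1
    return {n: f / max(n_batches, 1) for n, f in fisher.items()}
```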

Kolouri et al. [33] provided a different explanation of the regularization term from a geometric point of view, which can be applied to our scenario. As $\theta \to \theta^*$, $\frac{1}{2}\|\theta - \theta^*\|^2_{F_{\theta^*}}$ can be interpreted as the expectation of the squared difference of the loss values over the training samples of the previous environment, i.e., $\mathbb{E}_{Y \sim D_{old}}[\frac{1}{2}(l_\theta(Y) - l_{\theta^*}(Y))^2]$. Similar to (3), this distance can be approximated by $\sum_i F_{\theta^*_i}(\theta_i - \theta^*_i)^2$, which is also derived by the second-order Taylor expansion of $\mathbb{E}_{Y \sim D_{old}}[\frac{1}{2}(l_\theta(Y) - l_{\theta^*}(Y))^2]$ at $\theta = \theta^*$. Referring to [32–34], we apply the interpolation approach to the case of multiple tasks. Given $\tilde{F}_{\theta^{t-1}}$ derived from all previous tasks, $\tilde{F}_{\theta^t}$ is updated as

$$\tilde{F}_{\theta^t} = \alpha F_{\theta^t} + (1 - \alpha)\,\tilde{F}_{\theta^{t-1}}, \qquad (6)$$

where $t$ is the index of the task; $\alpha$ is a hyperparameter in $[0, 1]$; $F_{\theta^t}$ denotes $F_\theta$ derived from the $(t-1)$-th task; and $\tilde{F}_{\theta^t}$ is the interpolation of $\tilde{F}_{\theta^{t-1}}$ and $F_{\theta^t}$, combining the accumulated past information with the current curvature.
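The accumulation in (6) is then a per-parameter convex combination of the new and accumulated importance estimates; a one-step sketch with an assumed default `alpha`:

```python
def update_fisher(fisher_new, fisher_accum, alpha=0.5):
    """Eq. (6): F~_t = alpha * F_t + (1 - alpha) * F~_{t-1}."""
    if fisher_accum is None:            # first task: nothing accumulated yet
        return fisher_new
    return {n: alpha * fisher_new[n] + (1.0 - alpha) * fisher_accum[n]
            for n in fisher_new}
```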

3.3. Path optimization augmenting approach

Although $F_{\theta^*}$ is a principled tool for avoiding catastrophic forgetting, the commonly used curvature-based methods [25, 32] derive $F_{\theta^*}$ via point estimation, which only captures local curvature information around $\theta^*$. In contrast, the path optimization-based method [24] considers the information over the entire optimization path on the loss surface. In particular, the importance score is determined by accumulating contributions over the whole training trajectory, as illustrated in Figure 3.

Figure 3: Relationship between the real loss (blue), the curvature-based approximate loss (green), and the path optimization-based approximate loss (red) while adapting the SE model. $t = 0$ and $t = T$ are the start and end times, respectively.

By using the first-order Taylor approximation and setting $t_s$ and $t_e$ as the start and end steps of the $t$-th task, the change in the loss $L$ over the time from $t_s$ to $t_e$ can be written as

$$L(\theta(t_e)) - L(\theta(t_s)) \approx \int_{t_s}^{t_e} \nabla_\theta L(\theta(t)) \cdot \frac{d\theta(t)}{dt}\,dt = \sum_i \int_{t_s}^{t_e} \frac{\partial L}{\partial \theta_i}\,\frac{d\theta_i}{dt}\,dt, \qquad (7)$$

where $i$ is the index of the SE model parameter. To simplify the description, we denote $\int_{t_s}^{t_e} \frac{\partial L}{\partial \theta_i}\frac{d\theta_i}{dt}\,dt$ as $-\Delta L^t_i$. Therefore, the change in the total loss can be represented as the summation of the individual losses $\Delta L^t_i$ associated with each parameter. We put the minus sign on $\Delta L^t_i$ to make its sign consistent with the regularization term. In practice, we replace $\int_{t_s}^{t_e} \frac{\partial L}{\partial \theta_i}\frac{d\theta_i}{dt}\,dt$ with $\sum_{\tau=t_s}^{t_e-1} \frac{\partial L}{\partial \theta_i}\big(\theta_i(\tau+1) - \theta_i(\tau)\big)$, where $\tau$ is the index of the training iteration. Following [24], the importance score as we begin to train the $t$-th task is defined as

$$S^t_{\theta_i} = \sum_{t' < t} \frac{\Delta L^{t'}_i}{(\Delta\theta^{t'}_i)^2 + \epsilon}, \qquad (8)$$

where $t'$ is the index of a task before the $t$-th task; $\theta^{t'}_i$ is the $i$-th parameter of the SE model derived from training the $t'$-th task; $\Delta\theta^{t'}_i$ is $\theta^{t'}_i - \theta^{t'-1}_i$; and $\epsilon$ is a hyperparameter with a positive value.
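In implementation terms, the path integral in (7) is accumulated online as a running sum of $-\partial L/\partial\theta_i \cdot \Delta\theta_i$ after each optimizer step, and (8) is applied at the end of each task. The following bookkeeping class is a sketch in the spirit of [24]; all names are ours.

```python
import torch

class PathImportance:
    """Accumulates per-parameter path contributions for Eqs. (7)-(8)."""
    def __init__(self, model, eps=1e-3):
        self.eps = eps
        params = dict(model.named_parameters())
        self.omega = {n: torch.zeros_like(p) for n, p in params.items()}   # running -dL * dtheta
        self.scores = {n: torch.zeros_like(p) for n, p in params.items()}  # S in Eq. (8)
        self.prev = {n: p.detach().clone() for n, p in params.items()}
        self.task_start = {n: p.detach().clone() for n, p in params.items()}

    def step(self, model):
        """Call after each optimizer step: omega_i += -(dL/dtheta_i) * (theta_i(tau+1) - theta_i(tau))."""
        for n, p in model.named_parameters():
            if p.grad is not None:
                self.omega[n] += -p.grad.detach() * (p.detach() - self.prev[n])
            self.prev[n] = p.detach().clone()

    def end_task(self, model):
        """At the end of a task, apply Eq. (8) and reset the per-task accumulators."""
        for n, p in model.named_parameters():
            delta = p.detach() - self.task_start[n]          # Delta theta over the task
            self.scores[n] += self.omega[n] / (delta.pow(2) + self.eps)
            self.omega[n].zero_()
            self.task_start[n] = p.detach().clone()
```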

Similar to [34], we combine the advantages of the curvature-based [25, 32] and path optimization-based [24] approaches. The importance of parameter $\theta_i$ when training the $t$-th task can be written as $(1-\beta)\tilde{F}_{\theta^t_i} + \beta S^t_{\theta_i}$. Therefore, the training loss is defined as:

$$\tilde{L}^t(\theta) = L^t(\theta) + \lambda\sum_i \big((1-\beta)\tilde{F}_{\theta^t_i} + \beta S^t_{\theta_i}\big)\big(\theta_i - \theta^{t-1}_i\big)^2, \qquad (9)$$

where $t$ is the index of the task (if $t$ is zero, $\tilde{L}^t(\theta)$ is equivalent to $L^t(\theta)$); $\theta^{t-1}_i$ is the $i$-th parameter after training the $(t-1)$-th task; and $\beta$ is a scalar in $[0, 1]$ that determines the relative weight of the two strategies.
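Putting the pieces together, (9) adds a weighted quadratic penalty to the task loss. The sketch below assumes `sdr_stsa_loss` and the importance dictionaries from the earlier sketches; `lam` and `beta` are hypothetical default values, not tuned settings from the paper.

```python
def seril_loss(model, noisy, clean, fisher, scores, theta_prev, lam=1.0, beta=0.5):
    """Eq. (9): task loss plus the weighted quadratic penalty around theta^{t-1}."""
    task_loss = sdr_stsa_loss(model(noisy), clean)
    penalty = 0.0
    for n, p in model.named_parameters():
        importance = (1.0 - beta) * fisher[n] + beta * scores[n]
        penalty = penalty + (importance * (p - theta_prev[n]).pow(2)).sum()
    return task_loss + lam * penalty
```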

4. Experiment and Analysis

4.1. Experimental Setup

We evaluated the proposed SERIL system on two speech corpora: VCB [26] and TIMIT [27]. Three data sets were prepared, namely, the training, adaptation, and testing sets. For the training set, 2,000 utterances were randomly selected from the VCB corpus. Each utterance was contaminated with 13 types of noise (obtained from the NOISEX-92 database [35]) at 6 signal-to-noise ratio (SNR) levels (ranging from -3 dB to 12 dB with a step of 3 dB), amounting to 156,000 (= 2,000 × 13 × 6) paired noisy-clean utterances in total. This training set is termed T0. To prepare the adaptation sets, we randomly selected another 300 utterances from the VCB corpus. These 300 utterances were contaminated with 4 other types of noise (obtained from the Nonspeech database [36]): cough, door moving, footsteps, and clap, at the same 6 SNR levels, to form 4 adaptation sets, termed T1, T2, T3, and T4. Each set contained 1,800 (= 300 × 6) paired noisy-clean utterances.
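For reference, contaminating a clean utterance with noise at a target SNR amounts to scaling the noise by the appropriate energy ratio; a minimal NumPy sketch (corpus loading is omitted, and the function name is ours):

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so that the clean-to-noise energy ratio equals `snr_db`, then mix."""
    noise = np.resize(noise, clean.shape)        # loop/crop noise to the utterance length
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(clean_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return clean + scale * noise

# e.g., the six SNR levels used here: -3, 0, 3, 6, 9, and 12 dB
```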

For the testing set, we selected 1,680 utterances from the TIMIT data set. There were a total of five testing sets. The first testing set, E0, corresponded to the training set T0. The other four testing sets, E1 to E4, corresponded to the adaptation sets T1 to T4. For the testing set E0, there were 1,680 noisy utterances, and the noise types and SNR levels were the same as those used in T0; each utterance was contaminated with one of the 13 noise types at a particular SNR level (one of the 6 SNR levels, randomly specified). For each of the testing sets E1 to E4, there were also 1,680 noisy utterances, and each utterance was contaminated with one noise type at a particular SNR level (one of the 6 SNR levels, randomly specified). Our implementation is publicly available for reproducibility1.

Three standardized evaluation metrics were used to measure the performance: the perceptual evaluation of speech quality (PESQ) [37], the short-time objective intelligibility measure (STOI) [38], and the extended STOI (eSTOI) [39]. PESQ was designed to evaluate the quality of processed speech; the higher the PESQ, the better the speech quality. Both STOI and eSTOI were designed to compute speech intelligibility; the higher the

1https://github.com/ChangLee0903/SERIL


Figure 4: $\mathrm{SDR_{STSA}}$ scores of incrementally learned models evaluated on the five testing sets: (a) E0: original, (b) E1: cough, (c) E2: door moving, (d) E3: footsteps, (e) E4: clap. The x-axis lists the incrementally learned models M0, M1, M2, M3, and M4. The y-axis presents the $\mathrm{SDR_{STSA}}$ score. The scores of the unprocessed noisy speech, baseline model, direct fine-tuning approach, and proposed SERIL are represented by yellow, gray, black, and blue lines, respectively.

STOI and eSTOI scores, the better the speech intelligibility. In addition, we also report the $\mathrm{SDR_{STSA}}$ scores to illustrate the learning process; the higher the $\mathrm{SDR_{STSA}}$ score, the smaller the distortion of the spectral features.
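These three metrics are available in open-source Python packages; the usage sketch below assumes 16 kHz waveforms and the third-party `pesq` and `pystoi` packages, since the evaluation toolchain is not specified here.

```python
from pesq import pesq        # pip install pesq
from pystoi import stoi      # pip install pystoi

def evaluate(clean_wav, enhanced_wav, fs=16000):
    """PESQ, STOI, and eSTOI for one utterance; higher is better for all three."""
    return {
        "pesq": pesq(fs, clean_wav, enhanced_wav, "wb"),        # wideband PESQ
        "stoi": stoi(clean_wav, enhanced_wav, fs, extended=False),
        "estoi": stoi(clean_wav, enhanced_wav, fs, extended=True),
    }
```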

4.2. Experimental Results

First, we compared SERIL and the direct fine-tuning approach in terms of the adaptation capability and the degree of catastrophic forgetting. We used the training set T0 to train one baseline model, termed M0. Then, based on the four adaptation sets, we sequentially adapted the model from M0 to M1 using T1, M1 to M2 using T2, M2 to M3 using T3, and M3 to M4 using T4. The five models (M0 to M4) were then tested on the five testing sets (E0 to E4). The $\mathrm{SDR_{STSA}}$ scores of the five models tested on the five testing sets are shown in Figure 4. The results of the baseline model without adaptation and the scores of unprocessed noisy speech are also given for comparison.

From the figure, we note that although the baseline model M0 performs well on E0, where the noise types and SNR levels are matched between the training and testing stages, notable degradation is observed under the mismatched conditions (cf. the gray lines on E1 to E4). Further, both SERIL and the direct fine-tuning approach effectively adapt the SE model to each target domain and achieve good performance. For example, in Figure 4(b), M1 achieves the best performance on E1 for both SERIL and the direct fine-tuning approach. However, the model trained by direct fine-tuning tends to forget the previously learned SE capability, whereas the model trained by SERIL maintains good SE performance for previously learned noise types. For instance, in Figure 4(b), the performance of M4 trained by direct fine-tuning is considerably reduced on E1, showing that the adapted model has "forgotten" the SE capability for the previously learned noise type. This is because each noise type has different structural characteristics in different frequency bands, so direct fine-tuning without proper constraints can severely distort the modeling of previous noise environments. In contrast, the performance drop of the SERIL system in the same training-testing case is relatively minor. Consistent trends can be observed for all testing sets.

Table 1 shows the $\mathrm{SDR_{STSA}}$, PESQ, STOI, and eSTOI scores of the final model (M4) learned using the fine-tuning method and SERIL on the five testing sets. The scores of unprocessed noisy speech and the baseline model without adaptation (M0) are also listed for comparison. Several observations can be drawn from the table. First, SERIL performs as well as direct fine-tuning in the current noise environment in terms of all metrics (cf. the "clap" column in Table 1). Second, SERIL always outperforms direct fine-tuning on previous environments in terms of all metrics (cf. the "original" to "footsteps" columns in Table 1).

Table 1: SDR_STSA, PESQ, STOI, and eSTOI scores of model M4 trained by the fine-tuning method (F) and SERIL (R). The results of unprocessed noisy speech (N) and the baseline model M0 without adaptation (P) are listed for comparison.

Metric      M   original   cough   door moving   footsteps   clap
SDR_STSA    N     6.23      6.43      6.87         6.05       6.31
            P    11.75      7.17      7.75         7.74       7.03
            F     6.99      8.39      8.72         8.27      13.05
            R     9.31     10.15     10.97        10.07      13.11
PESQ        N     2.266     2.041     1.864        1.868      1.474
            P     2.708     2.118     2.059        2.015      1.603
            F     2.406     2.204     2.339        2.133      2.948
            R     2.461     2.375     2.581        2.381      2.936
STOI        N     0.816     0.788     0.743        0.778      0.789
            P     0.869     0.798     0.779        0.799      0.801
            F     0.811     0.816     0.825        0.829      0.923
            R     0.826     0.839     0.859        0.855      0.931
eSTOI       N     0.624     0.692     0.648        0.744      0.782
            P     0.721     0.695     0.661        0.745      0.788
            F     0.638     0.698     0.687        0.745      0.853
            R     0.664     0.717     0.731        0.763      0.853

Third, SERIL performs better than the baseline model in all testing environments except for "original", which is a matched training-testing condition for the baseline model. It is worth noting that, compared with the direct fine-tuning approach, SERIL requires only a small amount of additional computation and storage to set the constraints when performing model adaptation. Nevertheless, SERIL produces performance comparable to the direct fine-tuning approach in each new environment while overcoming the catastrophic forgetting problem in old environments.

5. Concluding Remarks

When deploying an SE system in real-world applications, it is common to encounter a new noisy environment and to revisit previous noisy environments. Although the direct fine-tuning approach can effectively adapt SE models to new environments, the adapted SE model may suffer from the catastrophic forgetting problem. The proposed SERIL model not only yields comparable performance to the direct fine-tuning approach but also effectively overcomes the catastrophic forgetting problem. To the best of our knowledge, this paper is the first work that incorporates incremental learning into SE tasks. Our experimental results confirmed the effectiveness of the proposed SERIL system for SE model adaptation and for avoiding catastrophic forgetting. Based on the promising results, we believe that the proposed SERIL model can be used in various edge-computing devices, where the acoustic condition changes frequently and the cost of online retraining is high. In addition, we note that using an appropriate weight, $\beta$, to combine the curvature-based and path optimization-based strategies can provide better SE performance in most tasks. The derivation of an algorithm that can automatically determine the optimal $\beta$ is worthy of further study.


6. References

[1] P. C. Loizou, Speech Enhancement: Theory and Practice, 2nd ed. USA: CRC Press, Inc., 2013.
[2] K.-Y. Chen, S.-H. Liu, B. Chen, H.-M. Wang, and H.-H. Chen, "Exploring the use of unsupervised query modeling techniques for speech recognition and summarization," Speech Communication, vol. 80, pp. 49–59, 2016.
[3] A. Triantafyllopoulos, G. Keren, J. Wagner, I. Steiner, and B. W. Schuller, "Towards robust speech emotion recognition using deep residual networks for speech enhancement," in Proc. Interspeech, 2019.
[4] K. Tan, X. Zhang, and D. Wang, "Real-time speech enhancement using an efficient convolutional recurrent network for dual-microphone mobile phones in close-talk scenarios," in Proc. ICASSP, 2019.
[5] C.-H. Lee, K.-L. Chen, F. Harris, B. D. Rao, and H. Garudadri, "On mitigating acoustic feedback in hearing aids with frequency warping by all-pass networks," in Proc. Interspeech, 2019.
[6] Y. Lai, F. Chen, S. Wang, X. Lu, Y. Tsao, and C. Lee, "A deep denoising autoencoder approach to improving the intelligibility of vocoded speech in cochlear implant simulation," IEEE Transactions on Biomedical Engineering, vol. 64, pp. 1568–1578, 2017.
[7] M. Kolbæk, Z. Tan, and J. Jensen, "Speech intelligibility potential of general and specialized deep neural network based speech enhancement systems," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, pp. 153–167, 2017.
[8] Z. Chen, S. Watanabe, H. Erdogan, and J. R. Hershey, "Speech enhancement and recognition using multi-task learning of long short-term memory recurrent neural networks," in Proc. Interspeech, 2015.
[9] B. Xia and C. Bao, "Wiener filtering based speech enhancement with weighted denoising auto-encoder and noise classification," Speech Communication, vol. 60, pp. 13–29, 2014.
[10] K. Qian, Y. Zhang, S. Chang, X. Yang, D. Florêncio, and M. Hasegawa-Johnson, "Speech enhancement using Bayesian WaveNet," in Proc. Interspeech, 2017.
[11] J. Qi, J. Du, S. M. Siniscalchi, and C. Lee, "A theory on deep neural network based vector-to-vector regression with an illustration of its expressive power in speech enhancement," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, 2019.
[12] S. Wang, W. Li, S. M. Siniscalchi, and C. Lee, "A cross-task transfer learning approach to adapting deep speech enhancement models to unseen background noise using paired senone classifiers," in Proc. ICASSP, 2020.
[13] X. Lu, Y. Tsao, S. Matsuda, and C. Hori, "Speech enhancement based on deep denoising autoencoder," in Proc. Interspeech, 2013.
[14] Y. Xu, J. Du, L. Dai, and C. Lee, "A regression approach to speech enhancement based on deep neural networks," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 23, pp. 7–19, 2015.
[15] Y.-C. Lin, Y.-T. Hsu, S.-W. Fu, Y. Tsao, and T.-W. Kuo, "IA-NET: Acceleration and compression of speech enhancement using integer-adder deep neural network," in Proc. Interspeech, 2019.
[16] S. Fu, C. Liao, Y. Tsao, and S. Lin, "MetricGAN: Generative adversarial networks based black-box metric scores optimization for speech enhancement," in Proc. ICML, 2019.
[17] Q. Wang, J. Du, L. Chai, L.-R. Dai, and C.-H. Lee, "A maximum likelihood approach to masking-based speech enhancement using deep neural network," in Proc. ISCSLP, 2018.
[18] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky, "Domain-adversarial training of neural networks," Journal of Machine Learning Research, vol. 17, pp. 2096–2030, 2016.
[19] C.-F. Liao, Y. Tsao, H.-Y. Lee, and H.-M. Wang, "Noise adaptive speech enhancement using domain adversarial training," in Proc. Interspeech, 2019.
[20] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson, "How transferable are features in deep neural networks?" in Proc. NeurIPS, 2014.
[21] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell, "DeCAF: A deep convolutional activation feature for generic visual recognition," in Proc. ICML, 2014.
[22] M. C. Choy, D. Srinivasan, and R. L. Cheu, "Neural networks for continuous online learning and control," IEEE Transactions on Neural Networks, vol. 17, pp. 1511–1531, 2006.
[23] I. J. Goodfellow, M. Mirza, D. Xiao, A. Courville, and Y. Bengio, "An empirical investigation of catastrophic forgetting in gradient-based neural networks," in Proc. ICLR, 2014.
[24] F. Zenke, B. Poole, and S. Ganguli, "Continual learning through synaptic intelligence," in Proc. ICML, 2017.
[25] J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, D. Hassabis, C. Clopath, D. Kumaran, and R. Hadsell, "Overcoming catastrophic forgetting in neural networks," Proceedings of the National Academy of Sciences, pp. 3521–3526, 2017.
[26] C. Veaux, J. Yamagishi, and S. King, "The voice bank corpus: Design, collection and data analysis of a large regional accent speech database," in Proc. O-COCOSDA/CASLRE, 2013.
[27] J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, D. S. Pallett, and N. L. Dahlgren, "DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM. NIST speech disc 1-1.1," NASA STI/Recon Technical Report N, vol. 93, p. 27403, 1993.
[28] M. Kolbæk, Z. Tan, S. H. Jensen, and J. Jensen, "On loss functions for supervised monaural time-domain speech enhancement," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 825–838, 2020.
[29] E. Vincent, R. Gribonval, and C. Févotte, "Performance measurement in blind audio source separation," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 14, pp. 1462–1469, 2006.
[30] J. Le Roux, S. Wisdom, H. Erdogan, and J. R. Hershey, "SDR – half-baked or well done?" in Proc. ICASSP, 2019.
[31] F. de Parseval, Les Parseval et leurs alliances pendant trois siècles (1594-1900): Par Frédéric de Parseval. Généalogies et Souvenirs de Famille. J. Castanet, 1901.
[32] J. Schwarz, W. Czarnecki, J. Luketina, A. Grabska-Barwinska, Y. W. Teh, R. Pascanu, and R. Hadsell, "Progress & compress: A scalable framework for continual learning," in Proc. ICML, 2018.
[33] S. Kolouri, N. A. Ketz, A. Soltoggio, and P. K. Pilly, "Sliced Cramér synaptic consolidation for preserving deeply learned representations," in Proc. ICLR, 2020.
[34] A. Chaudhry, P. K. Dokania, T. Ajanthan, and P. H. Torr, "Riemannian walk for incremental learning: Understanding forgetting and intransigence," in Proc. ECCV, 2018.
[35] A. Varga and H. J. M. Steeneken, "Assessment for automatic speech recognition: II. NOISEX-92: A database and an experiment to study the effect of additive noise on speech recognition systems," Speech Communication, vol. 12, pp. 247–251, 1993.
[36] G. Hu and D. Wang, "A tandem algorithm for pitch estimation and voiced speech segregation," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 18, pp. 2067–2079, 2010.
[37] A. W. Rix, J. G. Beerends, M. P. Hollier, and A. P. Hekstra, "Perceptual evaluation of speech quality (PESQ) - a new method for speech quality assessment of telephone networks and codecs," in Proc. ICASSP, 2001.
[38] C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen, "A short-time objective intelligibility measure for time-frequency weighted noisy speech," in Proc. ICASSP, 2010.
[39] J. Jensen and C. H. Taal, "An algorithm for predicting the intelligibility of speech masked by modulated noise maskers," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, pp. 2009–2022, 2016.
