國立政治大學 (National Chengchi University)
3.3.5 Ensemble HELM for Speech Dereverberation
We now present the proposed ensemble HELM framework for speech dereverberation. Fig. 3.3 shows both the offline and online stages of the framework. In the offline stage, multiple HELM component models are trained independently, each learning the spectral mapping function for one reverberation condition. Subsequently, a fusion model is estimated to combine the outputs of these models and generate the final anechoic speech. In our case, four HELM-based component models are trained for four specific reverberation conditions (i.e., RT60 ∈ {0.3, 0.6, 0.9, 1.2} s), denoted as HELM_0.3, HELM_0.6, HELM_0.9, and HELM_1.2, respectively. In Fig. 3.3, these four models are drawn with different colors and border styles in the offline stage (and in the corresponding online stage). The outputs of the four component models are denoted as Y_0.3, Y_0.6, Y_0.9, and Y_1.2, respectively. We then combine these outputs to form an integrated vector X_I = {Y_0.3, Y_0.6, Y_0.9, Y_1.2}. The fusion model, HELM_I, learns a mapping function that transforms the integrated vector X_I into the anechoic speech vector Y.
In the online stage, the test utterances are first converted into LPS features and phase parts. The reverberated LPS features are processed by each component model. The outputs of all the component models are then integrated and processed by the fusion model, HELM_I, to produce the estimated anechoic speech. The phase of the original reverberated utterances is used along with the ISTFT and overlap-add operations to reconstruct the waveform of the dereverberated speech utterances. In the following discussion, we refer to the ensemble framework built on the HELM as eHELM. In addition to the HELM, we also use HELM(Hwy) and HELM(Res), as shown in Fig. 3.3, to form ensemble frameworks, which are termed eHELM(Hwy) and eHELM(Res), respectively.
Figure 3.3: Offline and online stages of the ensemble HELM (eHELM) dereverberation framework.
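The two-stage data flow of Section 3.3.5 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation used in this work: each HELM is replaced by a hypothetical single-hidden-layer ELM (`SimpleELM`, with random fixed input weights and a ridge-regression readout), the fusion model is assumed to be trained on the component outputs for all training frames, and the data are random stand-ins for 129-dimensional LPS frames. All names and hyperparameters (`train_ensemble`, `dereverberate`, `reg`, `n_hidden`) are hypothetical.

```python
import numpy as np


class SimpleELM:
    """Hypothetical stand-in for a HELM: random hidden layer + ridge readout."""

    def __init__(self, n_in, n_hidden, n_out, reg=1e-2, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.standard_normal((n_in, n_hidden)) / np.sqrt(n_in)
        self.b = rng.standard_normal(n_hidden)
        self.beta = np.zeros((n_hidden, n_out))
        self.reg = reg

    def _hidden(self, X):
        # Sigmoid activation of the fixed random projection.
        return 1.0 / (1.0 + np.exp(-(X @ self.W + self.b)))

    def fit(self, X, Y):
        # Closed-form ridge regression for the output weights.
        H = self._hidden(X)
        A = H.T @ H + self.reg * np.eye(H.shape[1])
        self.beta = np.linalg.solve(A, H.T @ Y)
        return self

    def predict(self, X):
        return self._hidden(X) @ self.beta


def train_ensemble(data_by_rt60, n_hidden=64):
    """Offline stage: one component model per RT60, then a fusion model."""
    rt60s = sorted(data_by_rt60)
    dim = data_by_rt60[rt60s[0]][0].shape[1]
    components = {
        rt: SimpleELM(dim, n_hidden, dim, seed=i).fit(*data_by_rt60[rt])
        for i, rt in enumerate(rt60s)
    }
    # Assumption: the fusion model sees every training frame passed
    # through all component models, concatenated into X_I.
    X_all = np.vstack([data_by_rt60[rt][0] for rt in rt60s])
    Y_all = np.vstack([data_by_rt60[rt][1] for rt in rt60s])
    X_I = np.hstack([components[rt].predict(X_all) for rt in rt60s])
    fusion = SimpleELM(X_I.shape[1], n_hidden, dim, seed=99).fit(X_I, Y_all)
    return components, fusion


def dereverberate(X, components, fusion):
    """Online stage: run all components, concatenate, apply the fusion model."""
    X_I = np.hstack([components[rt].predict(X) for rt in sorted(components)])
    return fusion.predict(X_I)


# Toy data: 50 random "LPS" frames (129-dim) per RT60 condition.
rng = np.random.default_rng(1)
toy = {rt: (rng.standard_normal((50, 129)), rng.standard_normal((50, 129)))
       for rt in (0.3, 0.6, 0.9, 1.2)}
components, fusion = train_ensemble(toy)
Y_hat = dereverberate(rng.standard_normal((10, 129)), components, fusion)
print(Y_hat.shape)  # (10, 129)
```

The integrated vector has 4 × 129 = 516 dimensions in this sketch, so the fusion model maps a 516-dimensional input back to one 129-dimensional frame.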
3.4 Experiments
3.4.1 Experimental Setup
Description of the TIMIT Corpus
The TIMIT [121] corpus was used to evaluate the performance of the proposed HELM solutions. We selected 300 utterances as the training data and 100 utterances as the testing data, ensuring that no overlap occurred between the training and testing speakers. While designing the reverberated data, we also made sure that neither the speakers nor the speech content overlapped between the training and testing sentences. Four room conditions were simulated to generate different acoustic characteristics: room 1 measured 12×4×6 m, room 2 14×10×8 m, room 3 18×14×8 m, and room 4 20×20×20 m. The microphone positions for these four rooms were 2×2×1.6 m, 2×2×1.8 m, 2×2×2 m, and 6×2×2.2 m, respectively. We designed two training sets: (1) in training set 1, the 300 training utterances were convolved with a single RIR (1RIR) for each of four RT60 values (i.e., RT60 ∈ {0.3, 0.6, 0.9, 1.2} s) to generate 300×4(RT60)×1(RIR) = 1200 reverberated training utterances (1.3 hours of reverberated training data); (2) in training set 2, we simulated more reverberation conditions by considering three RIRs (3RIRs) for each RT60 to generate 300×4(RT60)×3(RIRs) = 3600 reverberated training utterances (4 hours of reverberated training data).
The aforementioned 100 testing utterances were used to prepare two evaluation sets: a matched condition and a mismatched condition. In the matched condition, four rooms were simulated with the same RT60 values as in the training set, i.e., RT60 ∈ {0.3, 0.6, 0.9, 1.2} s, but with different room dimensions: 10×4×6 m, 12×14×6 m, 16×16×8 m, and 22×20×12 m, respectively. The microphone positions also differed from those of the training set; they were 2×2×1.6 m, 3×2×2 m, 4×3×2.2 m, and 5×2×2.5 m, respectively. In the mismatched condition, three rooms of dimensions 10×4×6 m, 14×14×6 m, and 18×16×8 m were simulated with RT60 values of 0.4, 0.8, and 1.0 s, respectively. The microphones for the mismatched testing data were placed at the same positions as the first three of the matched testing data, namely, 2×2×1.6 m, 3×2×2 m, and 4×3×2.2 m, respectively.
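As a rough illustration of how a reverberated utterance is obtained by convolving anechoic speech with an RIR, the sketch below uses a synthetic RIR modeled as exponentially decaying white noise whose level drops by 60 dB after RT60 seconds. This is a common simplification and an assumption made here for brevity; it is not the room simulation used to build the data sets above, and it ignores room dimensions and microphone positions. The 16 kHz sampling rate and the names `synthetic_rir` and `reverberate` are likewise assumptions.

```python
import numpy as np


def synthetic_rir(rt60, fs=16000, seed=0):
    """Decaying-noise RIR whose amplitude falls by 60 dB after rt60 seconds."""
    rng = np.random.default_rng(seed)
    n = int(round(rt60 * fs))
    t = np.arange(n) / fs
    # exp(-k * rt60) = 10**(-3)  (i.e. -60 dB)  =>  k = 3 * ln(10) / rt60
    envelope = np.exp(-3.0 * np.log(10.0) * t / rt60)
    rir = rng.standard_normal(n) * envelope
    return rir / np.max(np.abs(rir))


def reverberate(clean, rir):
    """Convolve the anechoic signal with the RIR, trimmed to the input length."""
    return np.convolve(clean, rir)[: len(clean)]


fs = 16000
# 0.25 s of noise as a stand-in for an anechoic utterance.
clean = np.random.default_rng(1).standard_normal(fs // 4)
training_rt60s = (0.3, 0.6, 0.9, 1.2)
reverberated = {rt: reverberate(clean, synthetic_rir(rt, fs))
                for rt in training_rt60s}

# Size of training set 1: 300 utterances x 4 RT60 values x 1 RIR.
n_pairs = 300 * len(training_rt60s) * 1
print(n_pairs)  # 1200
```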
Evaluation Metrics
We evaluated our approach using six objective metrics. The first four are PESQ [127], STOI [114], FwSSNR [128], and SRMR [129]. The PESQ score measures the speech quality of the dereverberated speech and ranges between −0.5 and 4.5; the higher the PESQ score, the better the speech quality. The STOI computes the speech intelligibility based on the correlation between the temporal envelopes of the dereverberated and anechoic speech over short-time segments. The STOI score ranges between 0 and 1, with a higher score indicating better speech intelligibility. The FwSSNR (frequency-weighted segmental SNR) measures the segmental signal-to-noise ratio between the dereverberated and anechoic speech, weighted across frequency bands by the articulation index. The SRMR is a non-intrusive quality measurement of the reverberated and dereverberated speech. Higher FwSSNR and SRMR scores denote that the dereverberated speech is closer to the anechoic speech. The remaining two measures, the cepstrum distance (Cep) and the log-likelihood ratio (LLR) [128], also estimate the quality of the dereverberated speech signal. Cep estimates the spectral distance between the enhanced and clean reference speech signals, whilst LLR computes the ratio of the discrepancy between them. Smaller Cep and LLR values denote less distortion and better speech quality. The SRMR, Cep, and LLR metrics are provided by the REVERB challenge, which was designed specifically for the dereverberation task [123]. All the evaluation metrics except SRMR are obtained by comparing the estimated speech with the corresponding reference speech; the SRMR is obtained by computing the speech-to-reverberation modulation energy ratio of the estimated speech signal directly.
In this chapter, speech signals were processed using a moving window with a frame size of 16 ms and a frame shift of 8 ms. Subsequently, 129-dimensional LPS features were calculated for each speech frame.
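The framing above can be illustrated with a short NumPy sketch. It assumes a 16 kHz sampling rate, so that 16 ms corresponds to 256 samples and 8 ms to 128 samples, and a 256-point FFT, which yields 256/2 + 1 = 129 frequency bins per frame. The Hamming window, the small constant `eps`, and the function names are assumptions for illustration. The `reconstruct` function shows the inverse path (LPS plus the reverberated phase, ISTFT, overlap-add) in its simplest form.

```python
import numpy as np


def stft_frames(x, frame_len=256, hop=128):
    """Windowed frames -> one-sided spectrum of shape (n_frames, 129)."""
    window = np.hamming(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop: i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)


def lps_and_phase(x, frame_len=256, hop=128, eps=1e-12):
    """Log-power spectrum (LPS) features and the phase of each frame."""
    spec = stft_frames(x, frame_len, hop)
    return np.log(np.abs(spec) ** 2 + eps), np.angle(spec)


def reconstruct(lps, phase, frame_len=256, hop=128):
    """LPS + phase -> spectrum -> ISTFT with weighted overlap-add."""
    spec = np.sqrt(np.exp(lps)) * np.exp(1j * phase)
    frames = np.fft.irfft(spec, n=frame_len, axis=1)
    window = np.hamming(frame_len)
    out = np.zeros(hop * (len(frames) - 1) + frame_len)
    norm = np.zeros_like(out)
    for i, frame in enumerate(frames):
        out[i * hop: i * hop + frame_len] += frame * window
        norm[i * hop: i * hop + frame_len] += window ** 2
    return out / np.maximum(norm, 1e-8)


x = np.random.default_rng(0).standard_normal(16000)  # 1 s stand-in signal
lps, phase = lps_and_phase(x)
print(lps.shape)  # (124, 129)
x_rec = reconstruct(lps, phase)
```

In a dereverberation system, `lps` would be replaced by the model output before calling `reconstruct`, while `phase` is taken from the reverberated input.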
3.4.2 Experimental Results
We first assessed the performance of the proposed frameworks using training set 1, namely 1200 reverberated-anechoic utterance pairs (the training set obtained using only a single RIR). Subsequently, we extended the experiments by considering a relatively large training set (training set 2), where 3600 reverberated-anechoic utterance pairs were employed to train the proposed models (training data generated with three RIRs).

Table 3.1: Average PESQ scores of HELM, HELM(Hwy), HELM(Res), and Reverb speech under specific reverberated conditions.

                     Testing RT60 (s)
ws      Method       0.3      0.6      0.9      1.2      Average
-       Reverb       2.7167   2.1194   1.6155   1.4342   1.9714
ws = 0  HELM         2.7901   2.2743   1.8365   1.7001   2.1503
        HELM(Hwy)    2.7968   2.2762   1.8403   1.6979   2.1528
        HELM(Res)    2.8137   2.2747   1.8395   1.7018   2.1574
HELM(Hwy) and HELM(Res)
We first compare the performance of HELM, HELM(Hwy), and HELM(Res) against the unprocessed reverberated speech signals (denoted as Reverb). For an impartial comparison, all HELM models were trained using the entire training data covering all reverberation conditions (i.e., RT60 ∈ {0.3, 0.6, 0.9, 1.2} s), i.e., 1200 reverberated-anechoic utterance pairs, and tested using the matched testing set. In this set of experiments, we used the same regularization parameters for the HELM as reported in [94]. Table 3.1 lists the PESQ results of the conventional HELM and of the proposed HELM(Hwy) and HELM(Res) approaches. All HELM configurations comprised three hidden nonlinear layers containing 1000, 1000, and 4000 neurons ([1000 1000 4000]), respectively. The sigmoid activation function was employed for the three HELM methods. Furthermore, no contextual information was used; that is, only the current speech frame was fed to the HELM input layer, and neighboring frames were not taken into account during training or testing. This is equivalent to setting the context input window size (ws) to zero. The last column in Table 3.1 shows the average PESQ scores over all reverberation conditions. The highest PESQ score for each particular reverberation condition has been highlighted in boldface.
The experimental results in Table 3.1 demonstrate that the conventional HELM notably outperforms Reverb (reverberated speech signals). The improvement was larger for RT60 ≥ 0.6 s, i.e., under more severe reverberation conditions. Furthermore, the HELM(Hwy) and HELM(Res) models achieved slightly better average PESQ results than the conventional HELM, confirming the effectiveness of incorporating low-level information into the spectral mapping stage.
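The role of the context window size ws can be made concrete with a small sketch. With ws = 0 the model sees only the current 129-dimensional frame; with ws > 0 each frame is concatenated with its ws left and ws right neighbors, giving (2×ws+1)×129 input dimensions. The function name `add_context` and the edge handling (repeating the first and last frames) are assumptions made for this illustration.

```python
import numpy as np


def add_context(frames, ws):
    """Stack each frame with its ws left and ws right neighbours.

    frames: (n_frames, dim) -> (n_frames, (2 * ws + 1) * dim).
    Assumption: edge frames are padded by repeating the first/last frame.
    """
    if ws == 0:
        return frames.copy()
    padded = np.pad(frames, ((ws, ws), (0, 0)), mode="edge")
    n = len(frames)
    # Block k holds the frame at offset (k - ws) relative to the current one.
    return np.hstack([padded[k:k + n] for k in range(2 * ws + 1)])


lps = np.arange(5 * 129, dtype=float).reshape(5, 129)  # 5 toy frames
for ws in (0, 3, 5):
    print(ws, add_context(lps, ws).shape)
# 0 (5, 129)
# 3 (5, 903)
# 5 (5, 1419)
```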
Context Analysis
In this section, we investigate how the dereverberation performance depends on the amount of context information, i.e., the number of input frames (2×ws+1), by varying the context from 1 frame (ws = 0) to 11 frames (ws = 5). Table 3.2 presents the PESQ scores delivered by HELM, HELM(Hwy), and HELM(Res) with different ws values. The same architecture, i.e., [1000 1000 4000], was used for these three HELM models. From Table 3.2, we first note that under mild reverberation conditions (i.e., RT60 = 0.3 and 0.6 s), less context information can yield more effective dereverberation results for all three HELM models. On the other hand, under more severe reverberation conditions (i.e., RT60 = 0.9 and 1.2 s), more context information provides higher PESQ scores. The results further show that HELM(Res) consistently outperforms HELM and HELM(Hwy) in terms of average PESQ for every ws value. HELM(Res) thus achieves more effective dereverberation even though both approaches (HELM(Hwy) and HELM(Res)) share the same underpinnings, i.e., training very deep neural architectures by passing low-level information to higher levels.

Table 3.2: Average PESQ scores of HELM, HELM(Hwy), and HELM(Res) with different context information.

                     Testing RT60 (s)
ws      Method       0.3      0.6      0.9      1.2      Avg.
-       Reverb       2.7167   2.1194   1.6155   1.4342   1.9714
ws = 0  HELM         2.7901   2.2743   1.8365   1.7001   2.1503
        HELM(Hwy)    2.7968   2.2762   1.8403   1.6979   2.1528
        HELM(Res)    2.8137   2.2747   1.8395   1.7018   2.1574
ws = 1  HELM         2.6607   2.3062   1.8943   1.7472   2.1521
        HELM(Hwy)    2.6697   2.3107   1.8973   1.7487   2.1566
        HELM(Res)    2.7272   2.3059   1.9031   1.7461   2.1705
ws = 2  HELM         2.6100   2.3081   1.9304   1.7985   2.1618
        HELM(Hwy)    2.6574   2.3332   1.9404   1.8027   2.1834
        HELM(Res)    2.7186   2.3586   1.9565   1.8037   2.2093
ws = 3  HELM         2.5944   2.3308   1.9821   1.8454   2.1882
        HELM(Hwy)    2.6431   2.3314   1.9837   1.8448   2.2007
        HELM(Res)    2.7804   2.4101   2.0234   1.8657   2.2699
ws = 4  HELM         2.5691   2.2978   2.0038   1.8749   2.1864
        HELM(Hwy)    2.5705   2.3163   2.0083   1.8765   2.1929
        HELM(Res)    2.6684   2.3658   2.0317   1.8846   2.2376
ws = 5  HELM         2.5396   2.2880   2.0341   1.8944   2.1890
        HELM(Hwy)    2.5704   2.3447   2.0478   1.9017   2.2161
        HELM(Res)    2.6679   2.3717   2.0787   1.9113   2.2574

To quantify the statistical significance of the proposed frameworks, we employed a two-sample t-test for each test reverberation condition (i.e., RT60 ∈ {0.3, 0.6, 0.9, 1.2} s) using ws = 0 and ws = 3, the latter being the setting for which we obtained the best average performance. The significance test was applied to examine whether the improvement in performance was due to some random effect. For the null hypothesis, we assumed that the means of the two frameworks (i.e., µ_HELM(Hwy) and µ_HELM(Res)) did not differ from that of the original HELM (µ_HELM). For ws = 0, only the HELM vs. HELM(Res) comparison under the RT60 = 0.3 s condition rejected the null hypothesis, with a p-value of 0.02 (≤ 0.05). Nonetheless, for ws = 3, HELM(Res) was significantly better than HELM under every reverberation condition (RT60 ∈ {0.3, 0.6, 0.9, 1.2} s), with very small p-values, and it also showed better capabilities than HELM(Hwy).

Residual networks reformulate the desired transformation with respect to a reference input layer using identity shortcuts that are parameter-free, which facilitates learning, whereas the highway networks have parameters [130] that may cause overfitting when the amount of training data is small, resulting in poorer performance than HELM(Res). Our findings agree with those reported in [130], where residual networks yielded significant performance improvements over highway networks. We can also note that, among all of the context settings, ws = 3 gave the best average PESQ for HELM(Res), the strongest of the three models, and near-best averages for HELM and HELM(Hwy). Therefore, we report the results of using ws = 3 in the following discussion.
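The difference between the two shortcut types discussed above can be sketched with the generic formulations from the residual and highway network literature. How the shortcuts are wired into the HELM is not shown in this section, so the single-layer blocks below, their dimensions, and the names `residual_block` and `highway_block` are illustrative assumptions only.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def residual_block(x, W, b):
    """y = F(x) + x : the identity shortcut adds no parameters."""
    return sigmoid(x @ W + b) + x


def highway_block(x, W, b, W_T, b_T):
    """y = T(x) * F(x) + (1 - T(x)) * x : the gate T has its own weights."""
    gate = sigmoid(x @ W_T + b_T)
    return gate * sigmoid(x @ W + b) + (1.0 - gate) * x


rng = np.random.default_rng(0)
dim = 8
x = rng.standard_normal((4, dim))
W, b = rng.standard_normal((dim, dim)), np.zeros(dim)
W_T, b_T = rng.standard_normal((dim, dim)), np.full(dim, -2.0)

y_res = residual_block(x, W, b)
y_hwy = highway_block(x, W, b, W_T, b_T)

# The highway gate doubles the parameter count of this toy layer.
n_params_residual = W.size + b.size
n_params_highway = n_params_residual + W_T.size + b_T.size
print(n_params_residual, n_params_highway)  # 72 144
```

The extra gate weights are the parameters that the text above identifies as a possible source of overfitting when training data are scarce.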
Ensemble HELM
In this section, we present our results concerning the ensemble HELM frameworks.
Training set 1, namely, 1200 reverberated-anechoic utterance pairs, was employed in this set of experiments. In the offline stage, we used acoustic knowledge to split the entire dataset into subsets (thus denoted as the knowledge-based approach, KB). Each subset of data was used to train one dereverberation component model that characterizes the mapping function from a specific reverberated condition to the clean condition, as described in Section 3.3.5. Here, four component models were trained, corresponding to the four reverberation conditions (i.e., RT60 ∈ {0.3, 0.6, 0.9, 1.2} s). Subsequently, a fusion model was trained to combine the outputs of the four component models and generate the final dereverberated speech signal that matches the reference anechoic one. In the online stage, the input speech was independently processed by each of the four component models. The fusion model then integrated the outputs of the four models to obtain the dereverberated speech.
To confirm the effectiveness of the KB scheme, we designed a comparative data clustering scheme that divided the training data into subsets in a random-sampling (RS) manner, where no knowledge of the environment characteristics was involved. Based on the RS scheme, four subsets of training data were prepared, each containing 300 reverberated-anechoic utterance pairs randomly sampled from the entire set of 1200 pairs. Each subset was used to train a dereverberation model. Once the four models were trained, a fusion model was estimated. In the online stage, the incoming test utterance was processed by the four component models, and the fusion model integrated the outputs of these four models to generate the final dereverberated speech.
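The two clustering schemes differ only in how the 1200 training pairs are assigned to the four component models, which the following sketch illustrates. Pairs are represented by hypothetical `(utterance_id, rt60)` tuples, and the RS scheme is assumed here to produce four disjoint subsets; the text above does not state whether the random subsets overlap.

```python
import random
from collections import defaultdict

RT60S = (0.3, 0.6, 0.9, 1.2)
# 300 utterances x 4 RT60 values = 1200 reverberated-anechoic pairs.
pairs = [(utt_id, rt60) for rt60 in RT60S for utt_id in range(300)]


def split_kb(pairs):
    """Knowledge-based split: one subset per RT60 condition."""
    subsets = defaultdict(list)
    for pair in pairs:
        subsets[pair[1]].append(pair)
    return [subsets[rt] for rt in sorted(subsets)]


def split_rs(pairs, n_subsets=4, seed=0):
    """Random-sampling split: shuffle, then cut into equal-sized subsets."""
    shuffled = pairs[:]
    random.Random(seed).shuffle(shuffled)
    size = len(shuffled) // n_subsets
    return [shuffled[i * size:(i + 1) * size] for i in range(n_subsets)]


kb_subsets = split_kb(pairs)
rs_subsets = split_rs(pairs)
print([len(s) for s in kb_subsets])  # [300, 300, 300, 300]
print([len(s) for s in rs_subsets])  # [300, 300, 300, 300]
```

Each KB subset contains a single RT60 value, whereas an RS subset generally mixes all four, so an RS component model is not specialized to one reverberation condition.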
Table 3.3: Average PESQ scores of four HELM frameworks with the RS and KB schemes.

                         Testing RT60 (s)
Ensemble  Method         0.3      0.6      0.9      1.2      Avg.
-         Reverb         2.7167   2.1194   1.6155   1.4342   1.9714
RS        eHELM          2.4860   2.2296   1.9494   1.8366   2.1254
          eHELM(Hwy)     2.5207   2.2954   1.9607   1.8472   2.1560
          eHELM(Res)     2.7101   2.3767   2.0331   1.9001   2.2550
KB        eHELM          2.5686   2.3155   1.8827   1.7551   2.1305
          eHELM(Hwy)     2.6473   2.3365   1.8829   1.7717   2.1596
          eHELM(Res)     2.9265   2.4672   1.9471   1.8132   2.2885
We used the original HELM, HELM(Hwy), and HELM(Res) to build the component and fusion models; the corresponding ensemble frameworks were termed eHELM, eHELM(Hwy), and eHELM(Res), respectively. For all of these ensemble HELM models, we employed the same number of hidden layers and neurons for a fair comparison. Table 3.3 presents the average PESQ scores for the three ensemble HELM frameworks with both the RS and KB clustering schemes. From Table 3.3, we first note that all of the ensemble HELM frameworks, with either the KB or the RS clustering scheme, outperformed the Reverb speech by notable margins except for the RT60 = 0.3 s condition, where Reverb had a better PESQ score than every ensemble configuration other than eHELM(Res) with the KB clustering. Next, we observe that eHELM performed the worst, exhibiting the lowest PESQ score at each RT60 (RT60 ∈ {0.3, 0.6, 0.9, 1.2} s) among the three ensemble HELM frameworks. Moreover, in relatively mild reverberation conditions, i.e., RT60 ∈ {0.3, 0.6} s, all ensemble HELM frameworks with the KB clustering scheme achieved better performance than those with the RS clustering. On the other hand, in relatively severe reverberation conditions, i.e., RT60 ∈ {0.9, 1.2} s, all ensemble HELM frameworks with the RS clustering scheme outperformed their KB counterparts. For both the RS and KB clustering schemes, eHELM(Res) performed best, yielding consistent improvements at each RT60 compared to the other frameworks. In terms of average PESQ scores, the KB clustering scheme yielded higher scores than the RS clustering scheme. Therefore, in the following discussion, we only report the results of the ensemble HELM frameworks adopting the KB clustering scheme.
Ensemble HELM vs. Existing Approaches
In this subsection, we compare the proposed ensemble HELM frameworks with conventional dereverberation approaches. In this set of experiments, we employed training set 2 (3600 reverberated-anechoic utterance pairs, as described in Section 3.4.1) to train the three ensemble HELM frameworks, namely eHELM, eHELM(Hwy), and eHELM(Res). Two conventional dereverberation approaches were implemented for comparison. The first is the Wu-Wang approach, a two-stage speech dereverberation system that adopts inverse filtering and spectral subtraction to handle early and late reverberations [131]. The second is a recently proposed coherent-to-diffuse power ratio (CDR) estimation method [132]. In this approach, the CDRs between two omnidirectional microphones are estimated for dereverberation using several known CDR estimators. In our experiments, the estimator with unknown direction of arrival (DOA) and unknown noise coherence was adopted for comparison.
In addition to the conventional approaches, we compared the ensemble frameworks against a recently proposed neural-based integrated deep and ensemble learning algorithm (IDEA) [124], which uses DDAE models as the component models and a CNN as the fusion model. To match HELM(Hwy) and HELM(Res), we also adopted the highway DDAE and residual DDAE as component models, again with a CNN as the fusion model; the three resulting systems are termed IDEA, IDEA(Hwy), and IDEA(Res), respectively. For the three IDEA systems, we followed the same model architectures as those used in [124] because the total number of training samples in our experiment was comparable to that in [124]. Moreover, preliminary experiments confirmed that the IDEA architectures achieved very good performance on the TIMIT dataset. Therefore, we decided to use the best architecture of IDEA in [124] as a comparative dereverberation system. Each DDAE model in the above-mentioned IDEA systems consisted of three hidden layers with 2048 neurons each; the CNN fusion model consisted of three hidden layers, i.e., two convolutional layers with 32 channels each and a fully connected layer with 2048 nodes. The learning rate for the ensemble learning models was set to 0.0002, the same as that used in [108], with a mini-batch size of 128. The number of epochs was set to 100. The results of Reverb, the Wu-Wang and CDR systems, the three IDEA systems, and the three ensemble HELM systems are reported in Table 3.4.

Table 3.4: Average PESQ scores of ensemble HELM and IDEA frameworks in the matched testing conditions.

               Testing RT60 (s)
Method         0.3      0.6      0.9      1.2      Avg.
Reverb         2.7167   2.1194   1.6155   1.4342   1.9714
Wu-Wang        2.5450   2.1511   1.8054   1.6933   2.0487
CDR            2.7507   2.1749   1.8190   1.6981   2.1106
IDEA           2.3712   2.1638   1.8196   1.7047   2.0148
IDEA(Hwy)      2.6314   2.1932   1.8331   1.7355   2.0982
IDEA(Res)      2.7260   2.2277   1.8475   1.7357   2.1342
eHELM          2.4899   2.2724   1.9129   1.7799   2.1137
eHELM(Hwy)     2.6010   2.3674   1.9484   1.7961   2.1782
eHELM(Res)     2.8531   2.4962   1.9943   1.8278   2.2936

From Table 3.4, the proposed ensemble HELM frameworks, i.e., eHELM, eHELM(Hwy), and eHELM(Res), notably outperformed both the Reverb and Wu-Wang approaches in terms of average PESQ. Among the three eHELM systems, eHELM(Res) yielded the best performance, confirming the effectiveness of the residual architecture. eHELM(Res) also outperformed all of the IDEA systems consistently over the different reverberation conditions.
Ensemble HELMs with More Complex Architectures
By comparing the results in Tables 3.3 and 3.4, we note that the average PESQ scores of eHELM(Hwy) and eHELM(Res) improved when we increased the number of training utterance pairs from 1200 to 3600. That motivated us to increase the complexity of the component models in the ensemble HELM frameworks and verify whether further improvements could be attained. For all of the HELM models, a more complex architecture of [1000 1000 10000] was used because such a setup gave better results in our previous speech enhancement experiments [41]; the corresponding ensemble HELM models were termed eHELMD, eHELMD(Hwy), and eHELMD(Res). For comparison, we also considered DDAE models with deeper structures in the IDEA framework, as used in [124]: each DDAE model had six hidden layers with 2048 neurons each, and the CNN fusion model consisted of three hidden layers, namely two convolutional layers with 32 channels each and a fully connected layer with 2048 nodes. The corresponding IDEA frameworks were termed IDEAD, IDEAD(Hwy), and IDEAD(Res).
We first present the results of these models tested on the matched conditions (namely, the testing conditions with RT60 ∈ {0.3, 0.6, 0.9, 1.2} s). Table 3.5 displays the PESQ performance of the IDEA and the ensemble HELM frameworks. Comparing Tables 3.4 and 3.5, we first note that both the IDEA and the ensemble HELM frameworks performed better with the more complex structures. Moreover, from Table 3.5, IDEAD(Res) and eHELMD(Res) performed the best among the IDEA and ensemble HELM frameworks, respectively, which is consistent with the results reported in Table 3.2 and again confirms the effectiveness of the residual structure.
Table 3.5: Average PESQ scores of ensemble HELM and IDEA frameworks with complex structures in the matched testing conditions.

               Testing RT60 (s)
Method         0.3      0.6      0.9      1.2      Avg.
IDEAD          2.4485   2.1994   1.8379   1.7282   2.0535
IDEAD(Hwy)     2.7598   2.2424   1.8553   1.7339   2.1478
IDEAD(Res)     2.8538   2.2808   1.9010   1.7744   2.2025
eHELMD         2.5379   2.3302   1.9218   1.7842   2.1435
eHELMD(Hwy)    2.6408   2.3755   1.9509   1.8159   2.1957
eHELMD(Res)    2.8902   2.5242   2.0177   1.8496   2.3204
In addition to PESQ scores, we report the average STOI, SRMR, FwSSNR, Cep, and LLR results in Fig. 3.4. Here, we only show the IDEAD(Res) and eHELMD(Res) results, as these models achieved the best performance among the IDEA and ensemble HELM frameworks, respectively, as shown in Table 3.5. The results of the Reverb, Wu-Wang, and CDR approaches are also listed for comparison. The average results presented in Fig. 3.4 were scaled to lie between 0 and 1.

Figure 3.4: Average STOI, SRMR, FwSSNR, Cep, and LLR scores of Reverb, Wu-Wang, CDR, IDEAD(Res), and eHELMD(Res) in the matched testing conditions.

From Fig. 3.4, we observe that eHELMD(Res) outperformed the other approaches by providing better speech intelligibility (higher STOI), with an average score of 0.7288 compared with Reverb (0.6326), Wu-Wang (0.6081), CDR (0.6968), and IDEAD(Res) (0.7035) for the matched testing conditions. Similarly, the proposed eHELMD(Res) framework maintained better reverberation suppression, achieving a high FwSSNR score and low Cep and LLR scores. The figure demonstrates that the conventional approaches, i.e., Wu-Wang and CDR, could attain higher SRMR scores, but they showed the worst performance on the Cep and LLR metrics, which are highly correlated with the quality of the dereverberated speech signals; this indicates an overestimation of the reverberation. The overestimation is caused by a suppression due to