Ensemble HELM for Speech Dereverberation - Application of Hierarchical Extreme Learning Machines to Speech Signal Processing


國立政治大學 (National Chengchi University)

3.3.5 Ensemble HELM for Speech Dereverberation

We now present the proposed ensemble HELM framework for speech dereverberation. Fig. 3.3 shows both the offline and online stages of the framework. In the offline stage, multiple HELM component models are trained individually and independently, each learning the spectral mapping function for one reverberation condition. Subsequently, a fusion model is estimated to combine the outputs of these models and generate the final anechoic speech. In our case, four HELM-based component models are trained for four specific reverberation conditions (i.e., RT60 ∈ {0.3, 0.6, 0.9, 1.2} s); they are denoted as HELM_0.3, HELM_0.6, HELM_0.9, and HELM_1.2, respectively. In Fig. 3.3, these four models are drawn with different colors and border styles in the offline stage (and in the corresponding online stage). The outputs of the four component models are denoted as Y_0.3, Y_0.6, Y_0.9, and Y_1.2, respectively. We then combine these outputs to form an integrated vector X_I, such that X_I = {Y_0.3, Y_0.6, Y_0.9, Y_1.2}. The fusion model, HELM_I, computes a mapping function that transforms the integrated vector X_I to the anechoic speech vector Y.

In the online stage, the test utterances are first converted into LPS features and phase parts. The reverberated LPS features are processed by each component model. The outputs of all component models are then integrated and processed by the fusion model, HELM_I, to produce the anechoic speech signal. The phase of the original reverberated utterances is used along with the ISTFT and overlap-add operations to reconstruct the waveform of the dereverberated speech utterances. In the following discussion, we refer to the ensemble framework that uses the HELM as eHELM. In addition to the HELM, we also use HELM(Hwy) and HELM(Res), as shown in Fig. 3.3, to form ensemble frameworks, which are termed eHELM(Hwy) and eHELM(Res), respectively.
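As a rough illustration of the offline and online stages described above, the following Python sketch trains four component regressors, concatenates their outputs into X_I, and trains a fusion regressor on X_I. It is only a sketch under stated assumptions: `TinyELM` is a hypothetical single-hidden-layer ELM (a fixed random sigmoid layer followed by ridge regression) standing in for a full HELM, the hidden size and regularization value are arbitrary, and the data are random toy frames rather than LPS features.

```python
import numpy as np

rng = np.random.default_rng(0)


def train_ridge(H, Y, lam=1e-3):
    """Closed-form ridge regression: W = (H^T H + lam I)^-1 H^T Y."""
    return np.linalg.solve(H.T @ H + lam * np.eye(H.shape[1]), H.T @ Y)


class TinyELM:
    """Hypothetical stand-in for a HELM: random hidden layer + ridge output."""

    def __init__(self, in_dim, hidden=64, lam=1e-3):
        self.W_in = rng.standard_normal((in_dim, hidden)) / np.sqrt(in_dim)
        self.b = rng.standard_normal(hidden)
        self.lam = lam
        self.W_out = None

    def _hidden(self, X):
        # Sigmoid activation on a fixed (untrained) random projection.
        return 1.0 / (1.0 + np.exp(-(X @ self.W_in + self.b)))

    def fit(self, X, Y):
        self.W_out = train_ridge(self._hidden(X), Y, self.lam)
        return self

    def predict(self, X):
        return self._hidden(X) @ self.W_out


DIM = 129                      # LPS dimension used in this chapter
RT60S = (0.3, 0.6, 0.9, 1.2)   # one component model per condition

# Toy data (assumption): random "anechoic" frames plus condition-dependent noise.
clean = rng.standard_normal((200, DIM))
reverb = {rt: clean + rt * rng.standard_normal((200, DIM)) for rt in RT60S}

# Offline stage, step 1: train one component model per reverberation condition.
components = {rt: TinyELM(DIM).fit(reverb[rt], clean) for rt in RT60S}


def integrate(X):
    """Build X_I by concatenating the four component outputs per frame."""
    return np.concatenate([components[rt].predict(X) for rt in RT60S], axis=1)


# Offline stage, step 2: train the fusion model on X_I for all training frames.
X_all = np.concatenate([reverb[rt] for rt in RT60S])
Y_all = np.concatenate([clean] * len(RT60S))
fusion = TinyELM(len(RT60S) * DIM).fit(integrate(X_all), Y_all)

# Online stage: every test frame passes through all components, then the fusion.
test_frames = clean[:5] + 0.6 * rng.standard_normal((5, DIM))
X_I = integrate(test_frames)
Y_hat = fusion.predict(X_I)
```

Note that the test frames are not routed to a single matching component: all four outputs are always computed, and the fusion model decides how to weight them.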


Figure 3.3: Offline and online stages of the ensemble HELM (eHELM) dereverberation framework.

3.4 Experiments

3.4.1 Experimental Setup

Description of the TIMIT Corpus

The TIMIT [121] corpus was used to evaluate the performance of the proposed HELM solutions. We selected 300 utterances as the training data and 100 utterances as the testing data, ensuring that no speaker appeared in both sets. While designing the reverberated data, we also made sure that neither the speakers nor the speech content overlapped between the training and testing sentences. Four room conditions were simulated to generate different acoustic characteristics: room 1 measured 12×4×6 m, room 2 measured 14×10×8 m, room 3 measured 18×14×8 m, and room 4 measured 20×20×20 m. The microphone positions for these four rooms were 2×2×1.6 m, 2×2×1.8 m, 2×2×2 m, and 6×2×2.2 m, respectively. We designed two training sets: (1) in training set 1, the 300 training utterances were convolved with a single RIR (1RIR) for each of the four RT60 values (i.e., RT60 ∈ {0.3, 0.6, 0.9, 1.2} s) to generate 300×4(RT60)×1(RIR) = 1200 reverberated training utterances (1.3 hours of reverberated training data); (2) in training set 2, we simulated more reverberation conditions by considering three RIRs (3RIRs) for each RT60 to generate 300×4(RT60)×3(RIRs) = 3600 reverberated training utterances (4 hours of reverberated training data).
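The construction of the two training sets can be sketched as follows. The counting reproduces the 1200 and 3600 utterance totals stated above. The reverberation step is only an illustration: `synthetic_rir` is a hypothetical helper that generates exponentially decaying noise reaching -60 dB at t = RT60, which is an assumed stand-in for the room-simulated RIRs used in the experiments, and the 16 kHz sampling rate is likewise an assumption.

```python
from itertools import product

import numpy as np

RT60S = (0.3, 0.6, 0.9, 1.2)
N_UTTS = 300


def count_conditions(n_rirs):
    """Number of (utterance, RT60, RIR) combinations in a training set."""
    return len(list(product(range(N_UTTS), RT60S, range(n_rirs))))


def synthetic_rir(rt60, fs=16000, seed=0):
    """Assumed toy RIR: white noise whose envelope decays by 60 dB at rt60."""
    rng = np.random.default_rng(seed)
    t = np.arange(int(round(rt60 * fs))) / fs
    envelope = 10.0 ** (-3.0 * t / rt60)   # amplitude 1e-3 (= -60 dB) at t = rt60
    rir = rng.standard_normal(t.size) * envelope
    return rir / np.max(np.abs(rir))


def reverberate(clean, rir):
    """Convolve a clean signal with an RIR and keep the original length."""
    return np.convolve(clean, rir)[: clean.size]


print(count_conditions(1))  # training set 1
print(count_conditions(3))  # training set 2

clean = np.random.default_rng(1).standard_normal(1000)  # placeholder signal
wet = reverberate(clean, synthetic_rir(0.6))
```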

The aforementioned 100 testing utterances were used to prepare two evaluation sets: a matched condition and a mismatched condition. In the matched condition, four rooms were simulated with the same RT60 values as in the training set (i.e., RT60 ∈ {0.3, 0.6, 0.9, 1.2} s) but with different room dimensions: 10×4×6 m, 12×14×6 m, 16×16×8 m, and 22×20×12 m, respectively. The microphone positions also differed from those of the training set; they were 2×2×1.6 m, 3×2×2 m, 4×3×2.2 m, and 5×2×2.5 m, respectively. In the mismatched condition, three rooms of dimensions 10×4×6 m, 14×14×6 m, and 18×16×8 m were simulated with RT60 values of 0.4, 0.8, and 1.0 s, respectively. The microphones for the mismatched test data were placed at the same positions as in the first three rooms of the matched testing data, namely 2×2×1.6 m, 3×2×2 m, and 4×3×2.2 m, respectively.

Evaluation Metrics

We evaluated our approach using six standardized objective metrics. The first four were PESQ [127], STOI [114], FwSSNR [128], and SRMR [129]. PESQ measures the quality of the dereverberated speech; its score ranges between −0.5 and 4.5, and a higher score indicates better speech quality. STOI computes speech intelligibility based on the correlation between the temporal envelopes of the dereverberated and anechoic speech over short-time segments; its score ranges between 0 and 1, with a higher score indicating better intelligibility. FwSSNR measures the frequency-weighted segmental signal-to-noise ratio between the dereverberated and anechoic speech, taking the articulation index weights into account. SRMR is a non-intrusive quality measurement of the reverberated and dereverberated speech. Higher FwSSNR and SRMR scores denote that the dereverberated speech is closer to the anechoic speech.

The remaining two measures, the cepstrum distance (Cep) and the log-likelihood ratio (LLR) [128], were also used to estimate the quality of the dereverberated speech signal. Cep estimates the spectral distance between the enhanced and clean reference speech signals, whilst LLR computes the ratio of the discrepancy between them. Smaller Cep and LLR values denote less distortion and better speech quality. The SRMR, Cep, and LLR metrics are provided by the REVERB challenge, which was designed specifically for the dereverberation task [123]. All the evaluation metrics except SRMR are obtained by comparing the estimated speech with the corresponding reference speech, whereas SRMR is obtained directly by computing the speech-to-reverberation modulation energy ratio of the estimated speech signal.

In this chapter, speech signals were processed using a moving window with a frame size of 16 ms and a frame shift of 8 ms. Subsequently, 129-dimensional LPS features were calculated for each speech frame.
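The framing above can be sketched in a few lines. Assuming a 16 kHz sampling rate (the native rate of TIMIT), a 16 ms frame is 256 samples and an 8 ms shift is 128 samples, so a real FFT yields 256/2 + 1 = 129 bins, which matches the stated LPS dimension. The Hann window, the small log floor, and the function names `analyze` and `synthesize` are assumptions made for illustration; the synthesis step mirrors the phase-plus-overlap-add reconstruction used in the online stage.

```python
import numpy as np

FS = 16000                 # assumed sampling rate
FRAME = FS * 16 // 1000    # 16 ms -> 256 samples
HOP = FS * 8 // 1000       # 8 ms  -> 128 samples
# Periodic Hann window: overlapping copies sum to exactly 1 at 50% overlap.
WINDOW = np.hanning(FRAME + 1)[:-1]


def analyze(signal):
    """Return (LPS, phase), each of shape (n_frames, 129)."""
    n_frames = 1 + (len(signal) - FRAME) // HOP
    frames = np.stack(
        [signal[i * HOP : i * HOP + FRAME] * WINDOW for i in range(n_frames)]
    )
    spec = np.fft.rfft(frames, axis=1)
    lps = np.log(np.abs(spec) ** 2 + 1e-12)   # log-power spectrum
    return lps, np.angle(spec)


def synthesize(lps, phase):
    """Rebuild a waveform from LPS and phase via inverse FFT and overlap-add."""
    magnitude = np.sqrt(np.exp(lps))
    frames = np.fft.irfft(magnitude * np.exp(1j * phase), n=FRAME, axis=1)
    out = np.zeros(HOP * (len(frames) - 1) + FRAME)
    for i, frame in enumerate(frames):
        out[i * HOP : i * HOP + FRAME] += frame
    return out


signal = np.random.default_rng(0).standard_normal(FS)  # 1 s placeholder signal
lps, phase = analyze(signal)
recon = synthesize(lps, phase)
```

In a dereverberation system, the `lps` passed to `synthesize` would be the model's enhanced LPS, while `phase` would still come from the reverberated input.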

3.4.2 Experimental Results

We first assessed the performance of the proposed frameworks using training set 1, namely 1200 reverberated-anechoic utterance pairs (the training set obtained using only a single RIR). Subsequently, we extended the experiments to a larger training set (training set 2), where 3600 reverberated-anechoic utterance pairs were employed to train the proposed models (training data generated with three RIRs).

Table 3.1: Average PESQ scores of HELM, HELM(Hwy), HELM(Res), and Reverb speech under specific reverberated conditions.

Testing RT60 (s)    0.3      0.6      0.9      1.2      Average
Reverb              2.7167   2.1194   1.6155   1.4342   1.9714
ws = 0
  HELM              2.7901   2.2743   1.8365   1.7001   2.1503
  HELM(Hwy)         2.7968   2.2762   1.8403   1.6979   2.1528
  HELM(Res)         2.8137   2.2747   1.8395   1.7018   2.1574

HELM(Hwy) and HELM(Res)

We first compare the performance of HELM, HELM(Hwy), and HELM(Res) against the unprocessed reverberated speech signals (denoted as Reverb). For an impartial comparison, all HELM models were trained using the entire training data covering all reverberation conditions (i.e., RT60 ∈ {0.3, 0.6, 0.9, 1.2} s), i.e., 1200 reverberated-anechoic utterance pairs, and tested using the matched testing set. In this set of experiments, we used the same regularization parameters for the HELM as reported in [94]. Table 3.1 lists the PESQ results of the conventional HELM and of the proposed HELM(Hwy) and HELM(Res) approaches. All HELM configurations comprised three nonlinear hidden layers containing 1000, 1000, and 4000 neurons ([1000 1000 4000]), respectively. The sigmoidal activation function was employed for the three HELM methods. Furthermore, no contextual information was used; that is, only the current speech frame was fed to the HELM input layer, and neighboring frames were not taken into account during training or testing. This is equivalent to setting the context input window size (ws) to zero. The last column in Table 3.1 shows the average PESQ scores over all reverberation conditions. The highest PESQ score for each particular reverberation condition has been highlighted in boldface.
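To make the role of the context window size ws concrete, the sketch below stacks each frame with its ws left and ws right neighbors, giving 2×ws+1 frames per input vector; with ws = 0 the input is just the current frame, as in the experiments above. The function name `stack_context` and the edge-replication padding at utterance boundaries are assumptions, since the boundary handling is not specified in the text.

```python
import numpy as np


def stack_context(frames, ws):
    """Concatenate each frame with its ws left and ws right neighbors.

    frames: array of shape (n_frames, dim)
    returns: array of shape (n_frames, (2 * ws + 1) * dim)
    """
    n_frames = frames.shape[0]
    # Assumed boundary handling: repeat the first and last frame.
    padded = np.pad(frames, ((ws, ws), (0, 0)), mode="edge")
    shifted = [padded[k : k + n_frames] for k in range(2 * ws + 1)]
    return np.concatenate(shifted, axis=1)


frames = np.random.default_rng(0).standard_normal((10, 129))  # placeholder LPS
no_context = stack_context(frames, ws=0)     # shape (10, 129)
with_context = stack_context(frames, ws=3)   # shape (10, 7 * 129)
```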

The experimental results in Table 3.1 demonstrate that the conventional HELM notably outperforms Reverb (the reverberated speech signals). The improvement was larger for RT60 ≥ 0.6 s, i.e., under more severe reverberation conditions. Furthermore, the HELM(Hwy) and HELM(Res) models achieved slightly better average PESQ results than the conventional HELM, which supports the effectiveness of incorporating low-level information into the spectral mapping stage.
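The size of the improvement at each RT60 can be checked directly from the HELM and Reverb rows of Table 3.1. The short script below only reproduces that arithmetic; the variable names are illustrative.

```python
RT60S = (0.3, 0.6, 0.9, 1.2)
reverb = (2.7167, 2.1194, 1.6155, 1.4342)   # Reverb row, Table 3.1
helm = (2.7901, 2.2743, 1.8365, 1.7001)     # HELM row (ws = 0), Table 3.1

# PESQ gain of HELM over the unprocessed reverberated speech per condition.
gains = {rt: round(h - r, 4) for rt, r, h in zip(RT60S, reverb, helm)}

for rt, gain in gains.items():
    print(f"RT60 = {rt} s: PESQ gain = {gain:.4f}")
```

The gains are 0.0734, 0.1549, 0.2210, and 0.2659 for RT60 = 0.3, 0.6, 0.9, and 1.2 s, respectively, so the benefit grows steadily with the reverberation time.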

Context Analysis

In this section, we investigate the relationship between the dereverberation performance and the amount of context information by varying the context size (= 2×ws+1 frames) from 1 (ws = 0) to 11 (ws = 5). Table 3.2 presents the PESQ scores delivered by HELM, HELM(Hwy), and HELM(Res) with different ws values. The same architecture, i.e., [1000 1000 4000], was used for these three HELM models. From Table 3.2, we first note that under mild reverberated conditions (i.e., RT60 = 0.3 s and 0.6 s), less context information can yield more effective dereverberation results for all three HELM models. On the other hand, under more severe reverberated conditions (i.e., RT60 = 0.9 s and 1.2 s), more context information provides higher PESQ scores. The results further show that HELM(Res) consistently outperforms HELM and HELM(Hwy) in terms of average PESQ for every ws value. HELM(Res) thus achieves more effective dereverberation even though both approaches (HELM(Hwy) and HELM(Res)) share the same underpinning, i.e., training very deep neural architectures by passing low-level information to higher levels.

Table 3.2: Average PESQ scores of HELM, HELM(Hwy), and HELM(Res) with different context information.

Testing RT60 (s)    0.3      0.6      0.9      1.2      Avg.
Reverb              2.7167   2.1194   1.6155   1.4342   1.9714
ws = 0
  HELM              2.7901   2.2743   1.8365   1.7001   2.1503
  HELM(Hwy)         2.7968   2.2762   1.8403   1.6979   2.1528
  HELM(Res)         2.8137   2.2747   1.8395   1.7018   2.1574
ws = 1
  HELM              2.6607   2.3062   1.8943   1.7472   2.1521
  HELM(Hwy)         2.6697   2.3107   1.8973   1.7487   2.1566
  HELM(Res)         2.7272   2.3059   1.9031   1.7461   2.1705
ws = 2
  HELM              2.6100   2.3081   1.9304   1.7985   2.1618
  HELM(Hwy)         2.6574   2.3332   1.9404   1.8027   2.1834
  HELM(Res)         2.7186   2.3586   1.9565   1.8037   2.2093
ws = 3
  HELM              2.5944   2.3308   1.9821   1.8454   2.1882
  HELM(Hwy)         2.6431   2.3314   1.9837   1.8448   2.2007
  HELM(Res)         2.7804   2.4101   2.0234   1.8657   2.2699
ws = 4
  HELM              2.5691   2.2978   2.0038   1.8749   2.1864
  HELM(Hwy)         2.5705   2.3163   2.0083   1.8765   2.1929
  HELM(Res)         2.6684   2.3658   2.0317   1.8846   2.2376
ws = 5
  HELM              2.5396   2.2880   2.0341   1.8944   2.1890
  HELM(Hwy)         2.5704   2.3447   2.0478   1.9017   2.2161
  HELM(Res)         2.6679   2.3717   2.0787   1.9113   2.2574

To quantify the statistical significance of the proposed frameworks, we employed a two-sample t-test for each test reverberation condition (i.e., RT60 ∈ {0.3, 0.6, 0.9, 1.2} s) using ws = 0 and ws = 3, the latter being the setting for which we obtained the best average performance. The significance test was applied to examine whether the improvement in performance was due to some random effect. For the null hypothesis, we assumed that the mean of each proposed framework (i.e., µ_HELM(Hwy) or µ_HELM(Res)) was not different from that of the original HELM (µ_HELM). For ws = 0, the p-values indicated that the null hypothesis was rejected only for HELM(Res) under the RT60 = 0.3 s condition, with a p-value of 0.02 (≤ 0.05) for the HELM versus HELM(Res) comparison. Nonetheless, for ws = 3, HELM(Res) was shown to be significantly better than HELM for every reverberation condition (RT60 ∈ {0.3, 0.6, 0.9, 1.2} s) by providing very small p-values, also indicating better capabilities when compared with HELM(Hwy). Residual networks reformulate the desired transformation with respect to a reference input layer by using identity shortcuts that are parameter-free, which facilitates learning, whereas highway networks have additional parameters [130] that may cause overfitting when the amount of training data is small, resulting in poorer performance than HELM(Res). Consistent with the findings reported in [130], we observed a significant performance improvement for residual networks over highway networks. We can also note that, among all context sizes, ws = 3 achieved the highest average PESQ in Table 3.2 (2.2699, with HELM(Res)), although for HELM and HELM(Hwy) the ws = 5 averages were marginally higher (2.1890 vs. 2.1882 and 2.2161 vs. 2.2007). Therefore, we report the results of using ws = 3 in the following discussion.
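A two-sample significance test of the kind described above can be sketched with the Python standard library. This is a hedged illustration, not the procedure actually used: it assumes the unequal-variance (Welch) form of the t statistic, approximates the two-sided p-value with a standard normal distribution (reasonable only for large samples such as 100 test utterances), and runs on made-up per-utterance PESQ scores, because the individual scores are not reported in this chapter.

```python
import random
from math import sqrt
from statistics import NormalDist, mean, variance


def welch_t_test(a, b):
    """Return (t statistic, approximate two-sided p-value) for two samples.

    The p-value uses a normal approximation to the t distribution,
    which is an assumption that only holds well for large samples.
    """
    standard_error = sqrt(variance(a) / len(a) + variance(b) / len(b))
    t_stat = (mean(a) - mean(b)) / standard_error
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(t_stat)))
    return t_stat, p_value


# Hypothetical per-utterance PESQ scores (NOT data from the experiments).
rng = random.Random(0)
scores_helm = [rng.gauss(2.19, 0.30) for _ in range(100)]
scores_res = [rng.gauss(2.27, 0.30) for _ in range(100)]

t_stat, p_value = welch_t_test(scores_res, scores_helm)
reject_null = p_value <= 0.05   # null hypothesis: the two means are equal
```

Rejecting the null hypothesis at the 0.05 level means that a difference in mean PESQ this large would be unlikely if the two systems performed equally well.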

Ensemble HELM

In this section, we present our results concerning the ensemble HELM frameworks. Training set 1, namely 1200 reverberated-anechoic utterance pairs, was employed in this set of experiments. In the offline stage, we used acoustic knowledge to split the entire dataset into subsets (this is therefore denoted as the knowledge-based approach, KB). Each subset of data was used to train one dereverberation model that characterizes the mapping function from a specific reverberated condition to the clean condition, as described in Section 3.3.5. Here, four component models were trained, corresponding to the four reverberation conditions (i.e., RT60 ∈ {0.3, 0.6, 0.9, 1.2} s). Subsequently, a fusion model was trained to combine the outputs of the four component models and generate the final dereverberated speech signal that matches the reference anechoic one. In the online stage, the input speech was independently processed by each of the four component models. The fusion model then integrated the outputs of the four models to obtain the dereverberated speech.

To confirm the effectiveness of the KB scheme, we designed a comparative data clustering scheme that divided the training data into subsets in a random-sampling (RS) manner, where no knowledge of the environment characteristics was involved. Based on the RS scheme, four subsets of training data were prepared; each subset contained 300 reverberated-anechoic utterance pairs randomly sampled from the entire set of 1200 reverberated-anechoic utterance pairs. Each subset was used to train a dereverberation model. Once the four models were trained, a fusion model was estimated. In the online stage, the incoming test utterance was processed by the four component models, and the fusion model integrated the outputs of these four models to generate the final dereverberated speech.
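The difference between the two clustering schemes can be illustrated with a small sketch. Each training pair is represented here only by a hypothetical (utterance index, RT60) tuple. The KB split groups the pairs by RT60, whereas the RS split draws four subsets of 300 pairs at random. Whether the RS subsets were drawn independently or as a disjoint partition is not stated in the text; independent draws are assumed below.

```python
import random
from collections import defaultdict

RT60S = (0.3, 0.6, 0.9, 1.2)

# Training set 1: 300 utterances x 4 RT60 values = 1200 pairs.
pairs = [(utt, rt) for rt in RT60S for utt in range(300)]


def split_kb(pairs):
    """Knowledge-based split: one subset per reverberation condition."""
    subsets = defaultdict(list)
    for utt, rt in pairs:
        subsets[rt].append((utt, rt))
    return dict(subsets)


def split_rs(pairs, n_subsets=4, size=300, seed=0):
    """Random-sampling split: each subset is drawn without regard to RT60.

    Assumption: subsets are sampled independently, so they may overlap.
    """
    rng = random.Random(seed)
    return [rng.sample(pairs, size) for _ in range(n_subsets)]


kb_subsets = split_kb(pairs)
rs_subsets = split_rs(pairs)
```

Each KB subset contains a single RT60, so its component model specializes in one condition, while each RS subset mixes all four conditions.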

Table 3.3: Average PESQ scores of the three ensemble HELM frameworks with the RS and KB schemes.

Scheme   Method        Testing RT60 (s)
                       0.3      0.6      0.9      1.2      Avg.
         Reverb        2.7167   2.1194   1.6155   1.4342   1.9714
RS       eHELM         2.4860   2.2296   1.9494   1.8366   2.1254
         eHELM(Hwy)    2.5207   2.2954   1.9607   1.8472   2.1560
         eHELM(Res)    2.7101   2.3767   2.0331   1.9001   2.2550
KB       eHELM         2.5686   2.3155   1.8827   1.7551   2.1305
         eHELM(Hwy)    2.6473   2.3365   1.8829   1.7717   2.1596
         eHELM(Res)    2.9265   2.4672   1.9471   1.8132   2.2885


We used the original HELM, HELM(Hwy), and HELM(Res) to build the ensemble and fusion models; the corresponding ensemble frameworks were termed eHELM, eHELM(Hwy), and eHELM(Res), respectively. For all of these ensemble HELM models, we employed the same number of hidden layers and neurons for a fair comparison.

Table 3.3 presents the average PESQ scores for the three ensemble HELM frameworks with both the RS and KB clustering schemes. From Table 3.3, we first note that all of the ensemble HELM frameworks, with either the KB or the RS clustering scheme, outperformed the Reverb speech by notable margins for RT60 ≥ 0.6 s. The exception is the RT60 = 0.3 s condition, where Reverb (2.7167) had a better PESQ score than every ensemble framework other than eHELM(Res) with the KB clustering (2.9265). Next, we observe that eHELM performed the worst among the three ensemble HELM frameworks, exhibiting the lowest PESQ score at each RT60 (RT60 ∈ {0.3, 0.6, 0.9, 1.2} s). Moreover, in relatively mild reverberation conditions (RT60 ∈ {0.3, 0.6} s), all ensemble HELM frameworks with the KB clustering scheme achieved better performance than those with the RS clustering scheme. On the other hand, in relatively severe reverberation conditions (RT60 ∈ {0.9, 1.2} s), all ensemble HELM frameworks with the RS clustering scheme outperformed their KB counterparts. For both the RS and KB clustering schemes, eHELM(Res) performed best, yielding consistent improvements at each RT60 over the other frameworks. In terms of average PESQ scores, the KB clustering scheme yielded higher scores than the RS clustering scheme. Therefore, in the following discussion, we only report the results of the ensemble HELM frameworks adopting the KB clustering scheme.


Ensemble HELM vs. Existing Approaches

In this subsection, we compare the proposed ensemble HELM frameworks with conventional dereverberation approaches. In this set of experiments, we employed training set 2 (3600 reverberated-anechoic utterance pairs, as described in Section 3.4.1) to train the three ensemble HELM frameworks, namely eHELM, eHELM(Hwy), and eHELM(Res).

Two conventional dereverberation approaches were carried out for comparison. The first is the Wu-Wang approach, a two-stage speech dereverberation system that adopts inverse filtering and spectral subtraction to handle early and late reverberation, respectively [131]. The second is a recently proposed coherent-to-diffuse power ratio (CDR) estimation method [132]. In this approach, the CDRs between two omnidirectional microphones are estimated for dereverberation using one of several known CDR estimators. In our experiments, the estimator with unknown direction of arrival (DOA) and unknown noise coherence was adopted for comparison.

In addition to the conventional approaches, we compared the ensemble frameworks against a recently proposed neural-based integrated deep and ensemble learning algorithm (IDEA) [124], which uses DDAE models as the component models and a CNN as the fusion model. For comparison with HELM(Hwy) and HELM(Res), we also adopted the highway DDAE and residual DDAE as component models, while using a CNN as the fusion model; the resulting systems are termed IDEA, IDEA(Hwy), and IDEA(Res), respectively. For the three IDEA systems, we followed the same model architectures as those used in [124] because the total number of training samples in our experiment was comparable to that in [124]. Moreover, preliminary experiments confirmed that the IDEA architectures achieved very good performance on the TIMIT dataset. Therefore, we used the best architecture of IDEA in [124] as a comparative dereverberation system. Each DDAE model in the abovementioned IDEA systems consisted of three hidden layers with 2048 hidden neurons each; the CNN fusion model consisted of three hidden layers, i.e., two convolutional layers with 32 channels each and a fully connected layer with 2048 nodes. The learning rate for the ensemble learning models was set to 0.0002, the same as that used in [108], with a minibatch size of 128. The number of epochs was set to 100. The results of the Reverb, Wu-Wang, CDR, three IDEA systems, and three ensemble HELM systems are reported in Table 3.4.

Table 3.4: Average PESQ scores of ensemble HELM and IDEA frameworks in the matched testing conditions.

Testing RT60 (s)    0.3      0.6      0.9      1.2      Avg.
Reverb              2.7167   2.1194   1.6155   1.4342   1.9714
Wu-Wang             2.5450   2.1511   1.8054   1.6933   2.0487
CDR                 2.7507   2.1749   1.8190   1.6981   2.1106
IDEA                2.3712   2.1638   1.8196   1.7047   2.0148
IDEA(Hwy)           2.6314   2.1932   1.8331   1.7355   2.0982
IDEA(Res)           2.7260   2.2277   1.8475   1.7357   2.1342
eHELM               2.4899   2.2724   1.9129   1.7799   2.1137
eHELM(Hwy)          2.6010   2.3674   1.9484   1.7961   2.1782
eHELM(Res)          2.8531   2.4962   1.9943   1.8278   2.2936

From Table 3.4, the proposed ensemble HELM frameworks, i.e., eHELM, eHELM(Hwy), and eHELM(Res), notably outperformed both the Reverb and Wu-Wang approaches in terms of average PESQ. Among the three eHELM systems, eHELM(Res) yielded the best performance, confirming the effectiveness of the residual architecture. eHELM(Res) also outperformed all of the IDEA systems consistently over the different reverberated conditions.

Ensemble HELMs with More Complex Architectures

By comparing the results in Tables 3.3 and 3.4, we note that the average PESQ scores of eHELM(Hwy) and eHELM(Res) improved (from 2.1596 to 2.1782 and from 2.2885 to 2.2936, respectively) when we increased the number of training utterance pairs from 1200 to 3600, whereas that of eHELM did not (2.1305 vs. 2.1137). This motivated us to increase the complexity of the component models in the ensemble HELM frameworks and verify whether further improvements could be attained. For all of the HELM models, a more complex architecture of [1000 1000 10000] was used because such a setup gave better results in our previous speech enhancement experiments [41]; the corresponding ensemble HELM models were termed eHELM_D, eHELM_D(Hwy), and eHELM_D(Res). For comparison, we also considered DDAE models with deeper structures in the IDEA framework, as used in [124]: each DDAE model had six hidden layers with 2048 hidden neurons each, and the CNN model consisted of three hidden layers, i.e., two convolutional layers with 32 channels each and a fully connected layer with 2048 nodes. The corresponding ensemble IDEA frameworks were termed IDEA_D, IDEA_D(Hwy), and IDEA_D(Res).

We first present the results of these models tested on the matched conditions (namely, the testing conditions with RT60 ∈ {0.3, 0.6, 0.9, 1.2} s). Table 3.5 displays the PESQ performance of the IDEA and ensemble HELM frameworks. Comparing Tables 3.4 and 3.5, we first note that both the IDEA and the ensemble HELM frameworks performed better with the more complex structures. Moreover, from Table 3.5, IDEA_D(Res) and eHELM_D(Res) performed the best among the IDEA and ensemble HELM frameworks, respectively, which is consistent with the results reported in Table 3.2 and again confirms the effectiveness of the residual structure.


Table 3.5: Average PESQ scores of ensemble HELM and IDEA frameworks with complex structures in the matched testing conditions.

Testing RT60 (s)    0.3      0.6      0.9      1.2      Avg.
IDEA_D              2.4485   2.1994   1.8379   1.7282   2.0535
IDEA_D(Hwy)         2.7598   2.2424   1.8553   1.7339   2.1478
IDEA_D(Res)         2.8538   2.2808   1.9010   1.7744   2.2025
eHELM_D             2.5379   2.3302   1.9218   1.7842   2.1435
eHELM_D(Hwy)        2.6408   2.3755   1.9509   1.8159   2.1957
eHELM_D(Res)        2.8902   2.5242   2.0177   1.8496   2.3204

In addition to the PESQ scores, we report the average STOI, SRMR, FwSSNR, Cep, and LLR results in Fig. 3.4. Here, we only show the IDEA_D(Res) and eHELM_D(Res) results, as these models achieved the best performance among the IDEA and ensemble HELM frameworks, respectively, as shown in Table 3.5. The results of the Reverb, Wu-Wang, and CDR approaches are also listed for comparison. The average results presented in Fig. 3.4 were scaled to the range between 0 and 1.

Figure 3.4: Average STOI, SRMR, FwSSNR, Cep, and LLR scores of Reverb, Wu-Wang, CDR, IDEA_D(Res), and eHELM_D(Res) in the matched testing conditions.

From Fig. 3.4, we observe that eHELM_D(Res) outperformed the other approaches by providing better speech intelligibility (higher STOI scores), with an average score of 0.7288 compared with Reverb (0.6326), Wu-Wang (0.6081), CDR (0.6968), and IDEA_D(Res) (0.7035) for the matched testing conditions. Similarly, the proposed eHELM_D(Res) framework maintained better reverberation suppression, yielding a high FwSSNR score and low Cep and LLR scores. The figure also shows that the conventional approaches, i.e., Wu-Wang and CDR, could attain higher SRMR scores, but they showed the worst performance on the Cep and LLR metrics, which are highly correlated with the quality of the dereverberated speech signals; this indicates an overestimation of the reverberation. The overestimation is caused by a suppression due to

In the document: Application of Hierarchical Extreme Learning Machines to Speech Signal Processing - NCCU Academic Hub (pages 73-113)