6. Incremental MLLR Speaker Adaptation by Fuzzy Logic Control
6.2 FLC-MLLR Adaptation (FCMAP-Like in Form)
6.3.1 Database and Experiment Design
The experiments involve (1) the establishment of initial SI models, (2) the training phase for fixing hyperparameters of the FLC and (3) the recognition phase for performance evaluation on the tuning of weight by the FLC (FLC-MLLR) in Section 6.2.
An 8 kHz sampling rate was set for speech signal acquisition. The analysis frames were 30-ms wide with a 20-ms overlap. For each frame, a 24-dimensional feature vector was extracted, which was made up of a 12-dimensional mel-cepstral vector and a 12-dimensional delta-mel-cepstral vector.
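As a quick check of the framing arithmetic, the sketch below converts the stated 8 kHz sampling rate, 30-ms frame width and 20-ms overlap into sample counts. The function names and the policy of dropping a trailing partial frame are illustrative assumptions, not details of the experiments.

```python
SAMPLE_RATE_HZ = 8000   # from the text
FRAME_MS = 30           # frame width, from the text
OVERLAP_MS = 20         # overlap between consecutive frames, from the text

def frame_geometry(sample_rate_hz=SAMPLE_RATE_HZ, frame_ms=FRAME_MS, overlap_ms=OVERLAP_MS):
    """Return (frame length, hop size) in samples."""
    frame_len = sample_rate_hz * frame_ms // 1000
    hop = sample_rate_hz * (frame_ms - overlap_ms) // 1000
    return frame_len, hop

def num_frames(num_samples, frame_len, hop):
    """Count full frames in a signal; a trailing partial frame is dropped (assumption)."""
    if num_samples < frame_len:
        return 0
    return 1 + (num_samples - frame_len) // hop

frame_len, hop = frame_geometry()
print(frame_len, hop)                     # 240 samples per frame, 80-sample (10-ms) hop
print(num_frames(8000, frame_len, hop))   # full frames in one second of speech
```

With these settings each frame advances by 10 ms, so one second of speech yields roughly one hundred 24-dimensional feature vectors.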
The initial models which were used as the speaker independent models were constructed using the database, MAT400 sub-database DB3 [119]. The details of the establishment of the initial SI models as a set of HMM parameters are entirely the same as the aforementioned adaptation experiments in Section 4.2.1.
The training data used for tuning the hyperparameters of the FLC were collected from 15 speakers in the training phase. From each of the 15 speakers, 10 utterances of city names (picked among 30 cities) were requested as adaptation data, and then 60 utterances for all 30 cities (two utterances for each) as FLC parameter tuning data; all utterances were recorded by an ordinary microphone. For readability and clearness, the training phase experiment procedure is described in the pseudo-code sequence below.
F^0 ← baseline recognition rate; t = 0;
Repeat
{ t++;
  F_2^t ← 2_utterances_training (SI_models, hyperparameters);
  F_4^t ← 4_utterances_training (SI_models, hyperparameters);
  F_6^t ← 6_utterances_training (SI_models, hyperparameters);
  F_8^t ← 8_utterances_training (SI_models, hyperparameters);
  F_10^t ← 10_utterances_training (SI_models, hyperparameters);
  F^t = (1/5) Σ_{i=1}^{5} F_{2i}^t;
  ΔF^t = F^t − F^{t−1};
} until ΔF^t < threshold;
where 2i_utterances_training(.), i = 1, 2, 3, 4, 5, is the procedure that uses 2i adaptation utterances from each of the 15 speakers to fix the 9 hyperparameters of the FLC defined in Section 6.2 and returns a better-than-baseline overall recognition rate F_{2i}^t for the 15 training speakers, as explained in the code-like sequence below.
2i_utterances_training (SI_models, hyperparameters) // i = 1, 2, 3, 4, 5.
{ k = 0;
  F_{2i}^0 ← baseline recognition rate;
  Repeat { k++;
    F_{(2i)1}^k ← speaker_training (SI_models, test_data_1, hyperparameters, 2i_utterances_1);
    ...
    F_{(2i)j}^k ← speaker_training (SI_models, test_data_j, hyperparameters, 2i_utterances_j);
    ...
    F_{(2i)15}^k ← speaker_training (SI_models, test_data_15, hyperparameters, 2i_utterances_15);
    F_{2i}^k = (1/15) Σ_{j=1}^{15} F_{(2i)j}^k;
    ΔF_{2i} = F_{2i}^k − F_{2i}^{k−1}; } until ΔF_{2i} < threshold_1;
  return F_{2i}^k; };
where 2i_utterances_j and test_data_j denote, respectively, the 2, 4, 6, 8 or 10 adaptation utterances used for the MLLR estimate of Ws and the 60 test utterances from the jth speaker, 1 ≤ j ≤ 15, used for tuning the 9 hyperparameters of the proposed FLC mechanism.
And speaker_training(.) is the procedure that would incrementally adapt the SI models by appropriate settings of the hyperparameters of the T-S FLC, as already described in Section 6.2, such that the adaptation would not jeopardize the recognition rate, given 2i utterances.
speaker_training (SI_models, test_data_j, hyperparameters, 2i_utterances_j) // j = 1, …, 15.
{
  Estimation_of_Ws (2i_utterances_j);
  F_{(2i)j} ← Iterative_process (SI_models, test_data_j, Ws, hyperparameters);
  // as described in Section 6.2 for maximizing the recognition rate F_{(2i)j}.
  return F_{(2i)j}; };
As a result, a set of FLC hyperparameters {a1, a2, a3, b1, b2, b3, N1, N2 and N3} was determined.
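The nested stopping logic of the training phase can be sketched as follows. This is a minimal illustration under stated assumptions: `speaker_training` is passed in as a callable, the hyperparameters are a mutable dictionary updated in place, `toy_speaker_training` and its single parameter `w` are hypothetical stand-ins for the real recognizer, and the `max_iters` / `max_rounds` guards are additions that the procedure above does not have.

```python
def average(rates):
    return sum(rates) / len(rates)

def utterances_training(speaker_training, speakers, n_utts, hp, baseline,
                        threshold_1, max_iters=50):
    """Inner loop: tune on every speaker, then stop once the gain in the
    speaker-averaged recognition rate falls below threshold_1."""
    previous = baseline
    for _ in range(max_iters):
        current = average([speaker_training(spk, n_utts, hp) for spk in speakers])
        if current - previous < threshold_1:
            return current
        previous = current
    return previous

def training_phase(speaker_training, speakers, hp, baseline, threshold,
                   threshold_1, max_rounds=50):
    """Outer loop: average over the 2, 4, 6, 8 and 10-utterance budgets and
    stop once the gain in that average falls below threshold."""
    previous = baseline
    for _ in range(max_rounds):
        per_budget = [
            utterances_training(speaker_training, speakers, 2 * i, hp,
                                baseline, threshold_1)
            for i in range(1, 6)
        ]
        current = average(per_budget)
        if current - previous < threshold:
            return current
        previous = current
    return previous

def toy_speaker_training(speaker, n_utts, hp):
    # Hypothetical stand-in: nudges one parameter toward a toy optimum of 0.5
    # and reports a toy recognition rate that peaks at 90%.
    hp["w"] += 0.25 * (0.5 - hp["w"])
    return 90.0 - 20.0 * abs(hp["w"] - 0.5)

hp = {"w": 0.9}
final_rate = training_phase(toy_speaker_training, range(15), hp,
                            baseline=80.0, threshold=0.01, threshold_1=0.01)
print(final_rate, hp)
```

Passing the per-speaker routine as a callable keeps the two stopping rules separate from whatever recognizer and FLC update are actually used.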
In the recognition phase, a group of 15 speakers entirely different from the previous group was recruited, and each was again asked for 10 utterances for adaptation and 60 utterances for recognition. The weight used for adaptation is calculated with the hyperparameters acquired in the training stage. For comparison, full transformation matrices were used for standard MLLR, MAPLR and the proposed FLC-MLLR. Since the amount of adaptation data was very small, only one common regression matrix tying all states was used for MLLR to make the most efficient use of the available adaptation data; MAPLR and FLC-MLLR likewise each used a single regression matrix. The prior densities required by MAPLR were derived directly from the SI models alone. For the recognition experiment with FLC-MLLR adaptation, five adapted models were constructed using 2, 4, 6, 8 and 10 adaptation utterances from each of the 15 speakers; the weight for each of the five adaptations was calculated by Eq. (6-4) with N_utterances = 2, 4, 6, 8 or 10 and the FLC hyperparameters determined in the training phase. Five MLLR-adapted and five MAPLR-adapted models, using 2, 4, 6, 8 and 10 adaptation utterances respectively, were also constructed for performance comparison. Then the 60 utterances from each of the 15 speakers were fed into the five adapted models for the respective recognition rate evaluation.
6.3.2 Experiment Results
During the training phase, some experiment results and observations were acquired. It is observed that the weight decreases as the number of adaptation utterances increases. As depicted in Fig. 6.6, the weight drops by a noticeable step when the number of utterances increases from 2 to 4 and then declines gradually, becoming fairly stable, as the number of utterances increases further.
The tendency of the weight curve in FLC-MLLR is very similar to the one derived from the precursory investigation (Fig. 6.3). Both decrease quickly until the number of adaptation utterances reaches 4 and then fall progressively to a stable value around 0.47 as more adaptation utterances become available.
Fig. 6.6. The curve of the training values of the weight in FLC-MLLR adaptation.
In addition, recognition performance was compared, for various numbers of adaptation utterances, among the proposed FLC-MLLR utilizing a T-S FLC, the conventional MLLR without exploiting prior knowledge of the initial SI model, and the MAPLR with the prior density derived directly from the initial SI model alone. As shown in Fig. 6.7, FLC-MLLR is better than MAPLR and MLLR in all cases, especially when training data are quite limited. It is also worth noting that the performance of MLLR falls below the baseline when only 2 utterances are available for adaptation, indicating that an improper adaptation may be worse than none at all. All three methods show an improved recognition rate, and MAPLR tends to catch up with FLC-MLLR, as the amount of training data increases.
[Figure: recognition rate (%), on an axis from 82 to 96, versus the number of utterances for adaptation (2, 4, 6, 8, 10), with curves for FLC-MLLR, MAPLR, MLLR and the baseline.]
Fig. 6.7. The performance curves of FLC-MLLR, MAPLR and conventional MLLR in the recognition testing experiments with different amounts of adaptation data.
Finally, the effects of varying the weight on the recognition performance of MLLR under extreme cases of training data availability are also observed, as shown in Fig. 6.8 and Fig. 6.9 respectively. The former shows that when the training data are scarce (2 utterances, say), the performance goes below the baseline if the weight is somewhat less than 0.5, because the model adaptation is then largely determined by the transformation matrix Ws, which is very likely poorly estimated. As the weight increases, the influence of Ws on the adaptation is reduced and the recognition rate improves as expected. However, when the weight goes well beyond 0.5, the performance degrades, as if the system in a sense ceases to adapt. On the other hand, when the training data are sufficient (10 utterances, for instance), full advantage of the adaptation by Ws should be taken by using a small weight for good performance, as depicted in Fig. 6.9.
Fig. 6.8. Numbers of adaptation utterances = 2 (MLLR testing experiments).
Fig. 6.9. Numbers of adaptation utterances = 10 (MLLR testing experiments).
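The trade-off discussed above can be made concrete with a small sketch. It assumes, as the discussion implies but this section does not restate, that the adapted mean is a convex combination of the SI mean and the MLLR-transformed mean, with the weight applied to the SI mean. The names `weight`, `mu_si`, `W` and `xi`, the extended-mean convention [1, mu] and the toy numbers are all illustrative assumptions.

```python
def mllr_mean(W, xi):
    """Apply an n x (n+1) regression matrix W to the extended mean xi = [1, mu_1, ..., mu_n]."""
    return [sum(w_rc * x for w_rc, x in zip(row, xi)) for row in W]

def blended_mean(weight, mu_si, W):
    """Assumed form: weight * (SI mean) + (1 - weight) * (MLLR-transformed mean)."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    xi = [1.0] + list(mu_si)
    mu_mllr = mllr_mean(W, xi)
    return [weight * m + (1.0 - weight) * t for m, t in zip(mu_si, mu_mllr)]

mu_si = [1.0, 2.0]
# Toy regression matrix: bias column first, then an identity block,
# so it shifts the mean by (+0.5, -0.5).
W = [[0.5, 1.0, 0.0],
     [-0.5, 0.0, 1.0]]

print(blended_mean(1.0, mu_si, W))   # no adaptation: the SI mean is kept
print(blended_mean(0.0, mu_si, W))   # plain MLLR: fully determined by W
print(blended_mean(0.5, mu_si, W))   # halfway between the two
```

Under this reading, a weight near 1 leaves the SI model essentially untouched, whereas a weight near 0 trusts the estimated transformation completely, which is only safe when enough adaptation data are available.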
As the final observation, the computing cost of FLC-MLLR involves the computation of the weight and of Ws. Computing Ws is the same as in the standard MLLR estimate.
The overhead of finding the weight, in terms of the number of multiplications, can be analyzed through its computation defined by Eq. (6-4).
For N1 ≤ N ≤ N2,
Thus the computation of Eq. (6-1) is of the same order as that of Eq. (2-22), given that Ws is estimated by MLLR.
6.4 Fuzzy Mechanisms for the Context of Multiple Regression Classes
Whenever appropriate, the acoustic model space can be partitioned into a number of subspaces, each being a base class as referred to in related works. In such a context, a transformation matrix is derived for each base class for which in-class adaptation data are available, so that a component in the class can be adapted accordingly.
Gales proposed a fuzzy clustering scheme [111] for determining the weight $\lambda_p$ in the adaptation below, which is essentially a linear combination of the MLLR transformations by the matrices associated with every regression class,
$\hat{\mu}_s = \sum_{p=1}^{P} \lambda_p \left( W_p\, \xi_s \right)$, (6-5)
where $\lambda_p$ represents the degree to which s belongs to the regression class p.
Note that the role and purpose of the fuzzy techniques in Gales’s work are completely different from those of the FLC mechanism for tuning the weight in FLC-MLLR herein.
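Interestingly enough, Eq. (6-5) could be extended as
$\tilde{\mu}_s = \sum_{p=1}^{P} \lambda_p \left( \rho\, \mu_s + (1 - \rho)\, W_p\, \xi_s \right)$, (6-6)
with $\lambda_p$ the class weight of Eq. (6-5) and $\rho$ the FLC-tuned weight, which could reduce to the form of Eq. (6-1) in the context of one regression class adaptation (i.e., P = 1), as is the case considered in the dissertation.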
Chapter 7
Audio Event Detection Using Variable-Length Decision Windows
Detecting female screaming in three environments of different acoustic backgrounds was exploited in the research to examine the behavior of an FLC-regulating mechanism embedded in an audio event detection system for decision window length control.
A typical process for audio event detection feeds the stream of audio frames (that is, vectors of extracted acoustic features) into the event classifier, which analyzes a pre-determined number of successive audio frames and then decides whether an audio event is detected over the associated time span, the so-called decision window (DW) mentioned in Section 2.3.2. Fig. 7.1 depicts a stream of fixed-length decision windows, each of which covers the same number of audio frames and is thus of the same time span.
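[Figure: a stream of acoustic frames along the time axis t, partitioned into consecutive decision windows DW_{i−1}, DW_i and DW_{i+1} with decision instants T_{i−1}, T_i and T_{i+1}; each window covers n frames.]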
Fig. 7.1. DW with fixed-length, each covering the same number of audio frames, n, over the time span.
As is clearly seen in Fig. 7.1, for a fixed-length DW covering n audio frames of t ms each, the process makes an event detection decision every nt ms, regardless of whether the auditory situation in the context is calm or tense. A DW that is too long raises the concern of real-time response, which is essential to all surveillance and security applications, whereas one that is too short instead suffers from false alarms triggered by sudden or intermittent acoustic changes in the background, which is equally undesirable.
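The fixed-length scheme can be sketched in a few lines. The helper names and the choice to drop a trailing partial window are illustrative assumptions; the point is only that a decision becomes available every n·t ms regardless of the content of the frames.

```python
def fixed_windows(frames, n):
    """Split a frame sequence into consecutive windows of n frames each;
    a trailing partial window is dropped (assumption)."""
    return [frames[i:i + n] for i in range(0, len(frames) - n + 1, n)]

def decision_times_ms(num_frames, n, t_ms):
    """Times at which a decision becomes available: one every n * t_ms."""
    return [(k + 1) * n * t_ms for k in range(num_frames // n)]

frames = list(range(7))                  # seven dummy frames
print(fixed_windows(frames, 3))          # two full windows of three frames
print(decision_times_ms(len(frames), 3, 20))
```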
7.1 Concepts of Short Timeslot Likelihood Difference (STLD)
The idea of a variable-sized DW thus arises and is the core of the proposed audio event detection system in this dissertation. The decision window should be short in a somewhat “aurally hot” situation, so that event detection decisions can be made at a higher rate, and should be stretched at “aurally calm” moments to collect more audio frames and so ensure the reliability and correctness of the detection results. Such situation-dependent behavior is essential to applications where reliable and real-time response is the major concern, for which the fixed-length decision window may not suffice. An FLC mechanism is conceived for this purpose.
The control of the decision window size is governed by an FLC, adjusting the window size by estimating the difference of likelihood scores between targeted audio event and normal acoustic background models over a short time-span. The design of the proposed variable-sized DW in audio event detection will be described in detail in the following sections.
An index STLD (Short Timeslot Likelihood Difference) for governing the length of the decision window in the case of two sound models is devised as follows:
$STLD = \left| \sum_{i=1}^{m} \log f(x_i \mid \lambda_1) - \sum_{i=1}^{m} \log f(x_i \mid \lambda_2) \right|$, (7-1)
where $\lambda_1$ and $\lambda_2$ are the sound models in consideration, and $f(x_i \mid \lambda_1)$ and $f(x_i \mid \lambda_2)$ are given by Eq. (2-34), representing the likelihood of frame $x_i$ under the $\lambda_1$ and $\lambda_2$ models, respectively.
The rationale behind Eq. (7-1) is that if, at the beginning stage of a decision window covering, say, m frames, the class inclination of the frames is clearly exhibited, one term in Eq. (7-1) will be substantially greater than the other. As a consequence, a salient STLD value is acquired, indicating that a narrow decision window would suffice. If the class of the m frames cannot be resolved, neither term in Eq. (7-1) dominates, which leads to an insignificant STLD and implies the need for a wider DW in order to collect more frames for classification. Fig. 7.2 illustrates the “phenomenon” implied by Eq. (7-1).
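[Figure: three consecutive decision windows DW_{i−1}, DW_i and DW_{i+1} of different lengths, labeled with STLD_{i−1} (large), STLD_i (small) and STLD_{i+1} (medium).]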
Fig. 7.2. DWs with variable length governed by STLD (Short Timeslot Likelihood Difference) indices.
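A minimal sketch of the STLD computation follows. Two points are assumptions rather than statements from the text: the magnitude of the difference is taken (so that a "salient" value means a large one whichever model wins), and a one-dimensional Gaussian stands in for the GMM likelihood of Eq. (2-34), which is not reproduced here. The function and variable names are illustrative.

```python
import math

def gaussian_logpdf(x, mean, std):
    """Log-density of a 1-D Gaussian, used here as a toy stand-in for a GMM."""
    return -0.5 * math.log(2.0 * math.pi * std * std) - (x - mean) ** 2 / (2.0 * std * std)

def stld(frames, loglik_1, loglik_2, m):
    """Magnitude of the difference between the summed log-likelihoods of the
    first m frames under the two sound models."""
    head = frames[:m]
    total_1 = sum(loglik_1(x) for x in head)
    total_2 = sum(loglik_2(x) for x in head)
    return abs(total_1 - total_2)

def background(x):
    return gaussian_logpdf(x, 0.0, 1.0)

def event(x):
    return gaussian_logpdf(x, 3.0, 1.0)

clear = stld([0.1, -0.2, 0.0, 9.9], background, event, m=3)   # frames near the background model
ambiguous = stld([1.5, 1.5, 1.5], background, event, m=3)     # frames midway between the models
print(clear, ambiguous)
```

In this toy setting the clearly classified frames give a large STLD, suggesting a short window, while the ambiguous frames give an STLD of zero, suggesting a longer one.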
7.2 Decision Windows Governed by an STLD-Driven FLC
As already explained, the STLD index can be used as the key to DW size control and, as a result, an FLC dictated by two IF-THEN fuzzy rules is designed accordingly:
Rule 1: If STLD is small, Then WL is big,
Rule 2: If STLD is big, Then WL is small,
where STLD is the input for the FLC and WL, the window length, is the output of the FLC.
Quantitatively, the FLC rule set is transformed into
Rule 1: If STLD is M1(STLD),
Eqs. (7-3), (7-4) and (7-5) that for STLD ≤ STLD1, WL is solely determined by f1(STLD), simply the case of Rule 1; whereas for STLD ≥ STLD2, WL is determined by f2(STLD) alone, as is the case of Rule 2.
[Figure: membership functions M1(STLD) and M2(STLD), valued between 0 and 1, plotted against STLD with breakpoints at STLD1 and STLD2.]
Fig. 7.3. Membership functions of the STLD-driven FLC.
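The behavior of the two-rule FLC can be illustrated with the sketch below. It assumes a Takagi-Sugeno weighted average, WL = (M1·f1 + M2·f2) / (M1 + M2), with linear consequents f1 = a1·STLD + b1 and f2 = a2·STLD + b2, and piecewise-linear memberships that cross over between STLD1 and STLD2. The exact shapes of M1 and M2 and all numeric values are assumptions chosen for illustration, not the tuned settings of the experiments.

```python
def memberships(stld, stld1, stld2):
    """Assumed memberships: M1 = 1 up to STLD1, M2 = 1 from STLD2 on,
    with a linear crossover in between."""
    if stld <= stld1:
        return 1.0, 0.0
    if stld >= stld2:
        return 0.0, 1.0
    m2 = (stld - stld1) / (stld2 - stld1)
    return 1.0 - m2, m2

def window_length(stld, a1, b1, a2, b2, stld1, stld2):
    """Takagi-Sugeno output: membership-weighted average of two linear consequents."""
    m1, m2 = memberships(stld, stld1, stld2)
    f1 = a1 * stld + b1
    f2 = a2 * stld + b2
    return (m1 * f1 + m2 * f2) / (m1 + m2)

# Toy hyperparameters with STLD1 : STLD2 = 1 : 3 (values are illustrative only).
params = dict(a1=-0.1, b1=4.0, a2=-0.05, b2=2.0, stld1=5.0, stld2=15.0)

for s in (2.0, 10.0, 20.0):
    print(s, window_length(s, **params))   # window length shrinks as STLD grows
```

Below STLD1 only Rule 1 fires and above STLD2 only Rule 2 fires, which matches the two limiting cases described in the text; in between, the output interpolates between the two consequents.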
The FLC now has six hyper-parameters (a1, a2, b1, b2, STLD1 and STLD2) to be fixed, for which an iterative process is devised as follows
STEP 1: Let STLD1 : STLD2 = 1 : 3 and give an initial value to STLD1 in the experiment.
a1 = initial value; b1 = 0; k = 0;
F_0 = event_detection_rate(WL = a1·STLD + b1, training_database);
STEP 2: Estimate the parameters a1 and b1 under the condition STLD ≤ STLD1, wherein M1(STLD) = 1, M2(STLD) = 0, and
$WL = \frac{M_1(STLD)\, f_1(STLD)}{M_1(STLD)} = f_1(STLD) = a_1 \cdot STLD + b_1$,
by using the following pseudo-code sequence, in which Δ denotes the tuning step:
a1 = a1 + Δ; k++;
F_k = event_detection_rate(WL = a1·STLD + b1, training_database);
if (F_k > F_{k−1})
  Repeat
  { a1 = a1 + Δ; k++;
    F_k = event_detection_rate(WL = a1·STLD + b1, training_database);
  } while (F_k > F_{k−1});
else
  Repeat
  { a1 = a1 − Δ; k++;
    F_k = event_detection_rate(WL = a1·STLD + b1, training_database);
  } while (F_k > F_{k−1});
b1 = b1 + Δ; k++;
F_k = event_detection_rate(WL = a1·STLD + b1, training_database);
if (F_k > F_{k−1})
  Repeat
  { b1 = b1 + Δ; k++;
    F_k = event_detection_rate(WL = a1·STLD + b1, training_database);
  } while (F_k > F_{k−1});
else
  Repeat
  { b1 = b1 − Δ; k++;
    F_k = event_detection_rate(WL = a1·STLD + b1, training_database);
  } while (F_k > F_{k−1});
return F_k;
In the pseudo-code sequence, the rate of correct detection returned by event_detection_rate(WL, ·) is defined as the best recognition rate too.
STEP 5: Update STLD1 such that STLD1 : STLD2 = 1 : 3. Repeat from STEP 2 until the settings of a1, a2, b1, b2, STLD1 and STLD2 cannot further improve the system performance over the training dataset.
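The iterative process amounts to a coordinate-wise hill climb: one parameter at a time is stepped up while the detection rate improves, or stepped down if the first upward step does not help. The sketch below illustrates that idea under stated assumptions: `toy_detection_rate` is a hypothetical stand-in for the detection rate measured on the training database, the step sizes are arbitrary, and, unlike the pseudo-code, only improving moves are kept.

```python
def climb_one(params, name, step, score):
    """Hill-climb a single parameter: try increasing it and, if the first
    increase does not help, try decreasing it. Only improving moves are kept."""
    best = score(params)
    for direction in (+1, -1):
        moved = False
        while True:
            trial = dict(params)
            trial[name] = params[name] + direction * step
            trial_score = score(trial)
            if trial_score <= best:
                break
            params, best, moved = trial, trial_score, True
        if moved:
            break
    return params, best

def tune(params, steps, score, max_sweeps=100):
    """Sweep over all parameters until a full sweep brings no improvement."""
    best = score(params)
    for _ in range(max_sweeps):
        before = best
        for name, step in steps.items():
            params, best = climb_one(params, name, step, score)
        if best <= before:
            break
    return params, best

def toy_detection_rate(p):
    # Hypothetical objective that peaks at a1 = -0.3, b1 = 4.0.
    return 100.0 - (p["a1"] + 0.3) ** 2 - (p["b1"] - 4.0) ** 2

start = {"a1": 0.0, "b1": 3.0}
tuned, rate = tune(start, {"a1": 0.1, "b1": 0.1}, toy_detection_rate)
print(tuned, rate)
```

Such a search only finds a local optimum for the chosen step sizes, which is consistent with the stopping criterion of repeating until no further improvement is obtained.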
7.3 Experiments
The experiments were to detect female screaming in three environments of different acoustic backgrounds: the office space, the parking lot and the living room.
7.3.1 Experiment Designs
In the training phase, three GMM models for “office space”, “parking lot” and “living room” were built as backgrounds using a 10-minute recording in each environment. The recording was made at an 8 kHz sampling rate, from which LPC, LPCC and MFCC features were extracted for each 20-ms frame (i.e., 160 samples). Note that a 12-D LPC, a 12-D LPC/mel cepstrum and a 12-D delta cepstrum were utilized. Three GMM models for “female screaming”, one for each of the three environments, were also built using two-thirds of a 180-second recording (60 sec. for each environment) from each of a group of 15 female subjects, from which the same set of 3 acoustic features was extracted; the subjects were requested to scream in every possible way they could during the recording.
The remaining one-third of the screaming data (20 sec. for each environment, 900 sec. in total for all 15 females in all three environments) was used for FLC parameter tuning as previously described.
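The data budget stated above can be verified with a few lines of arithmetic. The variable names are illustrative, and the frame count per tuning clip assumes non-overlapping 20-ms frames, which the text does not state explicitly.

```python
SAMPLE_RATE_HZ = 8000
FRAME_MS = 20
N_SUBJECTS = 15
N_ENVIRONMENTS = 3
SEC_PER_ENVIRONMENT = 60            # per subject

samples_per_frame = SAMPLE_RATE_HZ * FRAME_MS // 1000
train_sec = SEC_PER_ENVIRONMENT * 2 // 3          # two-thirds for the screaming GMMs
tune_sec = SEC_PER_ENVIRONMENT - train_sec        # remaining third for FLC tuning
total_tune_sec = tune_sec * N_ENVIRONMENTS * N_SUBJECTS

# Assumption: frames do not overlap, so a clip of s seconds yields s * 1000 / FRAME_MS frames.
frames_per_tune_clip = tune_sec * 1000 // FRAME_MS

print(samples_per_frame, train_sec, tune_sec, total_tune_sec, frames_per_tune_clip)
```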
In the event detection testing phase, an entirely new group of 15 females was recruited for the screaming recording of 60 sec. each (20 sec. for each of the three environments).
7.3.2 Experiment Results
During the testing phase, the GMM classifier with the proposed FLC-regulated DW was used to detect audio events occurring in a background audio stream of 15 minutes in length. Three experiments were conducted in “office space”, “parking lot” and “living room” respectively, and several observations on the effectiveness of the proposed approach are tabulated for comparison, as summarized below.
(1) Table 7.1 shows that, using LPC alone, the approach exploiting a variable-sized DW governed by the FLC achieves an average accuracy of 92%, 93.5% and 95% for event detection in the living room, the parking lot and the office space respectively, where the window size varies between Wmin and Wmax, with an average of Wavg. With LPC alone, Table 7.2 shows the performance of the fixed-length DW scheme with a variety of fixed DW settings, from 0.5 sec. to 3 sec. at an increment of 0.5 sec. plus 5 sec., and in no case does the accuracy exceed the corresponding score in Table 7.1. It is further noted that, against the variable-sized DW, the fixed DW reaches competitive scores of 91% at 3-sec. WL, 93.33% at 2.5-sec. WL and 95% at 1.5-sec. WL in the living room, the parking lot and the office space respectively; these DW settings fall within or close to the corresponding ranges of DW variation [Wmin, Wmax] associated with the FLC-regulated DW operation.
(2) Similar observations are made in the case of using LPCC alone, as shown in Tables 7.3 and 7.4.
(3) Tables 7.5 and 7.6 present the experiment results in the case of using the MFCC feature, with the same observations.
(4) In the experiment, the noisiest background to the ear is the living room (where family members conversed while children chased and played around, with the TV set turned on loud), followed by the parking lot and then the office space. Such a phenomenon seems to be reflected by the range of WL variation, WR = [Wmin, Wmax], when the STLD-driven FLC operated in the three contexts. To be specific, WR (office space) < WR (parking lot) < WR (living room), regardless of which of the three acoustic features is used.
(5) For all the tests in the 3 backgrounds, MFCC leads to the best performance in audio event detection, LPCC the second and LPC the third, regardless of which control scheme on DW size is adopted, as shown in Figs. 7.4, 7.5 and 7.6.
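Observation (4) can be checked against the Wmin and Wmax values reported in Tables 7.1, 7.3 and 7.5. The sketch below reads the ordering of WR in two ways, by the width of the interval and by its midpoint, since the text does not say which is meant; the dictionary layout and function names are illustrative.

```python
# (Wmin, Wmax) in seconds, taken from Tables 7.1 (LPC), 7.3 (LPCC) and 7.5 (MFCC).
WINDOW_RANGES = {
    "LPC":  {"living room": (3.12, 3.96), "parking lot": (2.23, 2.88), "office space": (1.12, 1.58)},
    "LPCC": {"living room": (3.18, 3.98), "parking lot": (2.31, 2.92), "office space": (1.15, 1.63)},
    "MFCC": {"living room": (3.15, 3.92), "parking lot": (2.18, 2.91), "office space": (1.17, 1.68)},
}

def width(rng):
    return rng[1] - rng[0]

def midpoint(rng):
    return (rng[0] + rng[1]) / 2.0

def ordering_holds(ranges, measure):
    """True if office space < parking lot < living room under the given measure."""
    return (measure(ranges["office space"])
            < measure(ranges["parking lot"])
            < measure(ranges["living room"]))

for feature, ranges in WINDOW_RANGES.items():
    print(feature, ordering_holds(ranges, width), ordering_holds(ranges, midpoint))
```

For the tabulated values the ordering holds under both readings and for all three features.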
Table 7.1. Event detection by an FLC-regulated DW, using only LPC feature.
Variable-sized DW   Living room   Parking lot   Office space
Wmin                3.12 sec.     2.23 sec.     1.12 sec.
Wmax                3.96 sec.     2.88 sec.     1.58 sec.
Wavg                3.55 sec.     2.56 sec.     1.33 sec.
Average accuracy    92.00%        93.50%        95.00%
Table 7.2. Event detection by fixed-length DW, using only LPC feature.
DW length   Living room   Parking lot   Office space
0.5 sec.    80.83%        83.33%        91.67%
1 sec.      81.67%        86.00%        93.33%
1.5 sec.    84.00%        87.00%        95.00%
2 sec.      86.00%        91.33%        94.67%
2.5 sec.    89.33%        93.33%        95.00%
3 sec.      91.00%        93.00%        95.00%
5 sec.      91.67%        93.33%        95.00%
Average     86.36%        89.62%        94.24%
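The Average row of Table 7.2 and its comparison with Table 7.1 can be reproduced directly from the tabulated values. The sketch below is only a consistency check on the reported numbers; the data structures and names are illustrative.

```python
# Accuracy (%) per fixed DW length from Table 7.2 (LPC feature),
# for DW lengths of 0.5, 1, 1.5, 2, 2.5, 3 and 5 sec.
FIXED_DW_LPC = {
    "living room":  [80.83, 81.67, 84.00, 86.00, 89.33, 91.00, 91.67],
    "parking lot":  [83.33, 86.00, 87.00, 91.33, 93.33, 93.00, 93.33],
    "office space": [91.67, 93.33, 95.00, 94.67, 95.00, 95.00, 95.00],
}
# Average accuracy (%) of the FLC-regulated DW from Table 7.1 (LPC feature).
VARIABLE_DW_LPC = {"living room": 92.00, "parking lot": 93.50, "office space": 95.00}

def column_average(values):
    return round(sum(values) / len(values), 2)

averages = {env: column_average(vals) for env, vals in FIXED_DW_LPC.items()}
best_fixed = {env: max(vals) for env, vals in FIXED_DW_LPC.items()}

print(averages)     # matches the Average row of Table 7.2
print(best_fixed)   # best fixed-DW score per environment

# No fixed setting exceeds the variable-sized DW in any of the three environments.
never_exceeds = all(best_fixed[env] <= VARIABLE_DW_LPC[env] for env in FIXED_DW_LPC)
print(never_exceeds)
```

The best fixed-length setting ties the variable-sized DW in the office space and stays below it in the other two environments.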
Table 7.3. Event detection by an FLC-regulated DW, using only LPCC feature.
Variable-sized DW   Living room   Parking lot   Office space
Wmin                3.18 sec.     2.31 sec.     1.15 sec.
Wmax                3.98 sec.     2.92 sec.     1.63 sec.
Wavg                3.57 sec.     2.61 sec.     1.36 sec.
Average accuracy    93.50%        95.00%        97.00%
Table 7.4. Event detection by fixed-length DW, using only LPCC feature.
DW length   Living room   Parking lot   Office space
0.5 sec.    83.67%        87.50%        92.50%
1 sec.      87.67%        90.00%        94.67%
1.5 sec.    89.00%        90.50%        96.50%
2 sec.      90.67%        93.33%        96.67%
2.5 sec.    91.67%        95.00%        96.67%
3 sec.      93.00%        95.00%        96.00%
5 sec.      93.33%        95.00%        96.67%
Average     89.86%        92.33%        95.67%
Table 7.5. Event detection by an FLC-regulated DW, using only MFCC feature.
Variable-sized DW   Living room   Parking lot   Office space
Wmin                3.15 sec.     2.18 sec.     1.17 sec.
Wmax                3.92 sec.     2.91 sec.     1.68 sec.
Wavg                3.52 sec.     2.55 sec.     1.41 sec.
Average accuracy    95.00%        98.50%        98.50%
Table 7.6. Event detection by fixed-length DW, using only MFCC feature.
DW length   Living room   Parking lot   Office space
0.5 sec.    84.50%        88.33%        93.33%
1 sec.      88.33%        90.67%        95.33%
1.5 sec.    90.50%        91.50%        98.00%
2 sec.      91.33%        94.67%        98.00%
2.5 sec.    92.50%        98.33%        98.33%
3 sec.      94.00%        98.00%        98.00%
5 sec.      95.00%        98.33%        98.33%
Average     90.88%        94.26%        97.05%
Fig. 7.4. Living room audio event detection.
Fig. 7.5. Parking lot audio event detection.
Fig. 7.6. Office space audio event detection.
Chapter 8
Conclusions and Future Works
In the following, the major contributions of the author’s work and some findings and observations from the experiment results with FLC mechanisms are briefly summarized.
In addition, some plausible future developments along the line of the current research are also mentioned.
8.1 HMM Speaker Adaptation with FLC
The quality of HMM speaker adaptation relies greatly on the amount of adaptation data acquired from the new speaker, be it MAP or MLLR adaptation. It is desirable that the adaptation of the prior distributions of the HMM by either the MAP or the MLLR estimate be restricted when the adaptation data are limited and be applied fully when the opposite occurs.
During the MAP estimate and the associated VFS process that follows up, the adaptation is governed by
k
respectively in their original forms.
The author thus introduces FLC mechanism for the tuning of and f based on the following considerations.
and f 1 should be depressed in a certain way when ample adaptation data is at hand, and be enhanced otherwise, which is expected to adapt the HMM model without deteriorating the recognition performance even when the acquired data from the speaker is scarce.