Database and Experiment Design - FLC-MLLR Adaptation (FCMAP-Like in Form)

6. Incremental MLLR Speaker Adaptation by Fuzzy Logic Control

6.2 FLC-MLLR Adaptation (FCMAP-Like in Form)

6.3.1 Database and Experiment Design

The experiments involve (1) the establishment of the initial SI models, (2) a training phase for fixing the hyperparameters of the FLC, and (3) a recognition phase for evaluating the performance of the weight tuning by the FLC (FLC-MLLR) described in Section 6.2.

An 8 kHz sampling rate was set for speech signal acquisition. The analysis frames were 30-ms wide with a 20-ms overlap. For each frame, a 24-dimensional feature vector was extracted, which was made up of a 12-dimensional mel-cepstral vector and a 12-dimensional delta-mel-cepstral vector.
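As a quick check of the framing arithmetic, the sketch below converts the stated analysis settings into sample counts. The function name `frame_layout` and the one-second signal length are illustrative assumptions, not part of the original experiments.

```python
def frame_layout(sample_rate_hz, frame_ms, overlap_ms, n_samples):
    """Return (frame_len, hop, n_frames), all counted in samples/frames."""
    frame_len = sample_rate_hz * frame_ms // 1000            # samples per frame
    hop = sample_rate_hz * (frame_ms - overlap_ms) // 1000   # frame shift
    if n_samples < frame_len:
        return frame_len, hop, 0
    # Count only frames that fit completely inside the signal.
    n_frames = 1 + (n_samples - frame_len) // hop
    return frame_len, hop, n_frames


# 8 kHz sampling, 30-ms frames, 20-ms overlap, one second of speech (assumed).
frame_len, hop, n_frames = frame_layout(8000, 30, 20, 8000)
print(frame_len, hop, n_frames)
```

With these settings each frame spans 240 samples and consecutive frames are shifted by 80 samples (10 ms), so every 10 ms of speech yields one 24-dimensional feature vector.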

The initial models, which served as the speaker-independent (SI) models, were constructed from the MAT400 sub-database DB3 [119]. The initial SI models, a set of HMM parameters, were established in exactly the same way as in the adaptation experiments of Section 4.2.1.

The data for tuning the hyperparameters of the FLC were collected from 15 speakers in the training phase. Each speaker provided 10 utterances of city names (picked among 30 cities) as adaptation data and 60 utterances covering all 30 cities (two utterances each) as FLC parameter tuning data; all utterances were recorded with an ordinary microphone. For readability and clarity, the procedure of the training phase is described in the pseudo-code sequence below.



$F^{0} \leftarrow$ baseline recognition rate; t = 0;

Repeat

{ t++;

$F_{2}^{t} \leftarrow$ 2_utterances_training (SI_models, hyperparameters);

$F_{4}^{t} \leftarrow$ 4_utterances_training (SI_models, hyperparameters);

$F_{6}^{t} \leftarrow$ 6_utterances_training (SI_models, hyperparameters);

$F_{8}^{t} \leftarrow$ 8_utterances_training (SI_models, hyperparameters);

$F_{10}^{t} \leftarrow$ 10_utterances_training (SI_models, hyperparameters);

$F^{t} \leftarrow \frac{1}{5} \sum_{i=1}^{5} F_{2i}^{t}$;

$\Delta F^{t} \leftarrow F^{t} - F^{t-1}$;

} until $\Delta F^{t}$ < threshold;

where 2i_utterances_training(．), i 1, 2, 3, 4, 5 is the procedure using 2i adaptation utterances from 15 speakers for fixing the 9 hyperparameters of FLC defined in Section 6.2 and thus returning a better-than-baseline overall recognition rate F₂^t__i for the 15 training speakers, as is explained in the code-like sequence below.

2i_utterances_training (SI_models, hyperparameters) // i = 1, 2, 3, 4, 5.

{ k = 0;

$F_{2i}^{0} \leftarrow$ baseline recognition rate;

Repeat { k++;

$F_{(2i)1}^{k} \leftarrow$ speaker_training (SI_models, test_data1, hyperparameters, 2i_utterances1);

...

$F_{(2i)j}^{k} \leftarrow$ speaker_training (SI_models, test_dataj, hyperparameters, 2i_utterancesj);

...

$F_{(2i)15}^{k} \leftarrow$ speaker_training (SI_models, test_data15, hyperparameters, 2i_utterances15);

$F_{2i}^{k} \leftarrow \frac{1}{15} \sum_{j=1}^{15} F_{(2i)j}^{k}$;

$\Delta F_{2i} \leftarrow F_{2i}^{k} - F_{2i}^{k-1}$; } until $\Delta F_{2i}$ < threshold 1;

return $F_{2i}^{k}$; };

where 2i _utterancesj and test_dataj denote respectively the adaptation utterances in the number of 2, 4, 6, 8 and 10 for MLLR estimate of W_s and the 60 test utterances from the jth speaker, 1 j15, for the tuning of the 9 hyperparameters in the proposed FLC mechanism.

The procedure speaker_training(·) incrementally adapts the SI models through appropriate settings of the hyperparameters of the T-S FLC, as already described in Section 6.2, such that the adaptation does not jeopardize the recognition rate, given 2i utterances.

speaker_training (SI_models, test_dataj, hyperparameters, 2i_utterancesj) // j = 1, …, 15.

{

Estimation_of_W_s (2i_utterancesj);

$F_{(2i)j} \leftarrow$ Iterative_process (SI_models, test_dataj, W_s, hyperparameters);

// as described in Section 6.2 for maximizing the recognition rate $F_{(2i)j}$.

return $F_{(2i)j}$; };

As a result, a set of FLC hyperparameters $\{a_1, a_2, a_3, b_1, b_2, b_3, N_1, N_2, N_3\}$ was determined.
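The stopping logic of the training phase can be illustrated with a minimal Python sketch. It is not the original implementation: `tune_until_converged` and the `evaluate` callback are hypothetical names, the callback merely stands in for the per-utterance-count training procedures, and averaging the five rates into one overall figure is an assumption.

```python
def tune_until_converged(evaluate, baseline_rate, threshold, max_rounds=100):
    """Repeat tuning rounds until the gain in the overall rate is below threshold.

    `evaluate(n_utt)` is a hypothetical callback: it performs one more tuning
    round with n_utt adaptation utterances and returns the recognition rate.
    """
    previous = baseline_rate
    history = []
    for _ in range(max_rounds):
        rates = [evaluate(n_utt) for n_utt in (2, 4, 6, 8, 10)]
        current = sum(rates) / len(rates)   # averaging is an assumption
        history.append(current)
        if current - previous < threshold:  # improvement too small: stop
            break
        previous = current
    return history


# Toy run: the rates of each round are scripted instead of being measured.
scripted = iter([0.90] * 5 + [0.93] * 5 + [0.935] * 5)
history = tune_until_converged(lambda n_utt: next(scripted), 0.85, 0.01)
print(len(history), "rounds")
```

The `max_rounds` guard is an addition for safety; the procedure as described stops on the threshold test alone.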

In the recognition phase, a group of 15 speakers entirely different from the previous group was recruited, and each speaker was again asked for 10 utterances for adaptation and 60 for recognition. The weight is calculated for adaptation by using the hyperparameters acquired in the training stage. For comparison, full transformation matrices were used for standard MLLR, MAPLR and the proposed FLC-MLLR. Since the amount of adaptation data was very small, only one common regression matrix tying all states was used for MLLR to make the most efficient use of the available adaptation data; MAPLR and FLC-MLLR likewise each used a single regression matrix. The prior densities required by MAPLR were derived directly from the SI models alone. For the recognition experiment with FLC-MLLR adaptation, five adapted models were constructed using 2, 4, 6, 8 and 10 adaptation utterances from each of the 15 speakers; the weight for each of the five adaptations was calculated by Eq. (6-4) with N_utterances = 2, 4, 6, 8 or 10 and the FLC hyperparameters determined in the training phase. Five MLLR-adapted and five MAPLR-adapted models, using 2, 4, 6, 8 and 10 adaptation utterances respectively, were also constructed for performance comparison. Then the 60 utterances from each of the 15 speakers were fed into the adapted models for evaluation of the respective recognition rates.

6.3.2 Experiment Results

During the training phase, some experiment results and observations were acquired. The weight decreases as the number of adaptation utterances increases. As depicted in Fig. 6.6, it drops by a noticeable step when the number of utterances increases from 2 to 4, and then declines gradually, becoming somewhat stable, as the number of utterances increases further.

The tendency of the weight curve in FLC-MLLR is very similar to the one derived from the precursory investigation (Fig. 6.3). Both decrease quickly until the number of adaptation utterances reaches 4 and then fall progressively to a stable value around 0.47 as more adaptation utterances become available.

Fig. 6.6. The curve of the training values of the weight in FLC-MLLR adaptation.

In addition, recognition performance with various numbers of adaptation utterances was compared among the proposed FLC-MLLR utilizing a T-S FLC, the conventional MLLR, which does not exploit prior knowledge of the initial SI model, and MAPLR with the prior density derived directly from the initial SI model alone. As shown in Fig. 6.7, FLC-MLLR is better than MAPLR and MLLR in all cases, especially when the training data are quite limited. It is also worth noting that the performance of MLLR falls below the baseline when only 2 utterances are available for adaptation, indicating that an improper adaptation may be worse than none at all. All three methods show an improved recognition rate, and MAPLR tends to catch up with FLC-MLLR, as the amount of training data increases.

[Figure 6.7: recognition rate (%), axis range 82 to 96, versus the number of utterances for adaptation (2, 4, 6, 8, 10); curves for FLC-MLLR, MAPLR, MLLR and the baseline.]

Fig. 6.7. The performance curves of FLC-MLLR, MAPLR and conventional MLLR in the recognition testing experiments with different amounts of adaptation data.

Finally, the effects of varying the weight on the recognition performance of MLLR under the two extreme cases of training data availability were also observed, as shown in Fig. 6.8 and Fig. 6.9 respectively. The former shows that when the training data are scarce (2 utterances, say), the performance falls below the baseline if the weight is a bit less than 0.5, because the model adaptation is then largely determined by the transformation matrix $W_s$, which is very likely poorly estimated. As the weight increases, the influence of $W_s$ on the adaptation is reduced and the recognition rate improves, as expected. However, when the weight goes beyond 0.5, the performance degrades, as if the system in a sense ceased to adapt. On the other hand, when the training data are sufficient (10 utterances, for instance), full advantage of the adaptation by $W_s$ should be taken by using a small weight value for good performance, as depicted in Fig. 6.9.

Fig. 6.8. Number of adaptation utterances = 2 (MLLR testing experiments).

Fig. 6.9. Number of adaptation utterances = 10 (MLLR testing experiments).
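The trade-off just described can be made concrete with a small sketch. It assumes that the adapted mean is a convex combination of the SI mean and its MLLR-transformed counterpart, with the weight multiplying the SI mean; this is consistent with the observation that a larger weight reduces the influence of $W_s$, but the exact form of Eq. (6-1) should be taken from Section 6.2. The name `interpolate_mean` and the numbers are purely illustrative.

```python
def interpolate_mean(si_mean, mllr_mean, weight):
    """Blend the SI mean with the MLLR-transformed mean (assumed convex form).

    weight = 1 keeps the SI mean (no adaptation);
    weight = 0 relies on the MLLR estimate alone.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return [weight * m + (1.0 - weight) * a for m, a in zip(si_mean, mllr_mean)]


si_mean = [1.0, 2.0]
mllr_mean = [3.0, 0.0]   # stands in for W_s applied to the extended mean vector
for w in (0.0, 0.5, 1.0):
    print(w, interpolate_mean(si_mean, mllr_mean, w))
```

With scarce data a poorly estimated `mllr_mean` dominates when the weight is small, whereas a weight close to 1 leaves the model essentially unadapted; both extremes match the behavior reported for Fig. 6.8.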

As the final observation, the computing cost of FLC-MLLR involves the computation of the weight and of $W_s$. Computing $W_s$ is the same as in the standard MLLR estimate.

The overhead of finding the weight, in terms of the number of multiplications, can be analyzed through its computation defined by Eq. (6-4).

For $N_1 \le N \le N_2$,

Thus the computation of Eq. (6-1) is of the same order as that of Eq. (2-22), given that $W_s$ is estimated by MLLR.

6.4 Fuzzy Mechanisms for the Context of Multiple Regression Classes

Whenever appropriate, the acoustic model space can be partitioned into a number of subspaces, each being a base class as referred to in related works. In such a context, a transformation matrix is derived for each base class for which in-class adaptation data are available, so that a component in the class can be adapted accordingly.

Gales proposed a fuzzy clustering scheme [111] for determining the weight $\rho_p$ in the adaptation below, which is essentially a linear combination of the MLLR transformations by the matrices associated with every regression class:

$$\hat{\mu}_s = \sum_{p=1}^{P} \rho_p \left( W_p\, \xi_s \right), \qquad \text{(6-5)}$$

where $W_p$ is the transformation matrix associated with regression class p and $\rho_p$ represents the degree to which $\mu_s$ belongs to the regression class p.

Note that the role and purpose of the fuzzy techniques in Gales’s work are completely different from those of the FLC mechanism for tuning the weight in FLC-MLLR herein.

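Interestingly enough, Eq. (6-5) could be extended as

$$\tilde{\mu}_s = \sum_{p=1}^{P} \rho_p \left[ \alpha\, \mu_s + (1 - \alpha) \left( W_p\, \xi_s \right) \right], \qquad \text{(6-6)}$$

where $\rho_p$ is the class weight of Eq. (6-5) and $\alpha$ is the weight tuned by the FLC. This reduces to the form of Eq. (6-1) in the context of one regression class adaptation (i.e., P = 1), as is the case considered in the dissertation.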
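A minimal sketch of the multi-class combination follows, under stated assumptions: each regression class contributes its own MLLR-transformed mean, the class weights sum to one, the extended mean vector is formed by prepending 1 to the mean, and the FLC-tuned weight blends each transformed mean with the SI mean. The names `combined_mean`, `class_weights` and `flc_weight` and all numbers are hypothetical.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def combined_mean(si_mean, class_matrices, class_weights, flc_weight):
    """Weighted combination over regression classes (assumed form)."""
    extended = [1.0] + list(si_mean)        # assumed extended mean [1, mu]
    out = [0.0] * len(si_mean)
    for matrix, rho in zip(class_matrices, class_weights):
        transformed = matvec(matrix, extended)
        for d, (m, a) in enumerate(zip(si_mean, transformed)):
            out[d] += rho * (flc_weight * m + (1.0 - flc_weight) * a)
    return out


si_mean = [1.0, 2.0]
w1 = [[0.5, 1.0, 0.0], [0.5, 0.0, 1.0]]     # shifts the mean by 0.5
w2 = [[0.0, 2.0, 0.0], [0.0, 0.0, 2.0]]     # doubles the mean
print(combined_mean(si_mean, [w1, w2], [0.5, 0.5], 0.5))
print(combined_mean(si_mean, [w1], [1.0], 0.5))   # single-class case
```

With a single class of weight 1, the result is the plain blend of the SI mean and one transformed mean, which is the single-regression-class situation considered in this chapter.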

Chapter 7 Audio Event Detection Using Variable-Length Decision Windows

The detection of female screaming in three environments with different acoustic backgrounds was used in this research to examine the behavior of an FLC-regulated mechanism embedded in an audio event detection system for controlling the decision window length.

A typical audio event detection process feeds the stream of audio frames (that is, vectors of extracted acoustic features) into the event classifier, which analyzes a pre-determined number of successive frames and then decides whether an audio event is detected over the associated time span, the so-called decision window (DW) mentioned in Section 2.3.2. Fig. 7.1 depicts a stream of fixed-length decision windows, each of which covers the same number of audio frames and is thus of the same time span.

1

Ti T_i T_i_₁

) 1 (

1  



_i  _n _i

n f

 f



 



_(i ₁₎₁

fn f_n__(i_₁₎_₁ 



 

frames acoustic

of Stream

…

1

DWi DW_i DW_i_₁

…

Fig. 7.1. DW with fixed-length, each covering the same number of audio frames, n, over the time span.

As is clearly seen in Fig. 7.1, for a fixed-length DW covering n audio frames, each of a t-ms time interval, the process makes an event detection decision every nt ms, regardless of whether the auditory situation in the context is calm or tense. A DW that is too long compromises real-time response, which is essential to all surveillance and security applications, whereas one that is too short is prone to false alarms triggered by sudden or intermittent acoustic changes in the background, which is equally undesirable.

7.1 Concepts of Short Timeslot Likelihood Difference (STLD)

The idea of a variable-sized DW thus arises and is the core of the audio event detection system proposed in this dissertation. The decision window should be short in a somewhat “aurally hot” situation, so that event detection decisions can be made at a higher rate, and should be stretched at “aurally calm” moments to collect more audio frames and so ensure the reliability and correctness of the detection results. Such situation-dependent behavior is essential to applications in which reliable, real-time response is the major concern and for which a fixed-length decision window may not suffice. An FLC mechanism is conceived for this purpose.

The decision window size is governed by an FLC, which adjusts the window size by estimating the difference between the likelihood scores of the targeted audio event model and the normal acoustic background model over a short time span. The design of the proposed variable-sized DW in audio event detection is described in detail in the following sections.

An index STLD (Short Timeslot Likelihood Difference) for governing the length of the decision window in the case of two sound models is devised as follows:

$$\mathrm{STLD} = \sum_{i=1}^{m} \log f(x_i \mid \lambda_1) - \sum_{i=1}^{m} \log f(x_i \mid \lambda_2), \qquad \text{(7-1)}$$

where $\lambda_1$ and $\lambda_2$ are the sound models in consideration, and $f(x_i \mid \lambda_1)$ and $f(x_i \mid \lambda_2)$ are given by Eq. (2-34), representing the likelihood of frame $x_i$ under the $\lambda_1$ and $\lambda_2$ model classification, respectively.

The rationale behind Eq. (7-1) is that if, at the beginning stage of a decision window covering m frames, say, the class inclination of the frames is clearly exhibited, one term in Eq. (7-1) will be substantially greater than the other. As a consequence, a salient STLD value is acquired, indicating that a narrow decision window would suffice. If the class of the m frames cannot be resolved, both terms in Eq. (7-1) would be trivial and lead to an insignificant STLD, implying the need for a wider DW in order to collect more frames for classification. Fig. 7.2 illustrates the “phenomenon” implied by Eq. (7-1).

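[Figure 7.2: consecutive decision windows $DW_{i-1}$, $DW_i$, $DW_{i+1}$ of different lengths, with the corresponding indices $STLD_{i-1}$ (large), $STLD_i$ (small) and $STLD_{i+1}$ (medium).]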

Fig. 7.2. DWs with variable length governed by STLD (Short Timeslot Likelihood Difference) indices.
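To make the index concrete, the following sketch computes a likelihood difference over the first m frames of a window. It is only an illustration: scalar features and a single Gaussian per sound model are assumed in place of the likelihood of Eq. (2-34), and the function names and model parameters are hypothetical.

```python
import math


def gaussian_log_pdf(x, mean, var):
    """Log-density of a scalar Gaussian (stand-in for the model likelihood)."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def stld(frames, model_1, model_2):
    """Difference of the summed log-likelihoods of the frames under two models."""
    ll_1 = sum(gaussian_log_pdf(x, *model_1) for x in frames)
    ll_2 = sum(gaussian_log_pdf(x, *model_2) for x in frames)
    return ll_1 - ll_2


event_model = (5.0, 1.0)        # (mean, variance), assumed values
background_model = (0.0, 1.0)

clear_frames = [4.8, 5.1, 5.3]       # close to the event model
ambiguous_frames = [2.4, 2.6, 2.5]   # halfway between the two models
print(stld(clear_frames, event_model, background_model))
print(stld(ambiguous_frames, event_model, background_model))
```

Frames that clearly favor one model give a large difference, which calls for a short window, while frames that neither model explains better give a difference near zero, which calls for a longer one.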

7.2 Decision Windows Governed by an STLD-Driven FLC

As already explained, the STLD index can be used as the key to DW size control and, as a result, an FLC dictated by two IF-THEN fuzzy rules is designed accordingly:

Rule 1: If STLD is small, Then WL is big,

Rule 2: If STLD is big, Then WL is small,

where STLD is the input of the FLC and WL, the window length, is its output.

Quantitatively, the FLC rule set is transformed into

Rule 1: If STLD is M₁(STLD),

Eqs. (7-3), (7-4) and (7-5) that for $\mathrm{STLD} \le \mathrm{STLD}_1$, WL is solely determined by $f_1(\cdot)$, simply the case of Rule 1; whereas for $\mathrm{STLD} \ge \mathrm{STLD}_2$, WL is determined by $f_2(\cdot)$ alone, as is the case of Rule 2.

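[Figure 7.3: membership functions $M_1(\mathrm{STLD})$ and $M_2(\mathrm{STLD})$, valued between 0 and 1, plotted against STLD with breakpoints at $\mathrm{STLD}_1$ and $\mathrm{STLD}_2$.]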

Fig. 7.3. Membership functions of the STLD-driven FLC.
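A two-rule controller of this kind can be sketched as follows. The sketch assumes linear ramps between the two breakpoints for the membership functions, linear consequents of the form a·STLD + b for both rules, and the usual T-S weighted average of the rule outputs; the function names and all parameter values are hypothetical.

```python
def memberships(stld, stld_1, stld_2):
    """Return (M1, M2): degrees of 'STLD is small' and 'STLD is big'."""
    if stld <= stld_1:
        return 1.0, 0.0
    if stld >= stld_2:
        return 0.0, 1.0
    m2 = (stld - stld_1) / (stld_2 - stld_1)   # linear ramp (assumed shape)
    return 1.0 - m2, m2


def window_length(stld, a1, b1, a2, b2, stld_1, stld_2):
    """Window length as the membership-weighted average of two linear rules."""
    m1, m2 = memberships(stld, stld_1, stld_2)
    f1 = a1 * stld + b1    # consequent of Rule 1 (small STLD, long window)
    f2 = a2 * stld + b2    # consequent of Rule 2 (big STLD, short window)
    return (m1 * f1 + m2 * f2) / (m1 + m2)


params = dict(a1=-2.0, b1=100.0, a2=-0.5, b2=40.0, stld_1=10.0, stld_2=30.0)
for value in (5.0, 20.0, 40.0):
    print(value, window_length(value, **params))
```

Below the first breakpoint only Rule 1 fires and above the second only Rule 2 fires, so the six parameters can be tuned region by region.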

The FLC now has six hyperparameters ($a_1$, $a_2$, $b_1$, $b_2$, $\mathrm{STLD}_1$ and $\mathrm{STLD}_2$) to be fixed, for which an iterative process is devised as follows:

STEP 1: Let $\mathrm{STLD}_1 : \mathrm{STLD}_2 = 1 : 3$ and give an initial value to $\mathrm{STLD}_1$ in the experiment.

$a_1$ = initial value; $b_1$ = 0; k = 0;

$F^{0}$ = event_detection_rate ($WL = a_1 \cdot \mathrm{STLD} + b_1$, training_database);

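STEP 2: Estimate the parameters $a_1$ and $b_1$ under the condition $\mathrm{STLD} \le \mathrm{STLD}_1$, wherein $M_1(\mathrm{STLD}) = 1$, $M_2(\mathrm{STLD}) = 0$, and

$$WL = \frac{M_1(\mathrm{STLD})\, f_1(\mathrm{STLD})}{M_1(\mathrm{STLD})} = f_1(\mathrm{STLD}) = a_1 \cdot \mathrm{STLD} + b_1,$$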

by using the following pseudo-code sequence:

1 a

a  ; k ++;