Decision Window of the Classifier - Audio Event Detection

2. Overview on Speech Recognition and Audio Event Detection

2.3 Audio Event Detection

However, in practical implementations, Eq. (2-33) is replaced by





At the end of the recognition procedure, the signal Χ is then classified as one of the two sound classes, as indicated by $\hat{s}$.

2.3.2 Decision Window of the Classifier

The so-called decision window (DW) used for classification is a time period covering a predetermined number of audio frames; the frames within it are analyzed successively, and a decision is then made as to whether an audio event has been detected over the associated time span. For each audio frame, two likelihood scores, the normal and the singular, are computed using Eq. (2-34) based on the two GMMs. Within the decision window, the normal and the singular scores are each converted to log values and accumulated, and whichever accumulated score is greater determines the class of the DW, normal or singular, as indicated by Eq. (2-36). In conventional processing, a fixed-length DW (e.g., 0.5 s, 1 s or 2 s) is used to accumulate the log-likelihood scores of the audio frames [28, 30, 32], as shown in Fig. 2.5, so that the number of frames covered by the DW time span is held constant at n.

[Figure: a sequence of audio frames partitioned into consecutive decision windows (DW) of equal length.]

Fig. 2.5. The conventional fixed-length decision window (DW).
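The fixed-length decision rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the per-frame log-likelihoods are taken as already computed by the two GMMs, the function name `classify_windows` and the sample scores are hypothetical, and trailing frames that do not fill a whole window are simply ignored.

```python
def classify_windows(normal_ll, singular_ll, n):
    """Label each fixed-length decision window of n frames.

    normal_ll and singular_ll hold one log-likelihood per audio frame,
    assumed to come from the normal and the singular GMM respectively.
    """
    labels = []
    # Step through the frame sequence one whole window at a time.
    for start in range(0, len(normal_ll) - n + 1, n):
        normal_sum = sum(normal_ll[start:start + n])
        singular_sum = sum(singular_ll[start:start + n])
        # The class with the greater accumulated log-likelihood wins.
        labels.append("singular" if singular_sum > normal_sum else "normal")
    return labels

# Hypothetical scores for six frames, grouped into windows of n = 3 frames.
normal_ll = [-1.0, -1.2, -0.9, -4.0, -4.5, -3.8]
singular_ll = [-3.0, -2.8, -3.1, -1.1, -0.9, -1.3]
labels = classify_windows(normal_ll, singular_ll, n=3)
```

With these made-up scores the first window is labeled normal and the second singular. Because n is fixed, a decision is issued only once every n frames, which is the sizing trade-off discussed in the text.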

Audio event detection systems using a fixed-length decision window are common; examples include the detection of gunshots [28] and of coughing in an office [30].

The use of a fixed-length DW, however, is plagued by the problem of window sizing.

A relatively narrow DW may increase the rate of false alarms when the background acoustic condition fluctuates suddenly and abruptly, whereas an overly wide DW may fail to meet the need for real-time response because decisions are made only at long intervals. This dissertation therefore proposes an audio event detection system with a variable-sized DW governed by a fuzzy logic controller, which regulates the length of the DW according to recent developments in the background acoustics, so that the system remains alert to the occurrence of the specific audio event even in the presence of a complicated background environment.

Chapter 3 Fuzzy Set Theory and Logic Control

Fuzzy set theory, since its inception in 1965 by Lotfi A. Zadeh [81-86], has developed with tremendous success, in both depth and breadth, in its theory and in its applications to practical and difficult problems of various kinds.

Fuzzy set theory has embraced (or, conversely, been embraced by) many well-established mathematical disciplines such as logic/inference, probability/statistics, graphs/relations and algebra, and has given rise to a whole new series of theoretical results through the injection of a new ingredient, fuzziness, which opens the gate to a new dimension and to the exploration of the associated issues. Zadeh gave perhaps the best interpretation of fuzziness of all: “everything is a matter of degree”. Because of this capacity for dealing with fuzziness, large-scale applications that are inherently uncertain, ambiguous or imprecise can be handled with systematic engineering approaches on solid theoretical ground. How successful has fuzzy set theory been, and how successful will it be? Perhaps the fact that, in many domains, strategic or operational decisions once made by human domain experts over their professional careers are now made with confidence by fuzzy systems of all kinds says it all. The reference by Zimmermann [87] is highly recommended for an overall picture of the theoretical, technical and application aspects of the development, whether in detail or in a grand view.

Strangely enough, although the theory was conceived in the West more than four decades ago, Western academic circles did not seem to realize its value until the end of the 1980s, when the Japanese began, almost overnight, advocating “Everything is of Fuzzy and by Fuzzy” in their home appliances and industrial controllers. Fuzzy logic control is in fact merely one of the innumerable applications of fuzzy theory; it refers to the use of fuzzy logic operations for automating an engineering or technical process, usually one that is small in scale and operated by hand, for instance temperature control or, in the author’s case, audio information processing.

In the following, the intuition behind fuzzy theory, including its constituent entities, is introduced. The major components of the operational space are as follows [87]:

(1) Fuzzy Set $\tilde{A}$:

$$\tilde{A} = \{(x, \mu_{\tilde{A}}(x)) \mid x \in X\}$$

where

$X$ refers to a set of entities of a certain attribute, such as AGE or LOOKING, that is ambiguous to some degree in nature; for instance, $X$ may concern the matter of AGE and contain the 8 elements VERY YOUNG, QUITE YOUNG, YOUNG, MORE OR LESS YOUNG, MORE OR LESS OLD, OLD, QUITE OLD and VERY OLD, or the matter of LOOKING, from UNBEARABLY UGLY to ASTONISHINGLY PRETTY, etc.

$\mu_{\tilde{A}}(x)$ are measures, called the membership functions of $\tilde{A}$, giving the degree of the specific attribute for $x$ (i.e., how old/young in terms of a value), defined by

$$\mu_{\tilde{A}}(x): D_X \to [0, 1],$$

and $D_X$ is an interval of scalars or a vector space associated with $X$. $D_X$ may be [0, 130] as far as the human lifespan is concerned, or [1, 10] when talking about one’s LOOKING.

(2) Operators on Fuzzy Sets:

Operations for conventional crisp sets are extended so that aggregation, differentiation and other desired operations upon two or more fuzzy sets are meaningful or meet the requirements of the application. For instance, [...] are quite common, and the negation of $\tilde{A}$ is very often defined as

$$\mu_{\mathrm{NOT}\,\tilde{A}}(x) = 1 - \mu_{\tilde{A}}(x).$$

The fuzzy operators for generic operations on fuzzy sets are usually referred to as fuzzy connectives, based on which higher levels of logic analysis, inference and reasoning can be realized.

(3) Measure of Fuzziness:

A measure of the fuzziness of the fuzzy set in consideration, $\mathrm{Fuzz}(\tilde{A})$, is often required. Its formulation is of course a function of $\mu_{\tilde{A}}(\cdot)$, is application-oriented, and preferably possesses properties like

$\mathrm{Fuzz}(\tilde{A})$ [...] $0$ [...]

The issues involved in applying fuzzy sets to logic control are outlined in Section 3.2.
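To make the notions of membership function and fuzzy connective more concrete, the following Python sketch builds a fuzzy set for the AGE example and applies a few operators to it. The piecewise-linear membership shapes and their breakpoints (25, 45, 50, 70) are purely hypothetical. The complement follows the negation given above; the min and max operators are Zadeh's standard choices for intersection and union and are used here as an assumption, since other connectives are possible.

```python
def mu_young(age):
    """Hypothetical membership of an age (in years) in the fuzzy set YOUNG."""
    if age <= 25:
        return 1.0
    if age >= 45:
        return 0.0
    return (45 - age) / 20  # linear descent between 25 and 45

def mu_old(age):
    """Hypothetical membership of an age (in years) in the fuzzy set OLD."""
    if age <= 50:
        return 0.0
    if age >= 70:
        return 1.0
    return (age - 50) / 20  # linear ascent between 50 and 70

def fuzzy_not(mu):
    """Complement: 1 minus the membership value."""
    return 1.0 - mu

def fuzzy_and(mu_a, mu_b):
    """Intersection by the min operator (assumed connective)."""
    return min(mu_a, mu_b)

def fuzzy_or(mu_a, mu_b):
    """Union by the max operator (assumed connective)."""
    return max(mu_a, mu_b)

# The fuzzy set YOUNG sampled on the domain [0, 130] as pairs (x, mu(x)).
young = [(age, mu_young(age)) for age in range(0, 131, 5)]

age = 40
not_young = fuzzy_not(mu_young(age))
young_or_old = fuzzy_or(mu_young(age), mu_old(age))
young_and_old = fuzzy_and(mu_young(age), mu_old(age))
```

Under these assumed shapes, a 40-year-old belongs to YOUNG with degree 0.25 and to NOT YOUNG with degree 0.75, which illustrates that membership is a matter of degree rather than a yes/no property.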

3.1 Fuzzy Schemes and Speech Recognition

Fuzzy approaches have been widely applied to the field of speech recognition for many years, playing a variety of roles ranging from data clustering and logic reasoning to neural network configuration.

(1) Fuzzy data clustering:

In [88], Bezdek developed a clustering algorithm to remedy the weaknesses of the K-Means clustering algorithm, in which a fuzzy scheme was exploited to account for the relationships among data attributes. Bezdek’s method later became quite popular and is widely known as the FCM (Fuzzy C-Means) algorithm. In [89], a revised version of the FCM algorithm was used to generate a phonetic tied-mixture HMM (FPTM) in order to reduce the number of parameters and improve the robustness of parameter training. In the work by Li et al. [90], FCM was applied to Mandarin four-tone recognition, where the tone is determined by the maximum membership. Tran et al. presented a generalized fuzzy treatment using FCM and fuzzy entropy in statistical modeling for speech recognition [91].

Another line of fuzzy data clustering concerns the use of vector quantization (VQ). VQ is a standard technique for jointly quantizing a set of scalars (mathematically, the components of a vector) so that statistical dependencies among them, if any exist, can be exploited to obtain optimal reconstruction levels or steps in the coding process. From the perspective of data classification, the result of VQ is effectively a data clustering, and VQ has been widely employed in high-dimensional data applications, including speech recognition [92-94]. VQ variants with fuzzy ingredients introduced into the quantization process (fuzzy VQ, FVQ) have also been used for speech recognition. In [95], a minimum FVQ error criterion was devised for unsupervised speaker adaptation, and it was shown that the same recognition accuracy as supervised speaker adaptation could be achieved by minimizing the overall FVQ error.

Based on the concept of FVQ, Shikano et al. proposed a fuzzy codebook mapping algorithm for speaker adaptation, which maps a given speaker to a standard speaker [96]. In addition, in the work by Lin et al., the FVQ technique was embedded in a neural network for isolated-word speech recognition [97, 98]. In [99], a composite of a Multi-Layer Perceptron (MLP) neural network and FVQ was presented. Compared with MLP-VQ, MLP-FVQ provides richer information about the recognition result, namely an output vector whose components indicate the relative closeness of each label to the input.

(2) Fuzzy logic and reasoning applications:

Fuzzy logic and reasoning have also been applied to speech recognition recently. In [100, 101], Zhao and Woo proposed a fuzzy speech recognition approach that applies fuzzy logic to the power distribution pattern of a speech segment. Compared with speech recognition using typical hidden Markov models, the fuzzy-logic approach was simpler to implement in real-time recognition systems. In the work by Halavati et al. [102], the speech spectrogram was converted into a linguistic description based on arbitrary colors and lengths, after which fuzzy measures, fuzzy reasoning and a genetic algorithm were used to describe phonemes, perform the recognition procedure and optimize the phoneme definitions, respectively.

(3) Fuzzy neural network applications:

Fuzzy neural networks (FNNs), which combine fuzzy logic with neural networks, have frequently been used in speech recognition lately. In contrast to conventional HMM-based recognition, an FNN has the advantages of efficient learning, adaptation and a connectionist structure when carrying out speech recognition [103]. In [104], a neuro-fuzzy classifier was designed to perform SI model speech recognition; the classifier is an MLP model incorporating fuzzy operations and therefore inherits the strengths of both neural networks and fuzzy systems. Kasabov et al. applied an FNN to model a phoneme-based speech recognition system, which achieved quite satisfactory recognition performance [105]. In [106], a variant of the FNN called the modular general fuzzy min-max (MGFMM) neural network was proposed, which modifies the transfer function of the output layer of the general fuzzy min-max (GFMM) neural network to improve recognition accuracy.

Other related work on speech recognition by FNN can be found in [107-109]. In the specific area of speaker adaptation, however, fuzzy schemes or mechanisms are rarely seen. Lin et al. proposed a speaker adaptation scheme in a perceptron neural network for speech recognition [110], where the fuzzy perceptron approach is applied to generate hyperplanes that separate the speech patterns of each class from the others. In particular, speaker adaptation is treated as a procedure of tuning the trained hyperplanes whenever a new speaker causes a recognition error. The work by Lin et al. is thus essentially a fuzzy-neural classification of speech patterns in the perceptron neural space rather than an adaptation scheme as claimed. In addition, Gales applied a fuzzy scheme to MLLR speaker adaptation (Fuzzy-MLLR) to further enhance the classification of the regression matrices of MLLR [111], which by nature belongs to the fuzzy clustering applications.

Although fuzzy approaches have been widely used in various sub-areas of speech recognition, as reviewed above, the use of an FLC in HMM speaker adaptation, or even in speech recognition in general, has not been reported. Based on the FLC methodology, a series of speaker adaptation computations regulated by an FLC are designed in this dissertation. The basic concepts and architecture of the FLC underlying the main theme of the dissertation are introduced in the following sections.

3.2 Fuzzy Logic Controller (FLC)

As mentioned earlier, fuzzy logic control concerns the automation of a control process. The operator’s knowledge, expertise and experience regarding the process control, imparted in oral or written form, have to be translated to fit the framework of fuzzy logic control, together with other accommodations or extensions of fuzzy set theory specific to the particular application. An excellent book on this subject by Zadeh et al. [112] is highly recommended.

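[Figure: block diagram in which the Input x passes through the Fuzzifier, the Inference Engine (supported by the Fuzzy Rule Base) and the Defuzzifier to the Process, producing the Output y; if closed-loop control is required, the output is fed back.]

Fig. 3.1. Architecture of a typical FLC.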

Fig. 3.1 shows the architecture of a typical FLC; the role of each constituent module and of the input/output is described as follows.

(1) Input:

usually signals or quantities of certain attribute in precise magnitudes (e.g., temperature measured in Celsius)

(2) Fuzzifier:

the precise and exact values of the input have to be transformed by the fuzzifier through the use of membership functions such that fuzzy implications like MODERATE, VERY LOW or HIGH could be attached so as to be processed by the next module.

(3) Inference Engine:

performing analysis or reasoning on the input information for making control decision like “PUT ON A LIGHT JACKET”, “TURN ON THE HEATER A BIT MORE” or a conclusion like “THE AUTUMN HAS COME” under the constraints from the Fuzzy Rule Base.

(4) Fuzzy Rule Base:

a representation of the domain knowledge in fuzzy terms, typically in either of the two forms: [...] and $\tilde{N}$, respectively.

(5) Defuzzifier:

the control action decided in the fuzzy context has to be transformed so that a corresponding exact value, such as “open the valve of the heat outlet by 10%”, is available for sending to the physical world of the process.
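The chain of modules listed above can be illustrated with a toy heater controller in Python. Everything in this sketch is an assumption made for illustration: the three temperature labels, the triangular membership functions, the rule table mapping each label to a valve opening, and the weighted-average defuzzification (one common choice among several). It is not the controller designed in this dissertation.

```python
def tri(x, left, peak, right):
    """Triangular membership function rising from left to peak, then falling."""
    if x <= left or x >= right:
        return 0.0
    if x <= peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)

def fuzzify(temp_c):
    """Fuzzifier: map a crisp temperature (Celsius) to linguistic degrees."""
    return {
        "COLD": tri(temp_c, -10.0, 5.0, 20.0),
        "MILD": tri(temp_c, 10.0, 20.0, 30.0),
        "HOT": tri(temp_c, 20.0, 35.0, 50.0),
    }

# Fuzzy rule base (hypothetical): IF temperature is <label>
# THEN open the heat-outlet valve by <percent>.
RULES = {"COLD": 90.0, "MILD": 40.0, "HOT": 0.0}

def infer(memberships):
    """Inference engine: fire every rule with the degree of its antecedent."""
    return [(memberships[label], opening) for label, opening in RULES.items()]

def defuzzify(fired):
    """Defuzzifier: weighted average of the rule outputs gives a crisp value."""
    total = sum(weight for weight, _ in fired)
    if total == 0.0:
        return 0.0  # no rule fires; keep the valve closed in this sketch
    return sum(weight * opening for weight, opening in fired) / total

def flc(temp_c):
    """Crisp input -> fuzzifier -> inference engine -> defuzzifier -> crisp output."""
    return defuzzify(infer(fuzzify(temp_c)))

valve_opening = flc(15.0)
```

At 15 degrees the input is partly COLD (degree 1/3) and partly MILD (degree 1/2), so the controller blends the two rule outputs into a valve opening of about 60 %. Closing the loop would mean feeding the measured process temperature back into `flc` at every control step.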

In designing an FLC, various issues regarding the structure or operation of each module in Fig. 3.1 have to be considered, and how well they are handled determines the success of the FLC operation. These issues are as follows:

(a) Input x:

- how many input signals are required
- the scaling of each signal, etc.

(b) Fuzzification:

- the number and types of membership functions required

(c) Rule base:

- the number of rules
- the number of antecedents, and the weights and membership functions of the antecedent/consequence associated with each rule
- the structure of the rule base

(d) Inference engine:

- connectives for aggregating antecedents
- inference/reasoning schemes to be employed
- operators for aggregating the consequences of individual rules to generate a decision

Various types of FLCs have been proposed, differing in these module design considerations. The renowned Mamdani and Sugeno FLCs, for instance, differ in the design of the consequence of each individual rule: the former generates the consequence as a member of the fuzzy set associated with the linguistic variables pertaining to, say, the control action, whereas the latter produces a consequence in crisp form (e.g., as a scalar function of the inputs in the antecedent).

Another issue in FLC design is to take into account, from the temporal perspective, potential variations in the process itself, for which the use of time-variant parameters in the FLC design becomes unavoidable; i.e., the FLC should preferably adapt to the time-varying process. Basically, the adaptation can be done by modifying either the rule set or the fuzzy sets, resulting in two classes of FLCs, the self-organizing and the self-tuning FLC, respectively.

3.3 Takagi-Sugeno (T-S) FLC

The Takagi-Sugeno fuzzy model, proposed by Takagi and Sugeno, has been widely used since it is conceptually simple and straightforward [113]. An early and famous application of this type of fuzzy system is the parking control of a model car [114], in which an FLC is designed to drive a model car into a designated parking space, as shown in Fig. 3.2.

[Figure: a car maneuvering between a front wall and a side wall toward a row of garages.]

Fig. 3.2. Sugeno’s FLC for car parking.

The parking FLC by Sugeno was designed with the following specifications.

(1) Three inputs:

- (x, y) for the car position,
- the car orientation angle.

Two outputs:

- f for the front-wheel angle while driving forward,
- b for the front-wheel angle while driving backward.

(2) A rule base:

- 18 rules for driving forward, in which the antecedents involve x, y and the orientation angle, and the consequence f is also a function of x, y and the orientation angle,
- 18 rules for driving backward, with similar rule forms,
- 6 rules for speed control.

Based on these specifications, the control goal is to construct a succession of alternating forward and backward driving actions, with appropriate speed and turning, such that the car is properly parked in position. Fig. 3.2 shows two parking trajectories produced by the FLC, which are remarkably similar to those of human drivers. The performance is of course quite encouraging, and it paved the way for subsequent successful applications to many general control problems up to the present day.

For a complex system, the T-S fuzzy design procedure provides a systematic framework for fuzzy modeling. Fig. 3.3 illustrates the design methodology. The system is decomposed into a set of subsystems whose local behaviors are identified by expressing the input-output mapping as a fuzzy implication (or rule), in which the inputs are specified in the antecedent part and the output is a linear combination of the associated inputs. The overall system output is then a function of the subsystem outputs, which could be as simple as a “linear” combination, where the fuzziness of the system behaviors is taken care of in the handling of the coefficients, or could take other, more elaborate forms.

[Figure: a complex system is converted into a Takagi-Sugeno fuzzy model through a physical model, local/subsystem decomposition and input-output identification.]

Fig. 3.3. Designs of Takagi-Sugeno (T-S) fuzzy model.

Through the system decomposition, the system dynamics, which is generally complicated and nonlinear, is captured in a set of linear system models and fuzzy mechanisms are incorporated wherever necessary. The application of T-S fuzzy modeling is thus quite straightforward.

Under the framework of T-S fuzzy model, a generic system can be formulated as a set of fuzzy implications (or rules) together with a system output determined by consequences in the set of implications. And the system representation would be of the form

Rule 1: IF $x(1)$ is $A_1^1$ and … and $x(n)$ is $A_n^1$

THEN $y^1 = a_0^1 + a_1^1 x(1) + \cdots + a_n^1 x(n)$,

⋮

Rule i: IF $x(1)$ is $A_1^i$ and … and $x(n)$ is $A_n^i$

THEN $y^i = a_0^i + a_1^i x(1) + \cdots + a_n^i x(n)$, (3-1)

⋮

Rule l: IF $x(1)$ is $A_1^l$ and … and $x(n)$ is $A_n^l$

THEN $y^l = a_0^l + a_1^l x(1) + \cdots + a_n^l x(n)$,

System output:

$$y = \frac{\sum_{i=1}^{l} w^i y^i}{\sum_{i=1}^{l} w^i}, \quad \text{given that } w^i = \prod_{p=1}^{n} A_p^i(x(p)), \tag{3-2}$$

for a system of n inputs and l implications. Note that $A_p^i$, $p = 1, \ldots, n$, are fuzzy sets and $A_p^i(x(p))$ denotes the value of the membership function associated with $A_p^i$ for the input $x(p)$; $a_p^i$, $p = 0, 1, \ldots, n$, are consequent parameters through which the i-th consequence $y^i$ is expressed as a linear combination of the n inputs.

The output of this system is a weighted average of the rule consequences: in Eq. (3-2), an interpolation is performed among the different linear functions (local models).

Fig. 3.4 depicts the phenomenon of the smooth interpolation of the local models.
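The interpolation among local models can be sketched in Python as follows. The sketch reads Eq. (3-2) as a normalized weighted sum in which each rule weight is the product of the membership values of the inputs; that reading, the Gaussian membership shape, and the two example rules are assumptions made for illustration only.

```python
import math

def gaussian_mf(center, width):
    """Return a Gaussian membership function (an illustrative choice)."""
    def mf(x):
        return math.exp(-0.5 * ((x - center) / width) ** 2)
    return mf

def ts_output(x, rules):
    """Evaluate a T-S model for the input vector x = [x(1), ..., x(n)].

    Each rule is a pair (membership_fns, coeffs): membership_fns holds one
    membership function per input, and coeffs = [a0, a1, ..., an].
    """
    numerator = 0.0
    denominator = 0.0
    for membership_fns, coeffs in rules:
        # Rule weight w^i: product of the membership values of the inputs.
        w = math.prod(mf(xp) for mf, xp in zip(membership_fns, x))
        # Local linear consequence y^i = a0 + a1*x(1) + ... + an*x(n).
        y_i = coeffs[0] + sum(a * xp for a, xp in zip(coeffs[1:], x))
        numerator += w * y_i
        denominator += w
    if denominator == 0.0:
        raise ValueError("no rule fires for this input")
    return numerator / denominator

# Two hypothetical rules on a single input.
rules = [
    ([gaussian_mf(0.0, 1.0)], [0.0, 1.0]),   # near x = 0: y = x
    ([gaussian_mf(2.0, 1.0)], [4.0, -1.0]),  # near x = 2: y = 4 - x
]
y_mid = ts_output([1.0], rules)
```

At x = 1 both rules fire equally, so the output is the midpoint of the two local predictions (1 and 3), i.e. 2 up to rounding. Closer to x = 0 or x = 2 the corresponding local model dominates, which produces the smooth blending between local models.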

[Figure: output y plotted against input x, showing the local models and the membership functions $A^1(x)$, $A^2(x)$ and $A^3(x)$.]

Fig. 3.4. System output of T-S fuzzy model in an interpolation form.

The T-S fuzzy model has been applied successfully to the control of systems as complicated as an electric power plant [115, 116], and it is employed in the author’s research on speaker adaptation schemes and audio event detection, as detailed in the next four chapters.

Chapter 4 Speaker Adaptation Based on MAP Estimation Using Fuzzy Controller

As mentioned in Section 2.2.1, MAP adaptation is a kind of direct model adaptation, which attempts to directly re-estimate the model parameters [58].

However, MAP adaptation re-estimates only those model parameter units associated with the adaptation data. It therefore usually needs a large amount of adaptation data, and its performance improves as the adaptation data increase and come to cover the model space. When the amount of data is sufficiently large, MAP estimation yields recognition performance as good as that obtained using maximum-likelihood estimation [55]. As shown in Eq. (2-17),

$$\hat{\mu}_k = \frac{N_k}{N_k + \tau}\,\bar{y}_k + \frac{\tau}{N_k + \tau}\,\mu_k, \tag{2-17}$$

the MAP estimate of the mean is essentially a weighted average of the prior mean $\mu_k$ and the sample mean $\bar{y}_k$, and the weights are functions of the number of adaptation samples $N_k$, given that $\tau$ is fixed. When $N_k$ is equal to zero (i.e., no additional training data are available for adapting the k-th Gaussian), the estimate is simply the prior mean of the k-th Gaussian alone. Conversely, when a large number of training samples are used for the k-th Gaussian ($N_k \to \infty$, to exaggerate), the MAP estimate in Eq. (2-17) converges asymptotically to the maximum likelihood estimate, i.e., the sample mean of the k-th Gaussian, $\bar{y}_k$.
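The two limiting cases just described can be checked numerically with a short Python sketch. It assumes that Eq. (2-17) has the standard MAP form in which the estimate is a weighted average of the prior mean and the sample mean, with the fixed prior weight written here as `tau`; the symbol name, the function name `map_mean` and the numbers used are all hypothetical, and scalar means are used for simplicity.

```python
def map_mean(prior_mean, sample_mean, n_k, tau):
    """MAP estimate of a Gaussian mean as a weighted average.

    n_k is the number of adaptation samples assigned to the k-th Gaussian
    and tau > 0 is the fixed weight given to the prior mean (assumed form).
    """
    return (n_k * sample_mean + tau * prior_mean) / (n_k + tau)

prior_mean, sample_mean, tau = 0.0, 1.0, 10.0

no_data = map_mean(prior_mean, sample_mean, 0, tau)        # prior mean only
some_data = map_mean(prior_mean, sample_mean, 10, tau)     # halfway between
much_data = map_mean(prior_mean, sample_mean, 10000, tau)  # near sample mean
```

With no adaptation data the estimate stays at the prior mean; with as many samples as `tau` it lies halfway between the two means; and with many samples it approaches the sample mean, as the text states.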

Now consider the other way round with N_k being fixed, the parameter 
