1. Introduction
1.3 Audio Event Detection
Conventional security, surveillance and remote homecare systems rely heavily, if not exclusively, on visual information (i.e. data captured by video cameras) to detect the events under consideration [21-24], typically through motion tracking and analysis techniques. A similar development is seen in multimedia retrieval and indexing applications, where video information is the major concern and audio cues have only recently been involved, and then only in an auxiliary role, for detecting specific shots in a video sequence [25-27]. Depending solely on visual data to capture how a situation develops is inevitably confronted by the limitations inherent in the image acquisition process:
A video camera is a directional sighting device, so views lying beyond the camera’s visual angle are “unseen”.
When the scene is in darkness or overexposed, activities taking place within it become “unseen”.
A scenario such as two gangsters threatening to kill each other right in front of the video camera, both with smiles on their faces, goes “unnoticed” though “clearly seen”.
Note that in all these circumstances, acoustic data can act as a complementary source of information reflecting the auditory aspect of the scene. Considering further that almost all living creatures that move around in their habitats are equipped with organs for both visual and aural perception, any security, surveillance or remote homecare system that dismisses audio information is effectively a crippled one. In fact, species that can “hear” much better than they can “see” are more numerous than one might expect; scotopic animals, oceanic mammals and, of course, moles are only a few examples.
As a result, audio event detection has been receiving considerably more attention in recent years, and the fundamental issues include
(1) Categorization of the various kinds of sounds encountered in daily life, of which the sources may be
artificial: gunshots [28], door opening/closing and glass breaking [29].
human activities: coughing [30], voices under different emotions [31], crying, talking, walking and running [32], and female screaming, which is addressed in this dissertation (Chap. 7).
nature: wildlife activities [33] and ordinary or catastrophic phenomena [34].
Note that the categories are not limited to the above three, and in each category the interesting subjects to be explored are virtually unlimited; “detecting a tiny mouse blowing wind one mile away”, to borrow a line of dialog from the early-1980s movie “Blue Thunder”, is just one such example, if the author may be allowed.
(2) Internal representation and modeling of a designated type of sound, so that it can be differentiated from other sounds and from the background acoustics, as was done in [32] and [28], where multi-level or hierarchical trees are utilized for a more elaborate audio representation of several human activities and of different types of gunshots, respectively.
(3) Representation and modeling of the background acoustics against which the comparison can be made for audio event detection, as was done in [35] for background noise analysis and in [36] for online adaptation in background modeling, where the idea of acoustic background modeling is carried over from its precedent in video background modeling [37].
A typical audio event detection process starts with a stream of audio frames arriving at the system at regular time intervals. Analysis is performed every time a fixed number of frames has been collected (or, equivalently, after a pre-determined time span called the decision window, DW, has elapsed) so as to decide whether the designated audio event has occurred. The author proposes a variable-length decision window, whose length is governed by a fuzzy mechanism, to eliminate the deficiencies of fixed-length DW approaches, as detailed in Chap. 7.
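The fixed-length decision window described above can be sketched as follows. This is a minimal illustration only: the function names, the non-overlapping windows and the energy-threshold detector are assumptions made for this sketch, not the detector used in this dissertation (which is based on Gaussian mixture models).

```python
from typing import Callable, Iterable, Iterator, List, Sequence

Frame = Sequence[float]  # one audio frame = a short run of samples


def fixed_window_decisions(
    frames: Iterable[Frame],
    window_len: int,
    detector: Callable[[List[Frame]], bool],
) -> Iterator[bool]:
    """Yield one decision each time `window_len` frames have been collected."""
    if window_len <= 0:
        raise ValueError("window_len must be positive")
    buffer: List[Frame] = []
    for frame in frames:
        buffer.append(frame)
        if len(buffer) == window_len:
            yield detector(buffer)
            buffer = []  # assumption: consecutive windows do not overlap


def energy_detector(window: List[Frame], threshold: float = 0.5) -> bool:
    """Hypothetical stand-in detector: mean sample energy above a threshold."""
    total = sum(x * x for frame in window for x in frame)
    count = sum(len(frame) for frame in window)
    return count > 0 and total / count > threshold
```

With three quiet frames followed by three loud ones and `window_len=3`, the generator produces two decisions, one per window; trailing frames that do not fill a window produce no decision.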
The rest of the dissertation is organized as follows. In Chap. 2, an overview of automatic speech recognition based on hidden Markov models for Mandarin is given, together with the mathematical background for the two speaker adaptation techniques in popular use: the MAP-VFS composite and MLLR. Also described in Chap. 2 is audio event detection based on Gaussian mixture models. In Chap. 3, a general framework of fuzzy logic control is described, covering problem formulation by fuzzification, the establishment of the fuzzy rule base and inference mechanism, and defuzzification for the final quantitative outputs.
The main theme of this dissertation concerns the enhancement of existing speaker adaptation schemes by additional tuning according to the availability of adaptation data, and of audio event detection by scanning the audio stream with a variable-sized decision window, both being governed by a general fuzzy mechanism; the formulations and implementations are explained in Chapters 4, 5, 6 and 7.
The concluding remarks are given in Chap. 8.
Chapter 2
Overview on Speech Recognition and Audio Event Detection
In the realm of man-machine interaction, audio processing no doubt receives far less attention than it deserves when compared with the resources and efforts invested in its counterpart, video processing. This has been so for decades, despite the fact that for thousands of years of human history instant and precise communication among individuals was mostly realized via the auditory channel: speaking and listening (imagine the age before the invention of writing in ancient civilizations).
Though the ability to understand what others are talking about is indispensable in social interactions, the auditory perceptual skill of differentiating one kind of sound from the others heard in one’s living surroundings is far more important; for instance, being able to tell another person’s “HELLO” from the noise of a vehicle’s hard braking, a gunshot from an explosion, a duck’s quack from a goose’s honk, or Spanish from Italian without understanding either language, could be life-saving, or at least useful or even amusing, in one’s daily life.
Paradoxically enough, the development of audio processing evolved in the opposite order: speech recognition was addressed far ahead of audio event detection, which did not become visible on the stage until recent years.
Before the analysis of audio information can commence, pre-processing of the input audio signals is generally required to extract acoustic features for the application under consideration, which in this dissertation means speech recognition and audio event detection. As illustrated in Fig. 2.1, the major steps in the front-end processing are briefly explained as follows [38, 39]:
[Fig. 2.1 block diagram: speech signal input, analog to digital conversion (A/D conversion), pre-emphasis, framing, Hamming windowing, feature extraction, feature vector output]
Fig. 2.1. The front-end processing procedure in preparation of subsequent audio analysis.
(1) A/D conversion:
The analog input data is converted into digital forms by sampling and A/D conversion.
(2) Pre-emphasis:
Components in the high-frequency band are enhanced.
(3) Framing:
Samples of audio data are divided into frames, each consisting of the same pre-determined number of samples.
(4) Hamming windowing:
Discontinuity at the boundary of two consecutive frames is smoothed.
(5) Feature extraction:
From each frame, various parameters are extracted and a feature vector representing acoustic characteristics of the audio input in the associated time period is thus derived.
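Steps (2) to (4) above can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the pre-emphasis coefficient 0.97, the frame length of 400 samples, the hop of 160 samples (which makes consecutive frames overlap) and all function names are choices made for this sketch and are not taken from the dissertation. Feature extraction (LPC, LPCC or MFCC) is left out.

```python
import math
from typing import List, Sequence


def pre_emphasize(signal: Sequence[float], alpha: float = 0.97) -> List[float]:
    """Enhance high frequencies: y[n] = x[n] - alpha * x[n-1]."""
    if not signal:
        return []
    return [signal[0]] + [
        signal[n] - alpha * signal[n - 1] for n in range(1, len(signal))
    ]


def split_frames(samples: Sequence[float], frame_len: int, hop: int) -> List[List[float]]:
    """Cut the samples into frames of equal length, `hop` samples apart."""
    if frame_len <= 0 or hop <= 0:
        raise ValueError("frame_len and hop must be positive")
    last_start = len(samples) - frame_len
    return [list(samples[s:s + frame_len]) for s in range(0, last_start + 1, hop)]


def hamming(n: int) -> List[float]:
    """Hamming window: 0.54 - 0.46 * cos(2*pi*i / (n - 1))."""
    if n == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]


def front_end(signal: Sequence[float], frame_len: int = 400,
              hop: int = 160, alpha: float = 0.97) -> List[List[float]]:
    """Pre-emphasis, framing and Hamming windowing; returns windowed frames."""
    window = hamming(frame_len)
    frames = split_frames(pre_emphasize(signal, alpha), frame_len, hop)
    return [[x * w for x, w in zip(frame, window)] for frame in frames]
```

Each windowed frame returned by `front_end` would then be passed to the feature extractor to obtain one feature vector per frame.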
For speech recognition and human-voice-related applications, linear predictive coefficient (LPC) parameters, LPC cepstrum (LPCC) parameters and mel frequency cepstral coefficient (MFCC) parameters are the three most frequently seen in practice, and are employed in the author’s research.
In this chapter, the theoretical background for two fundamental technical issues in speech recognition, namely HMM speech modeling and speaker adaptation, is given; the chapter closes with certain primary issues pertaining to audio event detection.
2.1 HMM Speech Modeling
The modeling of speech patterns can be implemented with neural networks (NN, [40-42]), support vector machines (SVM, [43, 44]) or hidden Markov models (HMM), the last of which is, to the author’s knowledge, by far the most popular and widely used.
2.1.1 HMM and Mandarin Syllable Modeling
An HMM is basically a stochastic process operating on an underlying Markov chain with a finite number of states and the same number of random functions: at any given instant of time the process is in one of the states, the random function associated with that state generates the observation, and the transition probabilities of the chain determine what the next state will be. How an HMM is cast into a model for a specific application and how the model parameters are estimated are addressed in [45-47]; in practice a state transition probability matrix is used to describe the probability of going from one state to the others, which in effect defines the Markov chain at work. Applications of HMMs to speech recognition can be found in many references, of which [48-51] are some examples. The work by C. H. Lin et al. [50] is particularly noteworthy: it established a framework for syllable recognition that later became a widely accepted standard for modeling Mandarin syllables with tones. According to this framework, each Mandarin syllable consists of an initial part and a final part, each called a sub-syllable. The HMM modeling of Mandarin syllables assumes that the initial part is right-dependent on the beginning phone of the following final part and that the final part is context independent. A Mandarin utterance may contain one to several syllables; the HMM of an utterance thus comprises the HMMs of its constituent syllables. In the author’s implementation, the HMM of a syllable consists of a 3-state HMM for the initial part and a 6-state HMM for the final part, and in total there are 440 states for all Mandarin sub-syllables. The HMMs of the initial sub-syllable in 3 states and the final sub-syllable in 6 states are depicted in Fig. 2.2 and Fig. 2.3, respectively, where each circle represents a state and $P_{ij}$ represents the probability of the transition from state $i$ to state $j$. The HMM employed in the author’s research is referred to as a left-to-right model since only left-to-right transitions are allowed; i.e. the transition from each state is limited to two alternatives: either moving to the right-hand neighbor or staying at the current state.
[Fig. 2.2 diagram: states 1, 2 and 3, with self-transitions P11, P22, P33 and forward transitions P12, P23]
Fig. 2.2. 3-state HMM model for the initial sub-syllable.
[Fig. 2.3 diagram: states 1 to 6, with self-transitions Pii and forward transitions Pij]
Fig. 2.3. 6-state HMM model for the final sub-syllable (i = 1 ~ 6).
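As a small illustration of the left-to-right constraint, the transition matrix of such a model can be built from the self-transition probabilities alone, because every row has at most two non-zero entries. The function name and the numerical values below are hypothetical, and the last state is made absorbing here only because the sketch models one sub-syllable in isolation; in an utterance model its remaining probability mass would lead into the next sub-syllable.

```python
from typing import List, Sequence


def left_to_right_transitions(stay_probs: Sequence[float]) -> List[List[float]]:
    """Build an N x N left-to-right transition matrix.

    `stay_probs` holds the self-transition probabilities P11 .. P(N-1)(N-1)
    of the first N-1 states; state i either stays (P_ii) or moves to its
    right-hand neighbor (P_i,i+1 = 1 - P_ii). All other entries are zero.
    """
    n = len(stay_probs) + 1
    matrix = [[0.0] * n for _ in range(n)]
    for i, p in enumerate(stay_probs):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
        matrix[i][i] = p
        matrix[i][i + 1] = 1.0 - p
    matrix[n - 1][n - 1] = 1.0  # assumption: last state absorbs in isolation
    return matrix


# Hypothetical values for the 3-state initial sub-syllable model
A_initial = left_to_right_transitions([0.5, 0.75])
```

Every row of the resulting matrix sums to one, and no entry below the diagonal is non-zero, which is exactly the "stay or move right" restriction.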
2.1.2 Estimation and Decoding of HMM
Mathematically, a hidden Markov model can be represented by the parameter set $\lambda = (A, B, \pi)$. The underlying Markov chain of $N$ states $S_1, S_2, \ldots, S_N$ can be specified by an initial state distribution vector $\pi = (\pi_1, \pi_2, \ldots, \pi_N)$ and a state transition probability matrix $A = \{a_{ij} \mid 1 \le i, j \le N\}$, in which $\pi_i$ is the probability of $S_i$ at time $t = 0$ and $a_{ij}$ is the state transition probability of going from state $S_i$ to state $S_j$. Moreover, if observations composed of $M$ discrete symbols $o_1, o_2, \ldots, o_M$ are considered, the finite set of probability distributions $B = \{b_j(q) \mid 1 \le j \le N, 1 \le q \le M\}$, with $b_j(q)$ being the probability of observing $o_q$ given the state $S_j$, represents the random processes associated with the states. Usually, to characterize an HMM, the number of states $N$ and the number of observation symbols $M$ also have to be decided, besides specifying the parameters $\pi$, $A$ and $B$.
In order to acquire an efficient estimation of HMM model during the training phase and an optimal decoding procedure of the estimated HMM model during the recognition phase, three problems need to be taken care of [51]:
(1) Given the observation sequence $O = (o_1, o_2, \ldots, o_T)$, how is the probability $p(O \mid \lambda)$ to be evaluated?
(2) Given an HMM and an observation sequence, how is the optimal (most likely) state sequence in the model that produces the observation to be determined?
(3) How are the model parameters $\lambda$ to be adjusted so as to maximize $p(O \mid \lambda)$?
For the first problem, the forward and backward recursive algorithms have been proven to be efficient [52]. For the third problem, the Baum-Welch method [45-47] offers a locally optimal solution, since an explicit solution for the model is difficult to compute.
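For the first problem, the forward recursion can be sketched as below for a discrete-observation HMM. This is the standard textbook form, given here only as an illustration; the function name, the list-of-lists layout of the matrices and the use of symbol indices for observations are assumptions of the sketch. For long sequences a practical implementation would scale the forward variables or work in the log domain to avoid underflow.

```python
from typing import List, Sequence


def forward_probability(
    pi: Sequence[float],
    A: Sequence[Sequence[float]],
    B: Sequence[Sequence[float]],
    obs: Sequence[int],
) -> float:
    """Return p(O | model) by the forward recursion.

    pi[i]   : probability of starting in state i
    A[i][j] : probability of moving from state i to state j
    B[j][q] : probability of observing symbol q in state j
    obs     : observation sequence as symbol indices
    """
    if not obs:
        raise ValueError("observation sequence must not be empty")
    n = len(pi)
    # Initialization: alpha_1(i) = pi_i * b_i(o_1)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    # Induction: alpha_{t+1}(j) = (sum_i alpha_t(i) * a_ij) * b_j(o_{t+1})
    for symbol in obs[1:]:
        alpha = [
            sum(alpha[i] * A[i][j] for i in range(n)) * B[j][symbol]
            for j in range(n)
        ]
    # Termination: sum over the final forward variables
    return sum(alpha)
```

The recursion costs on the order of $N^2 T$ multiplications, compared with the exponential cost of summing over every possible state sequence.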
For the second problem, the Viterbi algorithm proposed in [53] has been proven effective for acquiring an optimal state sequence. The score function $\delta_t(i)$ is defined as in Eq. (2-1), given the observation sequence $O = (o_1, o_2, \ldots, o_T)$.
This iterative procedure is essentially dynamic programming, and the state sequence that has the maximum likelihood of generating the given observation sequence is found by keeping track of the states that maximize Eq. (2-1). An array $\psi_t(j)$ is used to store the predecessor state of state $j$ at time $t$. The steps of the Viterbi algorithm are as follows:
(1) Initialization
Through the recursive step of this algorithm, the optimal sequence of states is eventually obtained.
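The Viterbi search can be sketched as follows in its standard textbook form (initialization, recursion with a predecessor array, termination and backtracking). The sketch is illustrative and is not claimed to reproduce the author's equations line by line; the names `delta` and `psi`, the function name and the data layout are assumptions. A practical implementation would usually work with log probabilities.

```python
from typing import List, Sequence, Tuple


def viterbi(
    pi: Sequence[float],
    A: Sequence[Sequence[float]],
    B: Sequence[Sequence[float]],
    obs: Sequence[int],
) -> Tuple[List[int], float]:
    """Return the most likely state sequence and its probability."""
    if not obs:
        raise ValueError("observation sequence must not be empty")
    n = len(pi)
    # Initialization: best score of ending in state i after the first symbol
    delta = [pi[i] * B[i][obs[0]] for i in range(n)]
    psi: List[List[int]] = []  # psi[t-1][j] = best predecessor of j at time t
    # Recursion: extend the best partial path into every state j
    for symbol in obs[1:]:
        new_delta, back = [], []
        for j in range(n):
            best_i = max(range(n), key=lambda i: delta[i] * A[i][j])
            back.append(best_i)
            new_delta.append(delta[best_i] * A[best_i][j] * B[j][symbol])
        delta = new_delta
        psi.append(back)
    # Termination: pick the best final state
    last = max(range(n), key=lambda i: delta[i])
    # Backtracking: follow the stored predecessors from the end to the start
    path = [last]
    for back in reversed(psi):
        path.append(back[path[-1]])
    path.reverse()
    return path, delta[last]
```

For a 3-state left-to-right model in which state $i$ mostly emits symbol $i$, the observation sequence 0, 0, 1, 2, 2 is decoded as the state sequence 0, 0, 1, 2, 2.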
2.2 Speaker Adaptation
Automatic speech recognition systems can generally be classified as either speaker-independent (SI) or speaker-dependent (SD), depending on how speech samples are collected during system construction. An SI system typically collects speech samples from as large a population of speakers as possible, whereas an SD system collects a large amount of sample data from possibly just one designated speaker. In general, a well-trained SD model achieves better performance than an SI model in recognizing the speech of that specific speaker. However, when the training data available for the SD model is insufficient, this superiority no longer holds. This is where speaker-adaptive (SA) techniques, sometimes referred to as model-based adaptation techniques, come into play: they adapt a full SI model toward an SD one and achieve SD-like performance while requiring only a small fraction of the speaker-specific training data. When a new speaker uses such an adaptive system, the parameters of the HMMs are updated with speech data obtained from this speaker. Through speaker adaptation, recognition performance can be significantly improved for outlier speakers, such as non-native speakers or others not well represented in the SI training set.
Generally speaking, speaker adaptation can be carried out in either supervised or unsupervised mode, depending on whether the transcription of the speaker-specific adaptation data is known before the adaptation procedure is performed [54]. It is said to operate in batch mode if all adaptation data acquired from a new speaker is fed into the system before the final adapted system is produced and put to work, or in incremental mode if the adaptation data is continually fed in while the system is already at work [54].
Currently there are mainly three categories of speaker adaptation techniques:
(1) Maximum a posteriori (MAP) adaptation, representative of Bayesian-based adaptation.
(2) Maximum likelihood linear regression (MLLR) adaptation, representative of transformation-based adaptation.
(3) Eigenvoice adaptation.
Before the advent of the eigenvoice approach in 2000, MAP and MLLR adaptation were the most commonly used techniques for speaker adaptation, and in practice they are still at work in almost all speech recognition systems today. The three speaker adaptation schemes are described in the following subsections.
2.2.1 Bayesian-based Adaptation
In the early 1990s, Lee, Lin and Juang reported speaker adaptation for HMMs with continuous-density parameters (CDHMM) [55], in which parameter estimation was accomplished by the segmental k-means algorithm developed in their earlier research on HMM parameter estimation/training [56, 57]. In these works, speaker adaptation of CDHMM parameters is formulated as a Bayesian learning procedure, in which prior information is involved in computing the posterior $P(\lambda \mid O)$ through Bayes' theorem, where $\lambda$ denotes the model parameters and $O$ is the sequence of observations. On this basis, Gauvain and Lee presented MAP adaptation in 1994, based on the maximum a posteriori estimate of the HMM parameters [58]. MAP adaptation is thus Bayesian-based and offers a framework for incorporating newly acquired speaker-specific data into the existing models.
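Assume that the CDHMM parameters are characterized by the parameter vector $\lambda = (w_{ik}, \mu_{ik}, \Sigma_{ik})$, where $w_{ik}$, $\mu_{ik}$ and $\Sigma_{ik}$ are the mixture gain, mean vector and covariance matrix of the $k$-th mixture component of the $i$-th state, respectively. The parameter vector $\lambda$ is a random vector. Prior knowledge about the random vector is available and characterized by a prior probability density function $p(\lambda)$, where $\lambda$ is to be determined as the input sequence is observed. Let $Y = (y_1, \ldots, y_T)$ be a given set of $T$ observations. The MAP estimate for $\lambda$ is defined as
$\lambda_{MAP} = \arg\max_{\lambda} [\, p(\lambda \mid Y) \,]$. (2-10)
The MAP estimate for $\lambda$ is obtained by solving this maximization. By Bayes' theorem the posterior $p(\lambda \mid Y)$ is proportional to $p(Y \mid \lambda)\, p(\lambda)$, the normalizing term $p(Y)$ being independent of $\lambda$, so Eq. (2-10) can be rewritten as follows:
$\lambda_{MAP} = \arg\max_{\lambda} [\, p(Y \mid \lambda)\, p(\lambda) \,]$. (2-13)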
To accomplish the estimation of the model parameter vector $\lambda$, the well-established segmental k-means algorithm can be used, executed as an iterative process as follows:
(1) Obtain the optimal state segmentation of a given observation sequence $Y$, based on a given model, i.e.,
$\hat{s} = \arg\max_{s} P(Y, s \mid \lambda)$. (2-14)
(2) Based on the optimal state sequence $\hat{s}$, find the MAP estimate
$\hat{\lambda} = \arg\max_{\lambda} P(Y, \hat{s} \mid \lambda)\, P(\lambda)$. (2-15)
(3) Iterate from (1) until some predefined equilibrium is reached.
Assume that the mean $\mu$ is random with a prior distribution $P_0(\mu)$ and the variance $\sigma^2$ is known and fixed; then the conjugate prior [59, 60] for $\mu$ is also a Gaussian distribution, with mean $\tilde{\mu}$ and variance $\tilde{\sigma}^2$, as shown in [61]. If the conjugate prior for the mean is substituted into Eq. (2-13), the MAP estimate of the adapted parameter, as derived in [61], appears as a weighted average of the prior mean $\tilde{\mu}$ and the mean $\bar{y}_k$ of the adaptation observation data:
$\hat{\mu}_k = \frac{N_k \tilde{\sigma}^2}{\sigma^2 + N_k \tilde{\sigma}^2}\, \bar{y}_k + \frac{\sigma^2}{\sigma^2 + N_k \tilde{\sigma}^2}\, \tilde{\mu}$, (2-16)
where $N_k$ is the total number of training samples observed for the corresponding recognition unit with the $k$-th Gaussian and $\bar{y}_k$ is the sample mean with the $k$-th Gaussian.
Let $\tau = \sigma^2 / \tilde{\sigma}^2$ and let the prior mean $\tilde{\mu}$ be replaced by the mean parameter of the initial model with the $k$-th Gaussian, $\mu_k$; Eq. (2-16) can then be rewritten as
$\hat{\mu}_k = \frac{N_k}{N_k + \tau}\, \bar{y}_k + \frac{\tau}{N_k + \tau}\, \mu_k$, (2-17)
where $\tau$ is a parameter which gives the bias between the maximum likelihood estimate of the mean from the data and the prior mean. That is, $\tau$ is a prior density parameter that controls the balance between the prior knowledge and the adaptation data.
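The weighted average of Eq. (2-17) can be illustrated with a few lines of Python. The sketch assumes that the frames assigned to the $k$-th Gaussian are already known, treats each dimension of the mean vector independently, and calls the balance parameter `tau`; the function name and the example values are hypothetical.

```python
from typing import List, Sequence


def map_adapt_mean(
    prior_mean: Sequence[float],
    frames: Sequence[Sequence[float]],
    tau: float,
) -> List[float]:
    """MAP update of one Gaussian mean as a weighted average.

    prior_mean : mean of the initial (speaker-independent) model
    frames     : adaptation vectors assigned to this Gaussian
    tau        : balance between prior knowledge and adaptation data
    """
    if tau < 0:
        raise ValueError("tau must be non-negative")
    n = len(frames)
    if n == 0:
        return list(prior_mean)  # no data observed: the mean stays unadapted
    dim = len(prior_mean)
    sample_mean = [sum(f[d] for f in frames) / n for d in range(dim)]
    weight = n / (n + tau)  # weight of the data; the prior gets tau / (n + tau)
    return [weight * y + (1.0 - weight) * m
            for y, m in zip(sample_mean, prior_mean)]
```

With ten adaptation frames and `tau = 10` the adapted mean lies halfway between the prior mean and the sample mean; as the number of frames grows the estimate approaches the sample mean, and with no frames it stays at the prior mean, which is why Gaussians unseen in the adaptation data remain unchanged.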
Note, however, that the data available for adaptation is often quite limited and most likely covers only a small portion of the speech patterns in the HMMs, which implies that many HMM parameters will not be adjusted, owing to the nature of Bayesian-based adaptation. As a result, vector field smoothing (VFS) was proposed as a supplement for broadening the extent of adaptation in the HMM parameter vector space [62-65].
The rationale behind VFS adaptation is that, by exploiting the spatial coherence of vector distributions in the HMM, an unadapted HMM parameter vector may be “purposely” adjusted in accordance with the MAP-adapted vectors nearby.
To be specific, consider an unadjusted parameter vector $\mu_j$ and a set of MAP-adapted vectors $\hat{\mu}_k$ whose initial counterparts $\mu_k$ lie in the vicinity of $\mu_j$ in the HMM vector space. The amount of MAP adaptation applied to $\mu_k$ is referred to as the transfer vector $v_k$,
$v_k = \hat{\mu}_k - \mu_k$. (2-18)
Given the adapted vectors around it, how much adaptation to $\mu_j$ should be expected? A weighted average of the $v_k$'s, as shown in Eq. (2-19), would be a quite reasonable estimate, where $f$ denotes the weight control parameter.
A typical VFS adaptation thus comprises three steps:
(1) calculation of transfer vectors for all MAP-adapted parameter vectors by Eq. (2-18),
(2) interpolation of transfer vectors for adapting the unadjusted vectors by Eq. (2-19),
(3) smoothing.
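The first two steps can be sketched as follows. Because the weighting formula is not reproduced here, the sketch assumes one commonly used choice, a weight of exp(-d/f) where d is the Euclidean distance between the unadjusted vector and the initial position of an adapted vector; that weighting, the function names and the example values are assumptions of the sketch and may differ from the author's Eq. (2-19). The final smoothing step is omitted.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def transfer_vectors(initial: Sequence[Vector], adapted: Sequence[Vector]) -> List[List[float]]:
    """Step (1): transfer vector = MAP-adapted vector minus its initial vector."""
    return [[a - m for a, m in zip(a_vec, m_vec)]
            for m_vec, a_vec in zip(initial, adapted)]


def interpolate_transfer(target: Vector, initial: Sequence[Vector],
                         transfers: Sequence[Vector], f: float) -> List[float]:
    """Step (2): distance-weighted average of the neighbors' transfer vectors."""
    if f <= 0:
        raise ValueError("f must be positive")
    if not initial:
        raise ValueError("at least one adapted neighbor is required")
    # Assumed weighting: closer neighbors count more, controlled by f
    weights = [
        math.exp(-math.sqrt(sum((t - m) ** 2 for t, m in zip(target, m_vec))) / f)
        for m_vec in initial
    ]
    total = sum(weights)
    if total == 0.0:
        raise ValueError("all weights underflowed; increase f")
    return [sum(w * v[d] for w, v in zip(weights, transfers)) / total
            for d in range(len(target))]


def vfs_adapt(target: Vector, initial: Sequence[Vector],
              adapted: Sequence[Vector], f: float = 1.0) -> List[float]:
    """Shift an unadapted vector by the interpolated transfer vector."""
    shift = interpolate_transfer(target, initial, transfer_vectors(initial, adapted), f)
    return [t + s for t, s in zip(target, shift)]
```

When two adapted neighbors are equally distant from the unadjusted vector, their transfer vectors are simply averaged; a larger `f` flattens the weights, so that more distant neighbors contribute more.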
The composite MAP-VFS adaptation has been proven to be more robust than MAP adaptation alone in recognition performance when given the same limited amount of adaptation data. Still, there is room for enhancing MAP-VFS when the quality of the MAP adaptation is in question, an issue addressed in Chap. 5.