
Once the speech signal has been acquired, it is not used directly to recognize a speaker, because the raw waveform is computationally expensive to process and is a poorly structured representation. We must therefore extract the features hidden in the speech signal, which makes feature extraction an essential process in speech recognition systems. The popular and useful feature extraction approaches focus on the spectrum of the speech signal, and most of the proposed speaker recognition systems use either the mel-frequency cepstral coefficients (MFCCs) or the linear predictive cepstral coefficients (LPCCs) as feature vectors. MFCCs are calculated from the energy accumulated in a bank of frequency filters whose ranges are set according to the mel scale [3], while LPCCs are derived from linear predictive coding.
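As a rough illustration of how the mel scale determines the filter-bank ranges, the sketch below places filter center frequencies at equal steps on the mel axis. The conversion formula (2595 log10(1 + f/700)) is one commonly used approximation and is an assumption here, as are the function names, the number of filters, and the frequency range; none of them are taken from this thesis.

```python
import math

def hz_to_mel(f_hz):
    """Convert Hz to mels with a common approximation (assumed, not from the thesis)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filter_centers(n_filters, f_low, f_high):
    """Center frequencies (Hz) of n_filters bands equally spaced on the mel axis."""
    m_low, m_high = hz_to_mel(f_low), hz_to_mel(f_high)
    step = (m_high - m_low) / (n_filters + 1)
    return [mel_to_hz(m_low + step * (k + 1)) for k in range(n_filters)]

# Hypothetical setup: 10 filters between 0 Hz and 8 kHz.
centers = mel_filter_centers(10, 0.0, 8000.0)
```

Because the steps are equal in mels, the centers are densely packed at low frequencies and spread out at high frequencies, which is the property the MFCC filter bank relies on.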

Further, some useful pre-processing can be applied before the features are extracted.

For example, several recent papers have used source information in speaker identification systems [4],[5]. Videos of vocal fold vibration [6] show large variations in the movement of the vocal folds from one individual to another.

For certain speakers, the vocal folds may close completely, while for others, the folds may never reach full closure. The manner and speed with which the vocal folds close also vary across speakers. For example, the folds may close in a zipper-like fashion, or may close along their whole length at approximately the same time.

Differences in fold vibration correspond to differences in the time-varying area of the slit-like opening between the folds, referred to as the glottis, and therefore in the volume velocity airflow through the glottis. The flow may be smooth, as when the folds never close completely, corresponding perhaps to a “soft” voice, or discontinuous, as when they close rapidly, giving perhaps a “hard” voice. The flow at the glottis may also be turbulent, as when air passes near a small portion of the folds that remains partly open. Turbulence at the glottis is referred to as aspiration and, when it occurs during vocal fold vibration, can result in a “breathy” voice. In order to determine quantitatively whether such glottal characteristics are speaker dependent, we must extract features such as the opening and closing of the vocal folds, the general shape of the glottal flow, and the extent of aspiration at the vocal folds.

This thesis describes a technique to automatically estimate and model the glottal flow derivative waveform from voiced speech, and uses the resulting parameters for speaker recognition. A block diagram of the approach is given in Fig. 1-3. Our first goal, estimating the derivative of the glottal flow rather than the glottal flow itself, stems from the availability of pressure measurements of the speech waveform, pressure being the derivative of volume velocity airflow. Estimation of the glottal flow derivative relies on inverse filtering the speech waveform with an estimate of the vocal tract transfer function.

This estimation is typically performed during the glottal closed phase within which the vocal folds are in a closed position and there is no dynamic source/vocal tract interaction.
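The inverse-filtering idea can be sketched as follows, under the simplifying assumption that the vocal tract is an all-pole filter with known predictor coefficients. The coefficient values, the test excitation, and the function names are hypothetical and are not taken from the thesis; in practice the coefficients would come from the vocal tract estimate made over the closed phase.

```python
def all_pole_filter(excitation, a):
    """Synthesize s[n] = e[n] + sum_k a[k] * s[n-1-k] (assumed all-pole vocal tract)."""
    s = []
    for n, e in enumerate(excitation):
        acc = e
        for k, ak in enumerate(a):
            if n - 1 - k >= 0:
                acc += ak * s[n - 1 - k]
        s.append(acc)
    return s

def inverse_filter(speech, a):
    """Undo the all-pole filter: e[n] = s[n] - sum_k a[k] * s[n-1-k]."""
    out = []
    for n, sn in enumerate(speech):
        pred = 0.0
        for k, ak in enumerate(a):
            if n - 1 - k >= 0:
                pred += ak * speech[n - 1 - k]
        out.append(sn - pred)
    return out

# Hypothetical second-order vocal tract (stable: pole radius is sqrt(0.8)).
a = [1.3, -0.8]
# Hypothetical excitation standing in for the glottal flow derivative.
source = [1.0, 0.0, 0.0, -0.5, 0.0, 0.0, 1.0, 0.0, 0.0, -0.5]
speech = all_pole_filter(source, a)
recovered = inverse_filter(speech, a)
```

When the filter estimate matches the true vocal tract, as it does by construction here, the recovered signal equals the excitation; with real speech, the quality of the recovered source depends on how well the transfer function was estimated.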

Wang et al. [7] and Cummings and Clements [8], for example, perform a sliding covariance analysis with a one-sample shift, using a function of the linear prediction error to identify the glottal closed phase. This method, which relies on the prediction error, has been observed to have difficulty when the vocal folds do not close completely or when the folds open slowly. The approach of this thesis instead estimates the glottal closed phase with the help of a digital simulation method of the vocal tract system [9]. It uses vocal tract formant modulation, which is predicted by Shinji Maeda to vary more slowly in the glottal closed phase than in the open phase and to respond quickly to a change in glottal area. A “stationary” region of formant modulation gives a closed-phase time interval, over which we estimate the vocal tract transfer function; a stationary region is present even when the vocal folds remain partly open. The glottal flow derivative waveform that results from inverse filtering is characteristic of the individual speaker.
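One simple way to picture the search for a “stationary” region is to scan a track of formant frequency estimates, one per sliding analysis window, and keep the longest run in which the estimate barely changes. The sketch below does only that; the tolerance, the example track, and the function name are hypothetical, and the thesis's actual stationarity criterion may differ.

```python
def longest_stationary_region(formant_track, tol_hz):
    """Return inclusive (start, end) indices of the longest run in which
    consecutive formant estimates differ by at most tol_hz."""
    best = (0, 0)
    start = 0
    for n in range(1, len(formant_track)):
        if abs(formant_track[n] - formant_track[n - 1]) > tol_hz:
            start = n  # the run is broken; begin a new one here
        if n - start > best[1] - best[0]:
            best = (start, n)
    return best

# Hypothetical first-formant estimates (Hz) over one pitch period.
track = [520.0, 560.0, 501.0, 500.0, 502.0, 501.0, 540.0, 590.0]
region = longest_stationary_region(track, tol_hz=5.0)
```

For this track the selected region covers indices 2 through 5, where the estimate stays within a few hertz; that interval would then serve as the closed-phase estimate over which the vocal tract transfer function is computed.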

Fig. 1-3 : Block Diagram of Glottal Flow Derivation (block labels recovered from the figure: “To simulate the 11 Vowels’”, “To extract 11 vowels’ Phonemes”, “Speaker Recognition”)

After the glottal flow derivative has been extracted, the features are applied to a speaker verification task using MFCC feature vectors and a Gaussian mixture model (GMM). A speaker model, which represents each speaker in the speaker recognition system, is built in the training phase and then used for speaker matching in the test phase. Various modeling approaches exist, including the artificial neural network (ANN) [7],[10], vector quantization (VQ) [11],[12], the Gaussian mixture model (GMM) [13],[14], the hidden Markov model (HMM) [15],[16],[17], and so on. In 1995, Reynolds demonstrated that the GMM-based classifier works well in text-independent speaker recognition even with speech features that contain rich linguistic information, such as MFCCs [18]. A GMM provides a probability model of the underlying sounds of a speaker’s voice. It uses several Gaussian density functions to model a speaker, and each density function has its own mean and covariance. For a feature vector denoted as $\vec{x}_j$, the mixture density for each speaker $s$ is denoted as

$$p(\vec{x}_j \mid \lambda_s) = \sum_{i=1}^{M} p_i^s \, b_i^s(\vec{x}_j)$$

where $p_i^s$ is the weight of the $i$-th mixture component and $b_i^s$ is its Gaussian density.

The density is a weighted linear combination of $M$ component uni-modal Gaussian densities, each parameterized by a mean vector $\vec{\mu}_i^s$ and a covariance matrix $\Sigma_i^s$. Collectively, the parameters of a speaker’s density model (mixture weights, mean vectors, and covariance matrices) are denoted as $\lambda_s = \{p_i^s, \vec{\mu}_i^s, \Sigma_i^s\}$, $i = 1, \ldots, M$, and maximum likelihood (ML) estimates of the model parameters are obtained by using the expectation maximization (EM) algorithm. Therefore, for an utterance $X = \{X_1, \ldots, X_N\}$ and a reference group of speakers $\{S_1, S_2, \ldots, S_S\}$ represented by models $\{\lambda_1, \lambda_2, \ldots, \lambda_S\}$, the identification is executed by the maximum likelihood classification rule

$$\hat{s} = \arg\max_{1 \le s \le S} \, p(X \mid \lambda_s)$$

which decides who the candidate speaker is [19].
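The mixture density and the maximum likelihood rule can be sketched in a few lines. This is a minimal illustration under two assumptions that the text does not state: the covariance matrices are diagonal, and the frames of an utterance are treated as independent, so that the utterance log-likelihood is the sum of the per-frame log densities. The two speaker models, the utterance, and all function names are hypothetical.

```python
import math

def log_gaussian_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at vector x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def gmm_log_density(x, model):
    """log of sum_i weight_i * gaussian_i(x), evaluated with log-sum-exp."""
    terms = [math.log(w) + log_gaussian_diag(x, mean, var)
             for w, mean, var in model]
    peak = max(terms)
    return peak + math.log(sum(math.exp(t - peak) for t in terms))

def identify(utterance, models):
    """Return (index of best model, per-model utterance log-likelihoods)."""
    scores = [sum(gmm_log_density(x, m) for x in utterance) for m in models]
    best = max(range(len(models)), key=lambda s: scores[s])
    return best, scores

# Hypothetical models: lists of (weight, mean vector, variance vector).
speaker_a = [(0.6, [0.0, 0.0], [1.0, 1.0]), (0.4, [2.0, 2.0], [0.5, 0.5])]
speaker_b = [(0.5, [5.0, 5.0], [1.0, 1.0]), (0.5, [7.0, 4.0], [1.0, 2.0])]
utterance = [[0.1, -0.2], [1.9, 2.1], [0.3, 0.4]]
best, scores = identify(utterance, [speaker_a, speaker_b])
```

Working in the log domain avoids numerical underflow when many frames are multiplied together, which is why the argmax is taken over summed log densities rather than over the raw product.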

In the following, we will describe the framework of our proposed speaker recognition system briefly.

First, we choose 11 vowels from MAEDA’s vocal tract system; they are shown in Table 1-1. From the vocal tract simulation of the system, we calculate the transfer function of each vowel and use it for inverse filtering of the corresponding vowel taken from input sentences of the TIMIT database. In the next step, we choose MFCCs as our features, since the mel scale mimics human hearing, which is more sensitive to sound in the low-frequency range. After the features of each frame have been extracted, we apply the vowels of each speaker to an ML-based GMM speaker verification system to construct a model for each speaker. The glottal flow derivative method is intended to enhance the GMM within the overall recognition system and thereby reduce the system error rate. The details of MAEDA’s vocal tract system and of the overall speaker recognition system are described in Chapter 2 and Chapter 3, respectively.
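To make the model-construction step more concrete, the sketch below runs the EM algorithm for a two-component GMM on one-dimensional data. It is a toy illustration only: the real features are multi-dimensional MFCC vectors, and the initialization, iteration count, variance floor, and data used here are all assumptions rather than the settings of this thesis.

```python
import math

def normal_pdf(x, mean, var):
    """Density of a 1-D Gaussian."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def em_gmm_1d(data, n_iter=50, var_floor=1e-6):
    """Fit a two-component 1-D GMM by EM; returns (weights, means, variances)."""
    n = len(data)
    overall_mean = sum(data) / n
    overall_var = sum((x - overall_mean) ** 2 for x in data) / n
    weights = [0.5, 0.5]
    means = [min(data), max(data)]          # simple assumed initialization
    variances = [overall_var, overall_var]
    for _ in range(n_iter):
        # E-step: posterior probability of each component for each sample.
        resp = []
        for x in data:
            joint = [w * normal_pdf(x, m, v)
                     for w, m, v in zip(weights, means, variances)]
            total = sum(joint)
            resp.append([j / total for j in joint])
        # M-step: re-estimate weights, means, and variances.
        for i in range(2):
            n_i = sum(r[i] for r in resp)
            weights[i] = n_i / n
            means[i] = sum(r[i] * x for r, x in zip(resp, data)) / n_i
            var_i = sum(r[i] * (x - means[i]) ** 2 for r, x in zip(resp, data)) / n_i
            variances[i] = max(var_floor, var_i)
    return weights, means, variances

# Hypothetical training data drawn around two well-separated values.
data = [-2.1, -1.9, -2.0, -2.2, -1.8, 3.0, 3.1, 2.9, 3.2, 2.8]
weights, means, variances = em_gmm_1d(data)
```

On this data the two estimated means settle near the two cluster centers and the weights near one half each. The variance floor is a common safeguard against a component collapsing onto a single sample.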

Table 1-1 : 11 Vowels with Inverse Filtering

Index   0     1   2   3   4   5   6   7   8   9   10  11
Vowel   None  iy  ey  eh  ah  aa  ao  oh  uw  iw  ew  Oe

Organization of Thesis

This thesis is organized as follows. In Chapter 2, we review MAEDA’s digital simulation method of the vocal-tract system. In Chapter 3, we describe the proposed structure of the speaker recognition system, including MFCCs, glottal flow derivation with inverse filtering, and the GMM classifier. In Chapter 4, we describe the database used and present experimental results to verify that the glottal flow derivative conveys speaker identity information and to evaluate the performance of our speaker recognition system.

Finally, we give the conclusions of this thesis and discuss future work in Chapter 5.

Chapter 2

Framework of the Vocal-Tract System in Speaker Recognition System

2.1 Introduction

This chapter first qualitatively describes the properties of the components of glottal flow and its derivative, then briefly reviews Shinji MAEDA’s theory for simulating the vocal tract and the associated source/vocal tract interaction, and ends with a glottal flow derivative model for extracting features to be used in speaker recognition.
