Spectro-temporal Smoothed Auditory Spectra for Robust Speaker Recognition

National Chiao-Tung University

Institute of Communication Engineering

Master's Thesis

Student: Ting-Han Lin

Advisor: Dr. Tai-Shih Chi


A Thesis

Submitted to Institute of Communication Engineering

College of Electrical and Computer Engineering

National Chiao-Tung University

In Partial Fulfillment of the Requirements

for the Degree of

Master of Science in

Communication Engineering

July 2010

Hsin-Chu, Taiwan, Republic of China

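Speaker Recognition with Spectro-temporally Smoothed Auditory Spectra

Student: Ting-Han Lin    Advisor: Dr. Tai-Shih Chi

Institute of Communication Engineering, National Chiao-Tung University

Chinese Abstract

The recognition rate of conventional speaker recognition systems is easily degraded by additive and convolutional noise. This is because conventional features express only the lowest-level cues of an utterance, whereas higher-level cues have been shown to be more robust to noise. This thesis uses speech features extracted by an auditory model, together with spectro-temporal modulation properties, to process and capture higher-level cues, and applies them to speaker recognition in noise. A text-independent, closed-set speaker recognition system is used and tested on the TIMIT and GRID corpora. The experimental results show that the proposed features give substantially higher recognition rates than conventional MFCCs under all SNR conditions, and that spectro-temporal modulation filtering outperforms the recently proposed ANTCCs at low SNRs.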


Spectro-temporal Smoothed Auditory Spectra for

Robust Speaker Recognition

Student: Ting-han Lin Advisor: Dr. Tai-shih Chi

Institute of Communication Engineering

National Chiao-Tung University

English Abstract

The performance of conventional speaker recognition systems is severely compromised by interference such as additive or convolutional noise. High-level information about the speaker is considered to provide more robust cues for recognizing speakers. This thesis proposes auditory-model-based spectral features, the auditory cepstral coefficients (ACCs), and a spectro-temporal modulation filtering (STMF) process that captures high-level information for robust speaker recognition. Text-independent closed-set speaker recognition experiments are conducted on the TIMIT and GRID corpora to evaluate the robustness of ACCs and the benefits of the STMF process. Experimental results show that ACCs significantly outperform conventional MFCCs in all SNR conditions. STMF also outperforms the recently developed ANTCCs in low-SNR conditions.

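Acknowledgement

For the completion of this thesis, I first thank Professor Tai-Shih Chi for his patient teaching over the past two years. He is like a treasure house: during my two years as a master's student I gained a wealth of knowledge from him. He not only guided our studies and research with patience but also gave us advice on how to treat people and on character and attitude, polishing each of his students inside and out into a piece of jade and allowing me to transform myself. I am very happy to have been your student ☺. I thank the senior lab members 大師, 阿郎 and NICK for their guidance in coursework and research; my classmates 藍霙, 勝哥, 大樹 and 禮偉 for researching, striving, playing and growing together; the junior members 華山, 雞排, 文中 and 靖雯 for bringing new joy to the lab; the very nice 小玄子, 谷嶸 and nino of the IT LAB; and the seniors of the Speech LAB, 江振宇, 楊智合 and 黃信德, who kindly helped in many ways. I also thank my many classmates and friends for their help. Thanks as well to my girlfriend pico, who kept me company when I was most restless and bored and cheered me on with her happy smile. Finally, I thank my parents, who have always quietly supported me so that I could reach today's achievement. Dad and Mom, I love you. I also thank my little sister for her company, which let her homebody brother pass these two years happily. Thank you all for being with me over these two years; my life is brighter because of you. Ting-Han, summer 2010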

Chinese Abstract
English Abstract
Acknowledgement
List of Figures
List of Tables
Chapter 1 Introduction
1.1 Introduction
1.2 Motivation
1.3 Outline of this thesis
Chapter 2 Speaker Recognition Systems
2.1 Introduction to Speaker Recognition Systems
2.2 Gaussian Mixture Models
2.3 Maximum A Posteriori Adapted Gaussian Mixture Models
Chapter 3 Auditory Model and Features
3.1 The Motivated Use of Auditory Model
3.2 Cochlear Module and Auditory cepstral coefficients
3.3 Cortical Module and Spectro-temporal Modulation Filtering
Chapter 4 Evaluation
4.1 Database and Evaluation Measurements
4.2.1 Results in GMM
4.2.2 Results in MAP-GMM
4.3 Discussions
Chapter 5 Conclusion and Future Works

List of Figures

FIGURE 1-1 Structure of Speaker Recognition System.
FIGURE 1-2 A summary of features from viewpoint of their physical interpretation.
FIGURE 2-1 Speaker recognition systems.
FIGURE 2-2 LBG algorithms.
FIGURE 3-1 Hearing pathway.
FIGURE 3-2 The anatomy of the ear.
FIGURE 3-3 The basilar membrane diagram and the characteristic frequency at the basilar membrane.
FIGURE 3-4 The firing rate of auditory neuron.
FIGURE 3-5 Stages of the early cochlear module.
FIGURE 3-6 An example of moving ripple stimulus.
FIGURE 3-7 The response for 8 modeled neurons in the cortex.
FIGURE 3-8 Rate-scale representation from the A1 module.
FIGURE 3-9 Auditory spectrograms and rate-scale representations of clean speech and white noise.
FIGURE 3-10 Noise suppression by STMF.
FIGURE 4-1 The 6 characters in GRID corpus.
FIGURE 4-2 Average 0~15 dB recognition rates (in %) for GRID corpus.
FIGURE 4-3 Average recognition rates (in %) of 70 people in TIMIT corpus.
FIGURE 4-4 Average recognition rates (in %) of GRID corpus.
FIGURE 4-5 Different mixture UBM.
FIGURE 4-6 Different training sentence.
FIGURE 4-7 Average recognition rates with various adaptations (in %).
FIGURE 4-8 Average recognition rates (in %) of 70 people.

List of Tables

Table 1 Correct recognition rates (in %) with different STMF (δ, α) parameters under various SNRs of the pink noise.
Table 2 Correct recognition rates (in %) of 70 people in TIMIT corpus.
Table 3 Correct recognition rates (in %) of GRID corpus.
Table 4 Correct recognition rates (in %) of 70 people in TIMIT corpus. (UBM stands for MAP-GMM)
Table 5 Correct recognition rates (in %) of GRID corpus.


Chapter 1 Introduction

1.1 Introduction

In modern life, identity authentication technologies are commonly used in entrance security systems and in portable electronic devices, where users enter a password or have a magnetic/ID card scanned to confirm their identity. With the progress of science and technology, unique characteristics of the human body, such as fingerprints, retinas, facial features and voices, are being studied and used in advanced identity authentication systems. These biometric verification/recognition technologies are effective against intruders who steal passwords or ID cards from authorized users.

In daily life, people usually have no problem recognizing a caller from his or her voice alone over a communication channel. This is a perfect example of a naive speaker recognition task that people perform every day. Speaker recognition algorithms have been developed over the last few decades. The basic block diagram of a speaker recognition system is shown in Figure 1-1. Such a system consists of three main modules: feature extraction, speaker models and the recognizer. Feature parameters extracted from the input speech are compared with the stored models of registered speakers, which are built during training. The recognition decision is made according to certain similarity measures.

Since the characteristics of voice and speaking style are different among people, speaker recognition systems are constructed to recognize speakers by using these features. Conventional approaches adopt short-time spectral features such as Mel-frequency cepstral coefficients (MFCCs), linear predictive cepstral coefficients (LPCCs) and perceptual linear predictive (PLP) coefficients to model each speaker.


These features basically capture smoothed spectral profiles, which reflect vocal tract information of speakers, and usually yield high recognition rates in clean or matched test conditions [1]. However, the recognition performance is often significantly degraded in mismatched test conditions where speech is corrupted by either convolutional or additive noise. Therefore, the robustness of speaker recognition systems has drawn a lot of attention from researchers. As Nokia's slogan puts it, technology always stems from human nature: the most natural way for humans to verify the identity of a person is through their senses, such as vision (face recognition) and/or hearing (speaker recognition). Our purpose is to develop a robust speaker recognition system that mimics what humans do, and hopefully to make the world more convenient.


1.2 Motivation

Speech can be roughly divided into low-level and high-level information from the viewpoint of its physical interpretation, as shown in Figure 1-2. The low-level information corresponds to characteristics of the individual's vocal system, while the high-level information corresponds to characteristics of the individual's vocabulary, shaped by his or her cultural and educational background. Conventional MFCCs and LPCCs capture only the low-level vocal-tract information. The language-dependent speaking rate, on the other hand, is considered a high-level feature. High-level features are believed to be more robust, but less discriminative among speakers in clean environments [2]. It has been shown in [3] that speaker recognition accuracy can be improved by fusing high- and low-level features.

FIGURE 1-2

A summary of features from viewpoint of their physical interpretation. (ACLCLP-vol. 15, no. 5)


Automatic speaker recognition is a tough problem because of mismatches between handsets and/or channels (convolutional noise) and environmental noise (additive noise), which are two of the most prominent factors affecting the recognition rate [4]. It has been shown that almost perfect recognition is achievable for clean and well-matched speech. Therefore, researchers have focused on the problems of transducer mismatch and robustness in recent years.

To deal with the handset/channel mismatch, linear and nonlinear compensation techniques have been proposed in the feature, model and score domains [5]. Feature compensation removes the handset/channel effect from the features; examples include cepstral mean subtraction (CMS) [6], RASTA [7], discriminative feature design [8], feature mapping [9], and feature transformation methods such as feature warping [10] and short-time Gaussianization [11]. Score-domain compensation aims to remove handset-dependent biases from the likelihood ratio scores. The most prevalent methods include the H-norm [12], Z-norm [13] and T-norm [14]. Model-domain compensation adapts the model parameters to match different handsets/channels, that is, it modifies the speaker model parameters instead of the feature vectors. Examples include speaker model synthesis [15], maximum a posteriori (MAP) adaptation [16], and the combination of MLLR and PDBNN [17].

To tackle the environmental noise robustness problem, many speech enhancement techniques have been proposed, for example, spectral subtraction [18], Kalman filtering [19]. They are all based on forming a statistical estimate for removing additive noise. However, noise estimates are never perfect, which may result in removing not only the noise but also speaker-dependent components of the original speech. Other techniques focus on noise compensation, such as


combination [21].

Lippmann demonstrated that human hearing is highly robust to noise in recognition tests [22]. Presumably, human hearing analyzes every aspect of a sound (low- and high-level) to reduce the impact of noise on the accuracy of recognition tasks. Therefore, it is natural to include psycho-acoustical and neuro-physiological findings about human hearing in the development of speech processing systems to enhance their performance. Recently, several novel features based on human hearing have been proposed and used in speaker recognition systems [23]–[25].

In this study, we investigate auditory spectral features and their combination with spectro-temporal features for speaker recognition in additive noise environments. The spectral features are referred to as the auditory cepstral coefficients (ACCs), which are derived from the auditory spectrum in a way similar to MFCCs. In addition, high-level constraints are enforced by a spectro-temporal modulation filtering (STMF) process embedded in the auditory model. The two-stage auditory model produces a two-dimensional auditory spectrogram for any input speech, and then analyzes the spectro-temporal amplitude modulations of the auditory spectrogram [26]. While low-level spectral features are well preserved in the auditory spectrogram, certain high-level features, such as the speaking rate, are embedded in the spectro-temporal modulations of the speech. Therefore, adopting spectro-temporal modulation features extracted by the auditory model should enhance the robustness of a speaker recognizer, hopefully bringing it closer to that of human listeners.


1.3 Outline of this thesis

The remainder of this thesis is organized as follows. Chapter 2 reviews speaker recognition systems. Chapter 3 describes the auditory perceptual model and the proposed features. Simulation results obtained with the proposed features in noisy conditions are presented in Chapter 4. We end in Chapter 5 with conclusions and future works.


Chapter 2 Speaker Recognition Systems

In this chapter, we briefly review speaker recognition systems. The most commonly used approaches in speaker recognition systems are the Gaussian Mixture Model (GMM) [1] and the Maximum A Posteriori Adapted Gaussian Mixture Model (MAP-GMM) [27]. We introduce these two models in Sections 2.2 and 2.3.

2.1 Introduction to Speaker Recognition Systems

As human beings, we are able to recognize someone just by hearing his or her voice. Usually, a few seconds of speech are sufficient for humans to identify a familiar voice. Similarly, the automatic speaker recognition systems employ computational algorithms to recognize humans by their voices.

Speaker recognition can be divided into two different tasks, speaker identification and speaker verification, as shown in Figure 2-1. Speaker verification, or speaker authentication, is the computational task of deciding whether a speech utterance was delivered by a claimed speaker. More formally, it is the task of deciding, given a speech signal x and a hypothesized speaker S, whether x was spoken by S. This is referred to as a one-to-one decision. In the speaker identification task, on the other hand, there is no a priori identity claim. Speech from an unknown speaker is compared against the trained speech models of N known speakers, and the best matching speaker is reported as the recognition decision. This is referred to as a one-to-N decision.

Speaker recognition tasks can be further categorized into text-dependent and text-independent tasks. The difference between the two is whether the same utterance is used in the training and testing procedures. In the text-dependent task, the utterance is known to the recognition system beforehand. Undoubtedly, the text-independent task is more flexible. Furthermore, speaker identification can be a closed-set or an open-set task. In the former case, the identification system chooses the best matching speaker from the trained models, no matter how poor the match is. In the latter case, a predefined tolerance level is applied to prevent wrong recognition.

FIGURE 2-1

Speaker recognition systems. (a) Identification system: speech, feature extraction, speaker models 1 to M, select max, identified speaker. (b) Verification system: speech, feature extraction, target speaker model (+) and background model (−) combined into a score Λ; accept if Λ ≥ Θ, reject if Λ < Θ.


In this thesis, we tackle the text-independent closed-set speaker identification problem. The state-of-the-art closed-set speaker modeling methods are the Gaussian Mixture Model (GMM) and the Maximum A Posteriori Adapted Gaussian Mixture Model (MAP-GMM), which is also called the Gaussian Mixture Model-Universal Background Model (GMM-UBM). In the following sections, we introduce two commonly used statistical methods for estimating the parameters of the GMM.

2.2 Gaussian Mixture Model

The Gaussian Mixture Model (GMM) is a stochastic model composed of a finite number of multivariate Gaussian components that are mixed to fit an observed probability density function (PDF). Because the GMM can fit PDFs of arbitrary shape for the features of a speaker's voice, and because its training is simple and fast and gives good performance, it has become the default reference method in speaker recognition systems.

A GMM, denoted by λ, is characterized by its probability density function:

$$p(\vec{x} \mid \lambda) = \sum_{i=1}^{M} w_i \, N(\vec{x};\, \vec{\mu}_i, \Sigma_i) \tag{2-1}$$

where $\vec{x}$ is a D-dimensional feature vector; M is the number of Gaussian mixtures; $w_i$ is the prior probability (mixture weight) of the i-th mixture, subject to the constraint $\sum_{i=1}^{M} w_i = 1$; and

$$N(\vec{x};\, \vec{\mu}_i, \Sigma_i) = \frac{1}{(2\pi)^{D/2}\,|\Sigma_i|^{1/2}} \exp\left\{ -\frac{1}{2} (\vec{x}-\vec{\mu}_i)^{T}\, \Sigma_i^{-1}\, (\vec{x}-\vec{\mu}_i) \right\}, \quad i = 1 \ldots M,$$

is the i-th Gaussian density function with D × 1 mean vector $\vec{\mu}_i$ and D × D covariance matrix $\Sigma_i$. For numerical and computational reasons, we use only diagonal covariance matrices in this thesis. In general, estimating the parameters of a full-covariance GMM requires much more training data and is computationally expensive. Moreover, empirical evidence shows that diagonal-covariance GMMs can perform as well as or better than full-covariance GMMs [27].
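As an illustration of equation (2-1) with diagonal covariances, the following minimal Python sketch evaluates the log-density of one feature vector. The function names and the toy parameter values are hypothetical and chosen only for this example; the log-sum-exp step is a common numerical safeguard and is not described in the thesis.

```python
import math

def log_gaussian_diag(x, mean, var):
    """Log of the Gaussian density N(x; mean, diag(var))."""
    log_det = sum(math.log(v) for v in var)
    maha = sum((a - m) ** 2 / v for a, m, v in zip(x, mean, var))
    return -0.5 * (len(x) * math.log(2.0 * math.pi) + log_det + maha)

def gmm_log_pdf(x, weights, means, variances):
    """log p(x | lambda) as in (2-1), combined with log-sum-exp for stability."""
    terms = [math.log(w) + log_gaussian_diag(x, m, v)
             for w, m, v in zip(weights, means, variances)]
    peak = max(terms)
    return peak + math.log(sum(math.exp(t - peak) for t in terms))

# Hypothetical two-mixture model in two dimensions (illustrative values only).
weights = [0.4, 0.6]
means = [[0.0, 0.0], [3.0, 3.0]]
variances = [[1.0, 1.0], [0.5, 2.0]]
print(gmm_log_pdf([0.1, -0.2], weights, means, variances))
```

With a diagonal covariance, the determinant and the quadratic form reduce to sums over dimensions, which is why no matrix inversion appears in the sketch.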

Training a GMM means estimating the parameters $\lambda = \{w_i, \vec{\mu}_i, \Sigma_i\}_{i=1}^{M}$ from a given collection of training vectors. The basic approach uses Vector Quantization (VQ) to obtain the initial parameters; the maximum likelihood (ML) model parameters are then estimated via the expectation-maximization (EM) algorithm.

Vector Quantization (VQ) is one of the most efficient and useful source-coding techniques. With VQ, the scattered distribution of speech feature vectors can be partitioned and represented by codewords. The LBG algorithm [28], which is commonly used in VQ, is stated below.

LBG algorithm

• 1. Set M (the number of mixtures) = 1 and find the centroid of all feature vectors.
• 2. Split the M centroids into 2M centroids according to equation (2-2), where δ = 0.01:

$$\vec{\mu}_i^{+} = \vec{\mu}_i\,(1+\delta), \qquad \vec{\mu}_i^{-} = \vec{\mu}_i\,(1-\delta) \tag{2-2}$$

• 3. Use the K-means iterative algorithm to re-classify the feature vectors and find the new centroid of each partition, until the new centroids equal the old centroids.
• 4. Repeat steps 2 and 3 until the desired total number of mixtures is reached.
• 5. Calculate the variance $\vec{\sigma}_i^{2}$ and the weight $w_i$ according to equation (2-3), where n is the total number of feature vectors; $n_i$ is the number of feature vectors in the i-th partition; and $\sigma_{ik}^{2}$ is the variance of the k-th dimension of the i-th partition, calculated from the components $x_{jk}$ of the vectors that belong to the centroid $\mu_{ik}$:

$$w_i = \frac{n_i}{n}, \qquad \sigma_{ik}^{2} = \frac{\sum_{j} (x_{jk} - \mu_{ik})^{2}}{n_i} \tag{2-3}$$

(Flowchart: find centroid, m = 1; split each centroid, m = 2m; classify vectors; find centroids; test whether new centroid = old centroid; test whether m < M; stop.)

FIGURE 2-2

LBG algorithms.
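The five LBG steps can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions rather than the implementation used in the thesis: the function names, the `max_iter` safeguard, the handling of empty partitions and the toy data are all hypothetical, and the target codebook size is assumed to be a power of two.

```python
def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[k] for v in vectors) / n for k in range(len(vectors[0]))]

def nearest(x, centroids):
    """Index of the centroid with the smallest squared Euclidean distance to x."""
    dists = [sum((a - b) ** 2 for a, b in zip(x, c)) for c in centroids]
    return dists.index(min(dists))

def lbg(vectors, target_m, delta=0.01, max_iter=100):
    """Grow a codebook 1 -> 2 -> 4 -> ... until it holds target_m centroids."""
    centroids = [centroid(vectors)]                       # step 1
    while len(centroids) < target_m:                      # step 4
        # Step 2: split every centroid into mu*(1+delta) and mu*(1-delta).
        centroids = [[m * (1.0 + s * delta) for m in c]
                     for c in centroids for s in (1.0, -1.0)]
        # Step 3: K-means until the centroids stop changing
        # (max_iter is an added safeguard, not part of the description above).
        for _ in range(max_iter):
            labels = [nearest(x, centroids) for x in vectors]
            updated = []
            for i, c in enumerate(centroids):
                members = [x for x, l in zip(vectors, labels) if l == i]
                updated.append(centroid(members) if members else c)
            if updated == centroids:
                break
            centroids = updated
    labels = [nearest(x, centroids) for x in vectors]
    return centroids, labels

def initial_weights_and_variances(vectors, centroids, labels):
    """Step 5: w_i = n_i / n and per-dimension variances around each centroid."""
    n = len(vectors)
    weights, variances = [], []
    for i, c in enumerate(centroids):
        members = [x for x, l in zip(vectors, labels) if l == i]
        weights.append(len(members) / n)
        if members:
            variances.append([sum((x[k] - c[k]) ** 2 for x in members) / len(members)
                              for k in range(len(c))])
        else:
            variances.append([0.0] * len(c))  # empty partition: assumption of this sketch
    return weights, variances

# Toy data with two well-separated clusters (illustrative values only).
data = [[0.9, 1.0], [1.1, 1.0], [1.0, 0.9], [1.0, 1.1],
        [4.9, 5.0], [5.1, 5.0], [5.0, 4.9], [5.0, 5.1]]
codebook, labels = lbg(data, target_m=2)
weights, variances = initial_weights_and_variances(data, codebook, labels)
print(codebook, weights, variances)
```

Note that the multiplicative split in (2-2) leaves a centroid component that is exactly zero unchanged, so a practical implementation may need an additive perturbation in that case.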

Although VQ can classify the speech feature vectors by centroids, it cannot describe the size and spatial shape of the feature-vector distribution within each partition. Therefore, ML estimation of the model parameters via the EM algorithm is applied after VQ. For a set of i.i.d. feature vectors $x = \{\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_T\}$, the ML estimate of the parameters of a GMM is:

$$\lambda_{ML} = \arg\max_{\lambda}\, p(\lambda \mid x) = \arg\max_{\lambda}\, \frac{p(x \mid \lambda)\, p(\lambda)}{p(x)} = \arg\max_{\lambda}\, p(x \mid \lambda) \tag{2-4}$$

where the last equality treats the prior p(λ) as uniform.

Therefore, ML estimation finds the λ that gives the training data the highest probability. The ML parameter estimation can be accomplished iteratively via the EM algorithm. The basic idea of the EM algorithm is to begin with the initial parameters $\lambda = \{w_i, \vec{\mu}_i, \Sigma_i\}_{i=1}^{M}$ obtained from VQ and to estimate a new model $\hat{\lambda}$ such that $p(x \mid \hat{\lambda}) \ge p(x \mid \lambda)$. The new model $\hat{\lambda}$ then becomes the initial model for the next iteration, and the process is repeated until a convergence criterion is met. Generally speaking, five to ten iterations are sufficient for the parameters to converge.

1. E-Step:

$$Q(\lambda, \lambda^{k}) = E\left[\log p(x, y \mid \lambda) \,\middle|\, x, \lambda^{k}\right]$$

2. M-Step:

$$\arg\max_{\lambda}\, Q(\lambda, \lambda^{k})$$

Here y denotes the unobserved data (for a GMM, which mixture generated each vector) and $\lambda^{k}$ is the model at iteration k.

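Given a training feature vector set $x = \{\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_T\}$, the following re-estimation formulae, which guarantee monotonic convergence to a local maximum, are used in each EM iteration:

Posterior probability:

$$p(i \mid \vec{x}_t, \lambda) = \frac{w_i\, N(\vec{x}_t;\, \vec{\mu}_i, \Sigma_i)}{\sum_{j=1}^{M} w_j\, N(\vec{x}_t;\, \vec{\mu}_j, \Sigma_j)}, \quad i = 1 \ldots M \tag{2-5}$$

Weight:

$$\hat{w}_i = \frac{1}{T} \sum_{t=1}^{T} p(i \mid \vec{x}_t, \lambda) \tag{2-6}$$

Mean:

$$\hat{\vec{\mu}}_i = \frac{\sum_{t=1}^{T} p(i \mid \vec{x}_t, \lambda)\, \vec{x}_t}{\sum_{t=1}^{T} p(i \mid \vec{x}_t, \lambda)} \tag{2-7}$$

Variance:

$$\hat{\vec{\sigma}}_i^{2} = \frac{\sum_{t=1}^{T} p(i \mid \vec{x}_t, \lambda)\, \vec{x}_t^{\,2}}{\sum_{t=1}^{T} p(i \mid \vec{x}_t, \lambda)} - \hat{\vec{\mu}}_i^{\,2} \tag{2-8}$$

where the squares of vectors are taken element by element.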
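One EM iteration built from the posterior (2-5) and the updates (2-6) to (2-8) can be sketched as follows for diagonal covariances. The function names, the toy one-dimensional data and the starting model are hypothetical; the sketch works with plain densities rather than log-densities, so it is only suitable for small, well-scaled examples.

```python
import math

def gauss_diag(x, mean, var):
    """Gaussian density N(x; mean, diag(var))."""
    norm, expo = 1.0, 0.0
    for a, m, v in zip(x, mean, var):
        norm *= math.sqrt(2.0 * math.pi * v)
        expo += (a - m) ** 2 / v
    return math.exp(-0.5 * expo) / norm

def log_likelihood(data, weights, means, variances):
    """Total log-likelihood of i.i.d. vectors under the GMM."""
    return sum(math.log(sum(w * gauss_diag(x, m, v)
                            for w, m, v in zip(weights, means, variances)))
               for x in data)

def em_step(data, weights, means, variances, var_floor=1e-6):
    """One EM iteration: posteriors (2-5), then updates (2-6) to (2-8)."""
    T, D = len(data), len(data[0])
    post = []
    for x in data:
        joint = [w * gauss_diag(x, m, v)
                 for w, m, v in zip(weights, means, variances)]
        total = sum(joint)
        post.append([j / total for j in joint])
    new_w, new_mu, new_var = [], [], []
    for i in range(len(weights)):
        occ = sum(p[i] for p in post)  # soft count of mixture i
        mu = [sum(p[i] * x[k] for p, x in zip(post, data)) / occ
              for k in range(D)]
        second = [sum(p[i] * x[k] ** 2 for p, x in zip(post, data)) / occ
                  for k in range(D)]
        new_w.append(occ / T)
        new_mu.append(mu)
        # The variance floor keeps a mixture from collapsing onto a single point.
        new_var.append([max(s - m * m, var_floor) for s, m in zip(second, mu)])
    return new_w, new_mu, new_var

# Toy one-dimensional data and a rough starting model (illustrative values only).
data = [[0.0], [0.2], [-0.1], [4.0], [4.2], [3.9]]
model = ([0.5, 0.5], [[0.5], [3.5]], [[1.0], [1.0]])
for _ in range(5):
    print(round(log_likelihood(data, *model), 4))
    model = em_step(data, *model)
```

Printing the log-likelihood before each update gives a quick check of the monotonic behaviour mentioned above.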

To prevent the variances from converging to zero, a variance floor of 0.000001 is adopted. After the model parameters $\{\lambda_1, \lambda_2, \ldots, \lambda_S\}$ of all S speakers have been estimated, the speaker identification decision is made according to the following probability measure:

$$\text{speaker ID} = \arg\max_{1 \le k \le S}\, p(x \mid \lambda_k) \tag{2-9}$$
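The decision rule (2-9) can be sketched as below. Because the frames are treated as i.i.d., the log-likelihood of an utterance is the sum of the per-frame log-likelihoods, so the sketch compares sums of logs instead of products of densities. The speaker labels, model parameters and test frames are hypothetical.

```python
import math

def frame_log_pdf(x, weights, means, variances):
    """log p(x_t | lambda) for one frame under a diagonal-covariance GMM."""
    terms = []
    for w, m, v in zip(weights, means, variances):
        quad = sum((a - b) ** 2 / c + math.log(2.0 * math.pi * c)
                   for a, b, c in zip(x, m, v))
        terms.append(math.log(w) - 0.5 * quad)
    peak = max(terms)
    return peak + math.log(sum(math.exp(t - peak) for t in terms))

def identify(frames, speaker_models):
    """Return the label of the model with the largest total log-likelihood."""
    scores = {name: sum(frame_log_pdf(x, *model) for x in frames)
              for name, model in speaker_models.items()}
    return max(scores, key=scores.get), scores

# Two hypothetical one-dimensional speaker models: (weights, means, variances).
models = {
    "spk_a": ([1.0], [[0.0]], [[1.0]]),
    "spk_b": ([0.5, 0.5], [[3.0], [5.0]], [[1.0], [1.0]]),
}
best, scores = identify([[2.9], [4.8], [3.3]], models)
print(best, scores)
```

In a closed-set task the best-scoring model is always reported, however low its score is.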


2.3 Maximum A Posteriori Adapted Gaussian Mixture Model

Conventional GMMs trained by the EM algorithm as described in Section 2.2 perform well when a large amount of training data is available to characterize each speaker. In other words, training a GMM for a particular speaker requires data covering all of his or her possible pronunciations. In the real world, however, this condition is rarely feasible. Adaptation of acoustic models has been studied to solve this problem. One successful adaptation approach, the universal background model-maximum a posteriori (UBM-MAP) approach, has been widely used in text-independent speaker verification tasks in recent years.

The UBM is a large GMM, usually with 256 to 2048 mixtures depending on the size of the training data. Fewer mixtures are often used in applications with constrained speech (such as digits or a fixed vocabulary), while 2048 mixtures are used with unconstrained speech (such as conversational speech). This approach first pools a large amount of speech data gathered from a large number of background speakers to train a universal background model (UBM) with the LBG and EM algorithms. Unlike the standard approach, in which the model of a particular speaker is trained by maximum likelihood independently of the UBM, this approach derives the speaker model λ by adapting the parameters of the well-trained UBM with the speaker's training speech via the MAP estimation technique. For a UBM and training vectors from the hypothesized speaker, $x = \{\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_T\}$, the MAP estimate of the parameters of the speaker GMM is:

$$\lambda_{MAP} = \arg\max_{\lambda}\, p(\lambda \mid x) = \arg\max_{\lambda}\, \frac{p(x \mid \lambda)\, p(\lambda)}{p(x)} = \arg\max_{\lambda}\, p(x \mid \lambda)\, p(\lambda) \tag{2-10}$$

Like the EM algorithm, the adaptation is a two-step estimation process. The specifics of the adaptation are as follows. Given training vectors from the target speaker, $x = \{\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_T\}$, and a UBM $\lambda_{UBM} = \{w_i, \vec{\mu}_i, \Sigma_i\}_{i=1}^{M}$, we first compute the posterior probability $p(i \mid \vec{x}_t, \lambda_{UBM})$ of mixture i in the UBM:

Posterior probability:

$$p(i \mid \vec{x}_t, \lambda_{UBM}) = \frac{w_i\, N(\vec{x}_t;\, \vec{\mu}_i, \Sigma_i)}{\sum_{j=1}^{M} w_j\, N(\vec{x}_t;\, \vec{\mu}_j, \Sigma_j)}, \quad i = 1 \ldots M \tag{2-11}$$

We then use $p(i \mid \vec{x}_t, \lambda_{UBM})$ and $\vec{x}_t$ to compute the sufficient statistics for the mixture weight, mean, and second moment parameters:

Weight:

$$n_i = \sum_{t=1}^{T} p(i \mid \vec{x}_t, \lambda_{UBM}) \tag{2-12}$$

Mean:

$$E_i(\vec{x}) = \frac{1}{n_i} \sum_{t=1}^{T} p(i \mid \vec{x}_t, \lambda_{UBM})\, \vec{x}_t \tag{2-13}$$

Second moment:

$$E_i(\vec{x}^{\,2}) = \frac{1}{n_i} \sum_{t=1}^{T} p(i \mid \vec{x}_t, \lambda_{UBM})\, \vec{x}_t^{\,2} \tag{2-14}$$

This is the same as the expectation step in the EM algorithm. Finally, these new statistics from the training data are used to update the old UBM statistics of mixture i, giving the adapted parameters of mixture i:

$$\begin{aligned}
\hat{w}_i &= \left[\alpha_i\, n_i / T + (1-\alpha_i)\, w_i\right] \gamma \\
\hat{\vec{\mu}}_i &= \alpha_i\, E_i(\vec{x}) + (1-\alpha_i)\, \vec{\mu}_i \\
\hat{\vec{\sigma}}_i^{2} &= \alpha_i\, E_i(\vec{x}^{\,2}) + (1-\alpha_i)\,(\vec{\sigma}_i^{2} + \vec{\mu}_i^{\,2}) - \hat{\vec{\mu}}_i^{\,2}
\end{aligned} \tag{2-15}$$

where the scale factor γ is computed over all adapted mixture weights to ensure that they sum to unity, and the adaptation coefficient $\alpha_i$ controls the balance between the old and new estimates. It is defined as:

$$\alpha_i = \frac{n_i}{n_i + r} \tag{2-16}$$

where r is a fixed relevance factor which can be viewed as an adaptation coefficient. If r is large, adaptation is slow and if r is small, adaptation is fast. We set r = 16 in this thesis. In [27], only adapting the means shows the best performance in simulations. The performance of adapting the means and variances is similar to the performance of adapting the means only. After adaptation, the mixture components of the adapted GMM for each speaker retain a correspondence with the mixtures of the original UBM.
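The adaptation equations (2-11) to (2-16) can be put together in a short Python sketch. It adapts weights, means and variances of a diagonal-covariance UBM; the function names and the toy UBM are hypothetical, and the treatment of a mixture that receives no data is an assumption of this sketch (its parameters are simply kept). The relevance factor is assumed to be positive.

```python
import math

def gauss_diag(x, mean, var):
    """Gaussian density N(x; mean, diag(var))."""
    norm, expo = 1.0, 0.0
    for a, m, v in zip(x, mean, var):
        norm *= math.sqrt(2.0 * math.pi * v)
        expo += (a - m) ** 2 / v
    return math.exp(-0.5 * expo) / norm

def map_adapt(data, ubm_w, ubm_mu, ubm_var, r=16.0):
    """Adapt a diagonal-covariance UBM to one speaker, following (2-11) to (2-16)."""
    T, D = len(data), len(data[0])
    post = []
    for x in data:                                          # (2-11)
        joint = [w * gauss_diag(x, m, v)
                 for w, m, v in zip(ubm_w, ubm_mu, ubm_var)]
        total = sum(joint)
        post.append([j / total for j in joint])
    w_hat, mu_hat, var_hat = [], [], []
    for i in range(len(ubm_w)):
        n_i = sum(p[i] for p in post)                       # (2-12)
        alpha = n_i / (n_i + r)                             # (2-16)
        if n_i > 0.0:
            ex = [sum(p[i] * x[k] for p, x in zip(post, data)) / n_i
                  for k in range(D)]                        # (2-13)
            ex2 = [sum(p[i] * x[k] ** 2 for p, x in zip(post, data)) / n_i
                   for k in range(D)]                       # (2-14)
        else:
            # No data assigned: alpha is 0, so these placeholders are not used.
            ex, ex2 = ubm_mu[i], ubm_var[i]
        mu = [alpha * e + (1.0 - alpha) * m for e, m in zip(ex, ubm_mu[i])]
        var = [alpha * e2 + (1.0 - alpha) * (v + m * m) - mh * mh
               for e2, v, m, mh in zip(ex2, ubm_var[i], ubm_mu[i], mu)]
        w_hat.append(alpha * n_i / T + (1.0 - alpha) * ubm_w[i])
        mu_hat.append(mu)
        var_hat.append(var)
    gamma = 1.0 / sum(w_hat)                                # weights sum to one
    return [w * gamma for w in w_hat], mu_hat, var_hat      # (2-15)

# Toy example: a one-mixture "UBM" and 16 identical speaker frames.
weights, means, variances = map_adapt([[2.0]] * 16, [1.0], [[0.0]], [[1.0]], r=16.0)
print(weights, means, variances)
```

With 16 frames and r = 16 the adaptation coefficient is 0.5, so the adapted mean lies halfway between the UBM mean and the speaker's data mean. Mean-only adaptation corresponds to keeping the UBM weights and variances and using only the second line of (2-15).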


Chapter 3 Auditory Model and Features

The speaker recognition system is described in Chapter 2. This chapter provides a brief review of the auditory model, which contains an early cochlear (ear) and a central auditory cortex (A1) module, proposed by Shamma et al. [26]. The auditory spectral features are extracted from the early cochlear module and used in the speaker recognition simulations. The Spectro-Temporal Modulation Filtering (STMF) is performed by the cortical module and produces cleaned spectral features. This approach was first investigated by Hung [29] in digit recognition tasks.

3.1 The Motivated Use of Auditory Model

In recent years, there has been increasing interest in adopting properties of human hearing perception in speech-related applications to overcome various types of distortion, such as additive noise, convolutional noise and degradation caused by channel mismatch. It has been shown that human hearing is highly robust to noise in recognition tests [22].

For instance, perceptual linear predictive (PLP) coefficients [30], which are among the most widely used coefficients, embed two properties of hearing perception: the equal-loudness pre-emphasis and the intensity-loudness conversion. To compensate for the fact that humans have unequal hearing thresholds at different frequencies, the speech power spectra are multiplied by the magnitude response of an equal-loudness pre-emphasis filter. The intensity-loudness conversion addresses the non-linear relation between the intensity of a sound and its perceived loudness. It has been shown that PLP coefficients are more robust to noise than linear predictive cepstral coefficients (LPCCs).


Here, we evaluate the performance of a noise suppression algorithm, which works on the internal perceptual representation of an auditory model, in text-independent speaker identification tasks and compare its robustness to other features or algorithms. The auditory model is inspired by psycho-acoustical and neuro-physiological findings along the mammal’s hearing pathway: the cochlea and the cortex. Figure 3-1 shows a schematic plot of the auditory pathway. In the following sections, we introduce these two stages (cochlear and cortical stages) and related functions in the auditory model.

FIGURE 3-1

Hearing pathway.


3.2 Cochlear Module and Auditory cepstral coefficients

FIGURE 3-2

The anatomy of the ear.

(http://www.advcoch.com/I2_Hearing_Physiology.htm)

The ear can be divided into three parts, the outer ear, the middle ear and the inner ear, as shown in Figure 3-2. The inner ear contains the cochlea, which is composed of three lymph-filled chambers, as shown in the left panel of Figure 3-3. The basilar membrane (BM), which separates the scala media from the scala tympani, plays a significant role in hearing. After the mechanical vibration reaches the oval window, a traveling wave is generated and propagates along the basilar membrane. Different locations on the BM respond maximally to traveling waves of different frequencies. The right panel of Figure 3-3 shows the characteristic frequencies along the basilar membrane. The inhibition between neighboring frequencies produced by the traveling wave might be the main cause of the well-known “frequency masking” phenomenon in audition.

The traveling wave generates displacement along the BM, and the hair cells residing along the basilar membrane transform the displacement into sensory nerve action potentials. There are two types of hair cells: inner hair cells and outer hair cells. Most of the transformation from mechanical vibrations to electrical potentials is done by the inner hair cells, which connect to the auditory nerve. Because a relaxation time is needed between consecutive firings of a neuron, firing rates cannot keep up with high-frequency inputs, as demonstrated in Figure 3-4. The firing rate of the auditory nerve is bounded by 4 to 5 kHz and that of the midbrain by about 1 kHz.

FIGURE 3-3

The basilar membrane diagram (left) and the characteristic frequency at the basilar membrane (right). (Hearing Physiology Handout, AAIP)

FIGURE 3-4

The firing rate of the auditory nerve in response to a pure-tone audio input. (Hearing Physiology Handout, AAIP)

FIGURE 3-5

Stages of the early cochlear module. (Block diagram: acoustic signal; cochlear filters H(f; f1), H(f; f2), …, H(f; f128); intermediate outputs y1 to y4; nonlinearity g(u); integrator; Log; DCT; ACC.)

Figure 3-5 depicts the early cochlear module and the derivation of the auditory cepstral coefficients (ACCs). The speech signal is first filtered by a set of 128 overlapping asymmetric constant-Q filters whose magnitude responses can be expressed as:

$$H(f; f_h) = \begin{cases} (f_h - f)^{\alpha}\, e^{-\beta (f_h - f)}, & 0 \le f \le f_h \\ 0, & f > f_h \end{cases} \tag{3-1}$$

where $f_h$ is the cut-off frequency (on the log-frequency axis), α = 0.3 and β = 8. These cochlear filters are evenly distributed over 5.3 octaves with a frequency resolution of 24 filters/octave. The cochlear filters represent the selectivity of the basilar membrane to different frequencies.

The output of each filter is then passed through a lateral inhibitory network (LIN), a half-wave rectifier and a lowpass filter. The LIN is implemented by a first-order differentiator along the log-frequency axis to roughly account for the frequency masking effect between neighboring neurons. It effectively sharpens the frequency response of each cochlear filter. The half-wave rectifier, combined with the subsequent lowpass filter, extracts the envelope of the filtered speech in each cochlear band.

Outputs at various stages can be stated as follows:

$$y_1(t, f_i) = s(t) *_t h(t; f_i) \tag{3-2}$$

$$y_2(t, f_i) = \partial_f\, y_1(t, f_i) = y_1(t, f_i) - y_1(t, f_{i-1}) \tag{3-3}$$

$$y_3(t, f_i) = \max\left(y_2(t, f_i),\, 0\right) \tag{3-4}$$

$$y_4(t, f_i) = y_3(t, f_i) *_t \mu(t; \tau) \tag{3-5}$$

where s(t) is the input speech signal; $h(t; f_i)$ is the impulse response of the i-th constant-Q filter with center frequency $f_i$, i = 1…128; $*_t$ is the convolution in the time domain; and the integration window $\mu(t; \tau) = e^{-t/\tau}\, u(t)$ models the current leakage along the neural pathway to the auditory cortex.
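The three stages that follow the cochlear filters, (3-3) to (3-5), can be sketched on a discrete time grid as below. The sketch assumes that the filter outputs y1 are already available as a list of frames; the function name, the frame period and the value of τ are hypothetical. Two further assumptions are made: the lowest channel is given a zero lower neighbour in the difference, and the exponential window is applied as a first-order recursion, which equals a discrete convolution with the sampled window up to a constant scale factor.

```python
import math

def early_stages(y1, frame_period, tau):
    """Apply (3-3) to (3-5) to y1[n][i], the output of filter i at time index n."""
    decay = math.exp(-frame_period / tau)  # value of exp(-t/tau) after one step
    prev = [0.0] * len(y1[0])
    y4 = []
    for frame in y1:
        # (3-3) lateral inhibition: difference between adjacent channels.
        y2 = [frame[i] - (frame[i - 1] if i > 0 else 0.0)
              for i in range(len(frame))]
        # (3-4) half-wave rectification.
        y3 = [max(v, 0.0) for v in y2]
        # (3-5) leaky integration, computed recursively.
        prev = [decay * p + v for p, v in zip(prev, y3)]
        y4.append(prev)
    return y4

# Two frames of three channels (illustrative values only).
y1 = [[1.0, 3.0, 2.0], [0.0, 1.0, 4.0]]
print(early_stages(y1, frame_period=0.001, tau=0.008))
```

The recursion keeps a decaying memory of earlier frames in each channel, which is what smooths the rectified output into an envelope.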

The output $y_4(t, f_i)$ is referred to as the auditory spectrogram, which captures the spectro-temporal envelopes of the input speech along the log-frequency and time axes. Similar to the derivation of conventional MFCCs, the auditory cepstral coefficients (ACCs) are obtained by applying the discrete cosine transform (DCT) to the logarithm of the auditory spectrum at each time instant. The ACCs A(k, t), k = 0…N−1, can be written as:

$$A(k, t) = \sqrt{\frac{2}{N}} \sum_{i=1}^{N} \ln y_4(t, f_i)\, \cos\left(\frac{\pi k}{N}\,(i - 0.5)\right) \tag{3-6}$$

Intuitively, ACCs represent smoothed auditory spectra, which reflect vocal tract information of the speaker, along human’s hearing pathway.
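The DCT step that turns one frame of the auditory spectrogram into cepstral coefficients can be sketched as follows. The sketch assumes the common √(2/N) scaling of the DCT and adds a small floor before the logarithm, because rectified channel outputs can be exactly zero; the function name, the floor value and the toy frame are hypothetical and not taken from the thesis.

```python
import math

def acc(frame, num_coeffs, floor=1e-10):
    """Cepstral coefficients of one auditory-spectrum frame via a DCT of its log."""
    n = len(frame)
    logs = [math.log(max(v, floor)) for v in frame]  # floor avoids log(0)
    scale = math.sqrt(2.0 / n)                       # assumed DCT normalisation
    return [scale * sum(l * math.cos(math.pi * k / n * (i - 0.5))
                        for i, l in enumerate(logs, start=1))
            for k in range(num_coeffs)]

# One toy frame with 8 channels (the model in the text uses 128 channels).
frame = [0.2, 0.9, 1.5, 0.7, 0.1, 0.05, 0.3, 0.6]
print(acc(frame, num_coeffs=4))
```

Keeping only the first few coefficients retains the slowly varying shape of the log spectrum across channels, which is the smoothing effect described above.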

3.3 Cortical Module and Spectro-temporal Modulation Filtering

The second, cortical module is inspired by the neural responses of the auditory cortex (A1) to different spectro-temporal variations. Such spectro-temporal variations are encoded by two parameters: rate and scale. The rate (or velocity) parameter ω, in Hz, describes how fast the signal's energy varies along the temporal axis. The scale (or density) parameter Ω, in cycle/octave, characterizes how broadly the signal's energy is distributed along the log-frequency axis. In addition, cortical neurons are selective to the direction of FM sweeps (upward or downward), which is represented in this module by the sign of the rate parameter (positive/negative for downward/upward sweeping direction).

To derive the spectro-temporal impulse responses of neurons in A1, moving ripple stimuli, the basis functions in the two-dimensional spectro-temporal domain, are used to drive the cortex. Figure 3-6 shows one example of a moving ripple stimulus with rate = +4 Hz and scale = 0.5 cycle/octave. Each neuron in A1 therefore has its own impulse response, which represents its preferred spectro-temporal pattern in the input spectrogram and is modeled by a 2D filter. To sum up, the first cochlear module of the auditory model produces a two-dimensional auditory spectrogram full of spectro-temporal amplitude modulations. The second cortical module then analyzes the auditory spectrogram with a bank of two-dimensional filters tuned to different spectro-temporal modulation parameters. Figure 3-7 demonstrates the 2D cortical filtering of eight modeled A1 neurons on a sample spectrogram. The small top panels in each subplot are the impulse responses of typical neurons tuned to slow/fast rates and coarse/fine scales. The bottom panels are the envelopes (local energies) of the outputs of these 2D spectro-temporal filters.


Therefore, a four-dimensional output $r(t, f, \omega, \Omega)$ of this module can be formulated as:

$$r(t, f, \omega, \Omega) = y_4(t, f) \ast_{tf} STIR(t, f; \omega, \Omega) \tag{3-7}$$

where $STIR(t, f; \omega, \Omega)$ is the spectro-temporal impulse response of the two-dimensional filter tuned to $\omega$ and $\Omega$, and $\ast_{tf}$ is the two-dimensional convolution along the time and log-frequency axes.

FIGURE 3-6

An example of moving ripple stimulus. ( Auditory Model Handout, AAIP)


FIGURE 3-7

The response for 8 modeled neurons in the cortex. (Auditory Model Handout, AAIP)

The local energy of the four-dimensional output is then computed as:

$$E(t, f, \omega, \Omega) = \left| r(t, f, \omega, \Omega) + jH\big[r(t, f, \omega, \Omega)\big] \right| \tag{3-8}$$

where $H[\cdot]$ is the Hilbert transform along the log-frequency axis. Therefore, for any fixed t-f point in the auditory spectrogram, $E(\omega, \Omega; t, f)$, which is referred to as the rate-scale representation, records the energies of local modulations at different combinations of rate, scale and directionality. In Figure 3-8, the left panel shows an auditory spectrogram and the right panels show the corresponding rate-scale representations of the two points indicated by 'x' in the spectrogram. As seen in the figure, the local modulations at those two points are dominated by (8 Hz, 4 cycle/octave, upward) and (8~16 Hz, 2~4 cycle/octave, downward), respectively.


In summary, the first cochlear module produces a two-dimensional auditory spectrogram from a one-dimensional acoustic signal. The second cortical module analyzes amplitude modulations of the 2D auditory spectrogram in the rate-scale-directionality parameter space. More extensive descriptions, mathematical formulations and output examples of these two modules can be found in [26].

[Panels: auditory spectrogram (frequency in Hz vs. time in ms); upward and downward rate-scale plots (scale in cycle/octave vs. rate in Hz) for the two marked points.]

FIGURE 3-8

Rate-scale representation from the A1 module.

It is known that human hearing analyzes not only the spectral content but also the temporal behavior of sound. In our auditory model, this ability is characterized by the joint spectro-temporal modulation analysis performed by the second cortical module. In addition to the spectral content estimated in the first cochlear module, certain high-level features, such as speaking rate and FM sweeping directions, are captured by the second cortical module. It has been shown that joint spectro-temporal modulations below 16 Hz and 8 cycle/octave preserve the intelligibility of speech well [31]. Not surprisingly, as shown in [32], the long-term averaged rate-scale energy pattern of speech falls roughly within these ranges. On the other hand, rate-scale patterns of noises differ from those of speech, indicating different high-level information between speech and noises. For example, Figure 3-9 shows auditory spectrograms ((a), (b)) and rate-scale energy representations ((c), (d)) of clean speech and white noise. This figure demonstrates that most of the spectro-temporal modulations of speech lie within the range of rate = 2-16 Hz and scale = 0.5-8 cycle/octave, while the white noise has spectro-temporal modulations dominated by high rates and high scales.

[Panels: (a) auditory spectrogram of "Come home right away." and (b) of white noise (frequency in Hz vs. time in ms); (c), (d) corresponding rate-scale plots (scale in cycle/octave vs. rate in Hz).]

FIGURE 3-9

Auditory spectrograms of (a) clean speech, and (b) white noise. Rate-scale representations (with rate and scale in x- and y- axis) of (c) clean speech, and (d) white noise.


Accordingly, a noise suppression algorithm by the joint spectro-temporal modulation filtering (STMF) is proposed in [29]. For an input noisy speech, spectro-temporal modulations only within 2~32 Hz and 0.5~8 cycle/octave are kept in the STMF process and a cleaner spectrogram is generated:

$$y_5(t, f) = \sum_{\pm 2 \le \omega \le \pm 32,\; 0.5 \le \Omega \le 8} r(t, f, \omega, \Omega) \ast_{tf} STIR_1^{*}(-t, -f; \omega, \Omega) \tag{3-9}$$

where $STIR_1(t, f; \omega, \Omega)$ is the normalization of $STIR(t, f; \omega, \Omega)$.

Figure 3-10 demonstrates the procedure of our STMF noise suppression algorithm. The noisy auditory spectrogram is passed through the STMF process. Then, a simple threshold δ (a certain percentile of the maximum value of the cleaned spectrogram) is used to separate speech from non-speech regions in the cleaned spectrogram. The threshold δ governs the trade-off between speech distortion and noise suppression. Finally, an α-1 template (α for non-speech regions and 1 for speech regions) is generated and multiplied with the original noisy spectrogram to produce a noise-suppressed spectrogram. ACCs are then derived from the noise-suppressed spectrogram for our speaker recognition simulations.
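The thresholding and template multiplication steps can be sketched as follows. The sketch assumes the STMF-filtered spectrogram is already available, and it interprets the threshold as the fraction δ of the maximum value of the cleaned spectrogram; that interpretation, the function name and the argument names are assumptions for illustration only.

```python
def stmf_mask(noisy, cleaned, delta, alpha):
    """Apply an alpha-1 template to a noisy spectrogram.

    noisy, cleaned: equally sized lists of rows (channels x frames).
    """
    # Threshold as a fraction of the cleaned spectrogram's peak (assumption).
    peak = max(max(row) for row in cleaned)
    threshold = delta * peak
    # Template value 1 for speech units, alpha for non-speech units.
    return [[v if c >= threshold else alpha * v
             for v, c in zip(noisy_row, cleaned_row)]
            for noisy_row, cleaned_row in zip(noisy, cleaned)]
```

Note that the template is derived from the cleaned spectrogram but applied to the original noisy one, so speech regions keep their original values and only non-speech t-f units are attenuated.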


[Figure 3-10 panels (frequency in Hz vs. time in ms): noisy speech in car noise with 5 dB SNR; auditory spectrogram after STMF (y5); speech vs. non-speech region, obtained by thresholding; enhanced auditory spectrogram, obtained by multiplication.]


Chapter 4 Evaluation

The robustness of ACCs before and after the STMF process is evaluated in text-independent closed-set speaker identification simulations. The experimental settings follow those in [25], and the results are compared with those of the Auditory-based Nonnegative Tensor Cepstral Coefficients (ANTCCs) proposed in [25]. Speech samples from the TIMIT and GRID [33] corpora are tested. Note that these corpora do not consider session variability. In this chapter, we first introduce the TIMIT and GRID databases and the evaluation measurements used in this thesis. Simulation results are then shown in Section 4.2. Finally, these evaluations are discussed in Section 4.3.

4.1 Database and Evaluation Measurements

Speech samples in the TIMIT corpus were recorded at Texas Instruments (TI) and transcribed at the Massachusetts Institute of Technology (MIT), hence the name "TIMIT". TIMIT was designed to provide speech data for the acquisition of acoustic-phonetic knowledge and for the development and evaluation of automatic speech and speaker recognition systems. The corpus contains high-quality speech, which is ideal for testing a technique without the interference of noise and channel variations. It contains a total of 6300 clean sentences sampled at 16 kHz, 10 sentences uttered by each of the 630 speakers (438 males and 192 females) from 8 major dialect regions of the United States. In this thesis, the first 8 utterances per speaker (two sa sentences, three si sentences and three sx sentences) and the remaining 2 utterances are used as the training and testing sets, respectively. The training data for each speaker amounts to approximately 24 seconds.


GRID is an audio-visual corpus for speech separation and recognition tasks. It also contains high-quality speech, from 34 speakers (18 males and 16 females), each saying 1000 three-second phrases. Each phrase consists of a sequence of 6 characters, shown in Figure 4-1, and is sampled at 24 kHz. As in [25], for the GRID tests, speech samples are downsampled to 8 kHz, and 50/60 utterances per speaker are randomly chosen for training/testing purposes.

FIGURE 4-1 The 6 characters in GRID corpus.

Four different types of noise (factory, pink, white and F-16) are extracted from Noisex-92 [34] and mixed with the clean speech over a wide range of SNRs (0, 5, 10 and 15 dB, plus the clean condition). 30-coefficient feature vectors (excluding the 0th coefficient) of conventional MFCCs and ACCs are calculated from a 25 ms window with a 10 ms frame increment. As in [35], each speaker is modeled by a 32-mixture GMM derived by the EM algorithm in the training process. The K-means algorithm is used to initialize the EM algorithm. Additionally, a simple technique, cepstral mean subtraction (CMS) [6], which has been used in speaker recognition, is also considered in this work. Note that no VAD is used in this study.
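The thesis does not spell out how the noise is scaled before mixing. One common approach, sketched below under that caveat, scales the noise so that the ratio of average speech power to average noise power over the whole utterance equals the target SNR. The function and argument names are hypothetical.

```python
import math

def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech at a target SNR (in dB), using utterance-level power."""
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as speech")
    noise = noise[:len(speech)]
    p_speech = sum(s * s for s in speech) / len(speech)
    p_noise = sum(v * v for v in noise) / len(noise)
    # Choose gain so that p_speech / (gain^2 * p_noise) = 10^(snr_db / 10).
    gain = math.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return [s + gain * v for s, v in zip(speech, noise)]
```

Because no VAD is used, silent portions of the utterance count toward the speech power in this sketch, which lowers the effective SNR within the speech-active portions.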


4.2 Results

4.2.1 Results in GMM

To implement our STMF based speaker identification system, we first determine the two parameters used in the STMF process: the threshold δ and the template value α. Figure 4-2 and Table 1 show the identification performance of a GMM based recognizer on the GRID corpus with different STMF parameter sets. The best average recognition rate is obtained with δ = 0.05 and α = 0.01. However, Table 1 further shows that δ = 0.02 and α = $e^{-1}$ outperforms δ = 0.05 and α = 0.01 in high SNR conditions. This demonstrates that the threshold δ governs the trade-off between speech distortion and noise suppression, and that high and low SNR conditions place opposing demands on these two effects. Therefore, we adopt two parameter sets (A: δ = 0.02 and α = $e^{-1}$; B: δ = 0.05 and α = 0.01) in the following simulations.

[Figure 4-2 plot: recognition rate (0% to 80%) versus α for non-speech regions (0.5, exp(-1), 0.1, 0.01, 0.0001), with one curve per threshold δ (0.01, 0.02, 0.05, 0.1).]


Table 1. Correct recognition rates (in %) with different STMF (δ, α) parameters under various SNRs of the pink noise.

                δ = 0.01                        δ = 0.02
                15dB   10dB   5dB    0dB        15dB   10dB   5dB    0dB
α = 0.5         90.83  82.21  54.80  24.95      90.88  81.23  58.97  22.75
α = exp(-1)     85.88  74.61  57.25  31.86      90.54  81.49  60.64  19.71
α = 0.1         90.20  80.39  30.20   6.86      89.61  83.82  63.68  17.40
α = 0.01        88.04  77.99  23.87   3.43      90.10  83.53  66.32   9.36
α = 0.001       80.49  66.57  22.50   4.36      86.23  79.61  56.37   4.90

                δ = 0.05                        δ = 0.1
                15dB   10dB   5dB    0dB        15dB   10dB   5dB    0dB
α = 0.5         85.74  74.17  54.61  13.43      85.88  74.07  52.01  28.97
α = exp(-1)     85.88  74.61  57.25  31.86      85.88  74.61  57.25  31.86
α = 0.1         85.44  78.28  67.50  41.72      84.95  72.99  58.48  42.30
α = 0.01        85.54  81.57  75.83  52.01      83.28  73.68  62.94  52.79
α = 0.001       83.58  79.12  71.13  38.77      80.54  74.46  64.22  55.20

Correct recognition rates of the MFCC baseline, our proposed ACC features before and after STMF, and the ANTCCs from [25] are presented in Table 2 and Figure 4-3. The 70-speaker population is randomly chosen from the TIMIT corpus, and all speech samples have a 16 kHz sampling frequency. Without CMS, our features and ANTCCs achieve much higher recognition rates than MFCCs under noisy conditions. Under all tested SNR conditions, ACCs match or outperform MFCCs, in most cases by a wide margin. In addition, the ACCs after STMF outperform ANTCCs in almost all conditions (except in the condition of 15 dB F16 noise).

It can be observed that our ACCs perform worse in the white and F16 noises than in the pink and factory noises, especially in high SNR conditions. One possible reason is that the white and F16 noises both possess higher energies in high-frequency regions than the pink and factory noises. The 128 constant-Q cochlear filters possess constant frequency resolution and are normalized to have an almost flat overall frequency response along the log-frequency axis. For the white and F16 noises, high energies within high-frequency regions would appear repeatedly in several high-frequency cochlear channels due to their wide bandwidth. This phenomenon produces a more severe mismatch between ACCs from noisy speech and from clean speech for high-frequency noises than for low-frequency noises.

Table 2. Correct recognition rates (in %) of 70 people in TIMIT corpus.

                Factory1                        Pink
                15dB   10dB   5dB    0dB        15dB   10dB   5dB    0dB
MFCC            66.43  34.29  10.71   5.71      47.86  22.86   8.57   2.86
MFCC-CMS        65.00  32.86  11.43   2.86      46.43  21.43   5.00   2.86
ACC             93.57  80.00  37.86  12.86      87.86  62.14  25.00   9.29
ACC-CMS         82.14  52.86  27.86   6.43      73.57  42.14  18.57   4.29
STMF (A)        94.29  86.43  51.43  10.00      90.71  70.00  41.43   7.86
STMF-CMS (A)    85.00  75.00  57.14  11.43      80.00  64.29  45.00  12.14
STMF (B)        85.00  78.57  62.86  21.43      84.29  76.43  64.29  29.29
STMF-CMS (B)    84.29  75.00  66.43  31.43      80.71  72.14  65.00  38.57
ANTCC           78.1   49.52  12.86   2.43      78.57  50.95  13.81   2.43

                White                           F16
                15dB   10dB   5dB    0dB        15dB   10dB   5dB    0dB
MFCC            30.00  15.71   7.86   4.29      45.71  16.43   3.57   2.14
MFCC-CMS        36.43  18.57  10.00   2.86      58.57  28.57   7.86   2.86
ACC             51.43  17.14  10.00   4.29      56.43  28.57  11.43   5.00
ACC-CMS         51.43  34.29  20.71  11.43      73.57  56.43  20.00   4.29
STMF (A)        71.43  45.00  16.43   5.00      69.29  38.57  18.57   4.29
STMF-CMS (A)    67.86  51.43  37.86  12.14      82.14  68.57  55.00  10.71
STMF (B)        72.14  60.71  48.57  16.43      67.86  53.57  36.43  10.71
STMF-CMS (B)    76.43  66.43  60.00  28.57      78.57  73.57  63.57  29.29
ANTCC           64.29  29.52   3.81   2.9       77.62  47.14  15.24   2.9

                MFCC   MFCC-CMS  ACC    ACC-CMS  ANTCC
Clean           100    100       100    98.57    97.62


[Plot: recognition rate (%) versus SNR (clean, 15 dB, 10 dB, 5 dB, 0 dB) for MFCC, MFCC-CMS, ACC, ACC-CMS, STMF (B), STMF-CMS (B) and ANTCC.]

FIGURE 4-3 Average recognition rates (in %) of 70 people in TIMIT corpus.

On the other hand, CMS improves the recognition rates of ACCs before and after STMF in low SNR conditions (0~10 dB), as shown in Figure 4-3. However, CMS not only mitigates the noise effect but also reduces the speaker variability. The recognition rates might therefore be diminished in high SNR conditions, as in the clean condition in Table 2 and Figure 4-3.
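For reference, CMS in its basic per-utterance form subtracts the time average of each cepstral coefficient from every frame. A stationary convolutional distortion becomes an additive constant in the cepstral domain and is removed by this subtraction, but so is the speaker's own long-term cepstral mean, which is consistent with the loss of speaker variability noted above. The following is a minimal sketch with a hypothetical function name; the thesis does not state which CMS variant was used.

```python
def cepstral_mean_subtraction(frames):
    """Per-utterance CMS; frames is a list of cepstral vectors (frames x coefficients)."""
    n = len(frames)
    dim = len(frames[0])
    # Mean of each coefficient over all frames of the utterance.
    means = [sum(f[k] for f in frames) / n for k in range(dim)]
    return [[f[k] - means[k] for k in range(dim)] for f in frames]
```

After this step every coefficient has zero mean over the utterance, so any constant offset shared by all frames no longer influences the GMM likelihoods.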

Evaluation results for the GRID corpus are presented in Table 3 and Figure 4-4. First, we consider results without CMS normalization. Similar to the TIMIT results, ACCs outperform MFCCs in the factory, pink and white noises at all SNRs, although not in the F16 noise, and the STMF process generally enhances recognition rates further. Compared with ANTCCs, STMF performs 25.3% better (average recognition rate) in low SNR conditions (5 and 0 dB), but 10.24% worse (on average) in high SNR conditions (15 and 10 dB). Note that the training and testing sets used in this study are not identical to those used in [25] for both the TIMIT and GRID evaluations shown in Table 2 and Table 3. Our features perform slightly worse against the white and F16 noises than against the pink and factory noises in both the TIMIT and GRID evaluations.

Table 3. Correct recognition rates (in %) of GRID corpus.

                Factory1                        Pink
                15dB   10dB   5dB    0dB        15dB   10dB   5dB    0dB
MFCC            79.46  60.78  34.26  13.97      75.98  51.91  26.27   7.99
MFCC-CMS        73.28  51.72  24.46   5.69      69.12  47.99  22.55   5.83
ACC             87.84  75.29  55.49  31.81      87.79  72.40  49.17  23.14
ACC-CMS         86.32  73.63  52.35  20.34      83.92  70.83  47.94  22.45
STMF (A)        90.39  81.91  58.58  20.39      90.54  80.49  60.64  19.71
STMF-CMS (A)    89.71  84.46  55.78  12.30      88.73  83.24  57.79  11.52
STMF (B)        85.78  81.32  75.00  41.47      85.54  81.57  75.83  52.01
STMF-CMS (B)    81.62  80.00  74.71  46.86      80.15  78.73  75.15  51.32
ANTCC           97.55  87.75  44.61   8.82      95.59  87.75  45.1    9.31

                White                           F16
                15dB   10dB   5dB    0dB        15dB   10dB   5dB    0dB
MFCC            55.74  38.97  27.16  14.61      69.07  46.76  27.06  15.05
MFCC-CMS        52.16  38.58  27.50   8.92      71.18  50.83  22.99   5.78
ACC             76.47  55.29  30.69  14.66      65.69  44.36  20.93  10.10
ACC-CMS         72.25  55.54  40.20  32.06      83.09  68.43  44.90  19.17
STMF (A)        83.28  72.65  45.83  19.71      88.14  82.65  52.55   8.04
STMF-CMS (A)    88.87  84.61  74.61  46.18      90.88  87.65  78.68  36.42
STMF (B)        82.30  77.01  67.06  30.49      66.76  55.83  39.26  14.90
STMF-CMS (B)    78.09  76.81  71.91  45.54      80.83  79.41  72.01  36.86
ANTCC           95.59  69.61  38.24  10.29      95.1   69.12  27.49   9.8

                MFCC   MFCC-CMS  ACC    ACC-CMS  ANTCC
Clean           99.56  96.62     99.26  95.69    100

STMF (A) STMF-CMS(A) STMF (B) STMF-CMS(B)


[Plot: recognition rate (%) versus SNR (clean, 15 dB, 10 dB, 5 dB, 0 dB) for MFCC, MFCC-CMS, ACC, ACC-CMS, STMF (B), STMF-CMS (B) and ANTCC.]

FIGURE 4-4 Average recognition rates (in %) of GRID corpus.

The CMS normalization yields similar trends for ACCs and STMF features in both corpora, as shown in Figures 4-3 and 4-4. That is, CMS enhances average recognition rates in low SNR conditions but degrades the performance in the 15 dB and clean conditions. For MFCCs, however, CMS produces worse average performance. It is interesting to note that all features (MFCCs, ACCs and STMF) for both corpora (TIMIT and GRID) benefit from the CMS normalization in the F16 noise, as shown in Table 2 and Table 3. This suggests that the CMS normalization is particularly effective against the F16 noise.


4.2.2 Results in MAP-GMM

In this section, we further adopt the MAP-GMM method to boost the performance of the STMF-CMS feature. To implement the MAP-GMM in the speaker identification system, we first determine the number of mixtures used in the UBM (Universal Background Model). The effects of different numbers of mixtures are investigated by building different UBMs from the sa1 sentence of all 630 speakers in the TIMIT corpus. In order to match the testing samples from the GRID corpus, which have an 8 kHz sampling frequency, the training sa1 sentences are downsampled to 8 kHz. Then, each speaker's model is built from his or her training speech and the UBM by adapting the mean and variance via the MAP-GMM approach described in Section 2.3. Recognition rates on the GRID corpus using 256-, 512- and 1024-mixture UBMs are shown in Figure 4-5. Since UBMs with more mixtures only slightly improve the performance while requiring much heavier computation, the 256-mixture UBM is used in this thesis. Figure 4-6 shows the effect of building the 256-mixture UBM from different training data sets. The UBM is built either on the 630 sa1 sentences only or on these combined with the 630 sa2 sentences. As shown in Figure 4-6, the recognition rates for the two training sets are very close. Therefore, the 256-mixture UBM trained from sa1 sentences is used in the succeeding tests.
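As an illustration of the adaptation step, the sketch below performs MAP adaptation of the UBM mean vectors only, for a diagonal-covariance GMM, using the widely used relevance-factor form. The relevance factor of 16, the diagonal covariances and all names are assumptions of this sketch; the exact formulation used in this thesis is the one given in Section 2.3.

```python
import math

def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at x."""
    return -0.5 * sum(math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
                      for xi, m, v in zip(x, mean, var))

def map_adapt_means(frames, weights, means, variances, relevance=16.0):
    """Return MAP-adapted mean vectors; weights and variances stay unchanged."""
    n_mix, dim = len(means), len(means[0])
    counts = [0.0] * n_mix                      # soft counts n_i
    sums = [[0.0] * dim for _ in range(n_mix)]  # posterior-weighted sums of x
    for x in frames:
        log_p = [math.log(w) + log_gauss_diag(x, m, v)
                 for w, m, v in zip(weights, means, variances)]
        top = max(log_p)                        # for numerical stability
        post = [math.exp(lp - top) for lp in log_p]
        total = sum(post)
        for i in range(n_mix):
            gamma = post[i] / total
            counts[i] += gamma
            for d in range(dim):
                sums[i][d] += gamma * x[d]
    # a_i * E_i + (1 - a_i) * mu_i with a_i = n_i / (n_i + r)
    # simplifies to (sum_i + r * mu_i) / (n_i + r).
    return [[(sums[i][d] + relevance * means[i][d]) / (counts[i] + relevance)
             for d in range(dim)] for i in range(n_mix)]
```

Mixtures that receive little adaptation data keep means close to those of the UBM, while well-observed mixtures move toward the speaker's data.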


[Plot: recognition rate (%) versus SNR (15 dB, 10 dB, 5 dB, 0 dB) for 256-, 512- and 1024-mixture UBMs.]

FIGURE 4-5 Correct recognition rates (in %) for GRID corpus: MAP-GMM

(adapted mean and variance) for the STMF-CMS(A) feature with different mixture numbers

[Plot: recognition rate (%) versus SNR (15 dB, 10 dB, 5 dB, 0 dB) for UBMs trained on sa1 only and on sa1 and sa2.]

FIGURE 4-6 Correct recognition rates (in %) for GRID corpus: a 256-mixture


[Plot: recognition rate (%) versus SNR (15 dB, 10 dB, 5 dB, 0 dB) for UBM-m, UBM-m CMS, UBM-mv, UBM-mv CMS, UBM-mvw and UBM-mvw CMS.]

FIGURE 4-7 Average recognition rates (in %) of GRID corpus using the STMF(A)

feature with various adaptations and CMS normalization

The 256-mixture UBM is used to evaluate the performance of adapting different parameters (mean only; mean and variance; mean, variance and weight) with and without the CMS normalization. Figure 4-7 indicates that the best overall performance is obtained by adapting only the mean vectors with the CMS normalization. Thus, the simulations in the following sections use a 256-mixture UBM with mean adaptation and the CMS normalization. Two UBMs, built from sa1 training sentences at 8 kHz and 16 kHz sampling frequencies, are used to match the different sampling frequencies of the GRID and TIMIT corpora.


Table 4. Correct recognition rates (in %) of 70 people in TIMIT corpus.

                    Factory1                        Pink
                    15dB   10dB   5dB    0dB        15dB   10dB   5dB    0dB
MFCC-GMM-CMS        65.00  32.86  11.43   2.86      46.43  21.43   5.00   2.86
MFCC-UBM-CMS        53.57  33.57  10.00   3.57      42.14  22.86   7.86   3.57
ACC-GMM-CMS         82.14  52.86  27.86   6.43      73.57  42.14  18.57   4.29
ACC-UBM-CMS         91.43  72.86  45.71  16.43      86.43  60.71  29.29   6.43
STMF-GMM-CMS (A)    85.00  75.00  57.14  11.43      80.00  64.29  45.00  12.14
STMF-UBM-CMS (A)    92.14  85.00  62.14  16.43      88.57  78.57  56.43  10.00
STMF-GMM-CMS (B)    84.29  75.00  66.43  31.43      80.71  72.14  65.00  38.57
STMF-UBM-CMS (B)    77.14  71.43  62.86  33.57      75.71  65.71  51.43  41.43

[White and F16 columns (15dB, 10dB, 5dB, 0dB each): values not recoverable.]

         MFCC-GMM-CMS      MFCC-UBM-CMS      ACC-GMM-CMS       ACC-UBM-CMS
Clean    100               100               98.57             98.57

         STMF-GMM-CMS (A)  STMF-UBM-CMS (A)  STMF-GMM-CMS (B)  STMF-UBM-CMS (B)
Clean    98.57             98.57             95.71             92.86

Table 4 and Figure 4-8 show recognition rates of six approaches (three features with CMS; GMM or MAP-GMM) for the TIMIT corpus. Table 5 and Figure 4-9 show the comparison for the GRID corpus. Based on these results, the MAP-GMM outperforms the GMM for all features (MFCCs, ACCs and STMF). Compared with the GMM, MFCCs, ACCs and STMF via the MAP-GMM approach perform 9.9%, 12.46% and 13.92% better (average recognition rate in the 5 dB and 10 dB SNR conditions) on the TIMIT corpus, and 14.97%, 15.15% and 16.48% better on the GRID corpus. Therefore, the computational complexity, which is not addressed here, will be the only issue in choosing between GMM and MAP-GMM for a practical system.

Table 5. Correct recognition rates (in %) of GRID corpus.

[Factory1, Pink, White and F16 columns (15dB, 10dB, 5dB, 0dB each): values not recoverable.]

         MFCC-GMM-CMS      MFCC-UBM-CMS      ACC-GMM-CMS       ACC-UBM-CMS
Clean    100               96.52             95.69             95.78

         STMF-GMM-CMS (A)  STMF-UBM-CMS (A)  STMF-GMM-CMS (B)  STMF-UBM-CMS (B)
Clean    94.07             94.95             86.37             84.41


[Plot: recognition rate (%) versus SNR (clean, 15 dB, 10 dB, 5 dB, 0 dB) for MFCC-GMM-CMS, MFCC-UBM-CMS, ACC-GMM-CMS, ACC-UBM-CMS, STMF-GMM-CMS (A) and STMF-UBM-CMS (A).]

FIGURE 4-8 Average recognition rates (in %) of 70 people in TIMIT corpus.

[Plot: recognition rate (%) versus SNR (clean, 15 dB, 10 dB, 5 dB, 0 dB) for MFCC-GMM-CMS, MFCC-UBM-CMS, ACC-GMM-CMS, ACC-UBM-CMS, STMF-GMM-CMS (A) and STMF-UBM-CMS (A).]


4.3 Discussions

From the experimental results in the previous section, it is clear that ACCs are more robust than MFCCs. In addition, the STMF process further enhances the performance of ACCs. Some reasons for these phenomena are given below.

The ACCs are derived from the auditory spectrum, which represents speech energy along the log-frequency axis with a frequency resolution of 24 cochlear filters per octave. This constant frequency resolution is high enough to characterize the 1/3~1/6 octave critical bandwidth measured in human hearing. In addition, the lateral inhibitory network in the first cochlear module sharpens the cochlear filters to have narrower bandwidths. On the other hand, MFCCs use the FFT to transform a time-domain signal into the frequency domain. Such a conventional approach is subject to the trade-off between time and frequency resolution. This difference, constant frequency resolution versus the time-frequency resolution trade-off, might be the main reason why hearing based features usually perform better than conventional FFT-based features.

The STMF process comprises several significant concepts. First, it works on joint spectro-temporal modulations rather than on spectral or temporal modulations separately. Thus, we can extract more high-level features, such as the speaking rate (by temporal modulations) and FM sweeping directions (by joint spectro-temporal modulations). Secondly, the $10^{-2}$-1 mask ($10^{-2}$ for non-speech regions and 1 for speech regions) is intuitively similar to the conventional VAD approach. The major difference is that the VAD masks non-speech regions on a frame-by-frame basis in the time domain, while the STMF process masks non-speech t-f units in the joint spectro-temporal domain.


The parameter set STMF(B) performs better within 0~10 dB than the parameter set STMF(A), but worse in clean and 15 dB conditions with the GMM based recognizer. This is not surprising, since adopting the higher threshold inevitably degrades the intelligibility of clean speech, so that the recognition rates decrease in high SNR conditions. On the other hand, the parameter set STMF(A) produces better results than the parameter set STMF(B) with the MAP-GMM based recognizer, as shown in Table 4 and Table 5. The reason is that the UBM is trained on clean speech passed through the STMF process, and the MAP-GMM then derives each speaker's model by adapting this trained UBM. As mentioned above, the parameter set STMF(B) performs worse in clean conditions and therefore yields a worse UBM. MAP-GMM speaker models adapted from this UBM would thus perform worse, because the UBM captures speaker variability less well. Consequently, the parameter set STMF(B) is suitable for GMM based recognizers, while the parameter set STMF(A) is more favorable for MAP-GMM based recognizers.

