Speaker Adaptation - Fuzzy Logic Control for Parameter Adaptation in Speaker Adaptation and Audio Event Detection

1. Introduction

1.2 Speaker Adaptation

Computing techniques for automatic speech recognition have existed for years [3] and, as they have matured, have found more and more applications in daily life [4]. Nevertheless, the recognition performance of every speech recognition system built so far remains inferior to that of a human listener, as pointed out in [5].

Fig. 1.2. The operating structure of a typical speech recognition system.

Fig. 1.2 depicts the operating structure of a typical speech recognition system for capturing specific short phrases or primitive statements only. During operation, any disturbance that causes a mismatch between the pre-established reference templates and the testing template compromises the recognition performance. Sources of disturbance include

• speech from speakers unfamiliar to the system

• speech from a speaker known to the system but in poor “vocal shape”

• various interferences in the background

• channel distortion induced in the acquisition process

and so forth.

[Fig. 1.2 block diagram: signal input; pre-processing (framing, pre-emphasis, Hamming window); feature extraction; reference templates (training); testing template; template matching; recognized result.]

Countermeasures can be taken in two respects:

(1) Signal filtering and normalization are deployed so that the operating condition aligns with the reference condition as closely as possible.

(2) Internal tuning of the reference settings is undertaken so that the system adapts to the actual operating environment when new speakers appear.

Techniques in the first category work at the signal-processing level and are referred to as speech enhancement or feature-based adaptation: noise attached to the signals is removed so that the speech signals are as clean as possible and thus resemble the reference templates as closely as possible. Cepstral mean normalization (CMN, also called cepstral mean subtraction, CMS) [6] and signal bias removal (SBR) [7] also fall into this category and are popular for their simplicity and effectiveness.
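As an illustration of why CMN is simple yet effective, the following is a minimal sketch and not the implementation used in [6]. It relies on the standard observation that a stationary channel shows up as an additive constant in the cepstral domain, so subtracting the per-utterance mean cancels it. The function name, the 13-coefficient frame size, and the synthetic data are assumptions made for the example.

```python
import numpy as np

def cepstral_mean_normalization(cepstra: np.ndarray) -> np.ndarray:
    """Subtract the per-utterance mean from each cepstral coefficient.

    cepstra: array of shape (num_frames, num_coeffs).
    """
    return cepstra - cepstra.mean(axis=0, keepdims=True)

# Hypothetical toy data: 200 frames of 13 cepstral coefficients.
rng = np.random.default_rng(0)
clean = rng.normal(size=(200, 13))
channel_offset = rng.normal(size=(1, 13))  # assumed stationary channel bias
distorted = clean + channel_offset

# After CMN the constant offset vanishes, so both versions coincide.
cmn_clean = cepstral_mean_normalization(clean)
cmn_distorted = cepstral_mean_normalization(distorted)
```

In this toy setting the normalized clean and distorted features are identical, which is the sense in which CMN brings the operating condition back toward the reference condition.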

Approaches in the second category use sample utterances collected from the new speaker (the end user of the system) to adapt the internal parameter settings of the pre-established speech model. Consequently, they are referred to as model-based adaptation or speaker adaptation.

[Fig. 1.3 diagram: Speech Recognition; Speaker Adaptation; Bayesian-based adaptation (e.g., MAP, VFS); transformation-based adaptation (e.g., MLLR); eigenvoice-based adaptation; Fuzzy Logic Control; FCMAP, FLC-VFS, FLC-MLLR.]

Fig. 1.3. Three categories of speaker adaptation techniques in speech recognition.

Fig. 1.3 outlines the chronological development of the three major speaker adaptation schemes. MAP adaptation, which appeared around 1991 and is the representative of Bayesian-based adaptation, works better than the maximum likelihood (ML) estimate because it takes the prior means of the model into account. By the nature of the MAP computation, only the portions of the speech model associated with the adaptation samples get updated. The VFS scheme therefore came into play as a supplement to MAP, extending the coverage of adaptation in the model space. Given the same adaptation data, MAP-VFS adaptation in general yields better recognition performance than MAP alone.

MLLR adaptation first appeared in 1995 and became the representative of transformation-based adaptation, in which linear regression is employed to derive the transformation matrix using an ML estimate. Because the transformation is applied by matrix multiplication, the entire model space is adapted at once, even though the sample utterances might convey very limited information for adaptation. In a sense, given the same adaptation samples, MLLR provides an overall but somewhat coarser adaptation of the speech model, whereas MAP produces a local yet specific adaptation effect.
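The local versus global contrast between MAP and MLLR can be sketched numerically. The example below uses the common textbook forms of the two mean updates, which are assumed here and are not taken from this document: the MAP mean is a weighted average of the prior mean and the adaptation frames with prior weight tau, and MLLR applies one shared affine transform to every mean. The ML estimation of the MLLR transform is omitted, and the values of tau, A, and b are hypothetical.

```python
import numpy as np

def map_update_means(prior_means, frames, posteriors, tau=10.0):
    """MAP re-estimation of Gaussian means (textbook form, assumed).

    prior_means: (K, D) speaker-independent means.
    frames:      (T, D) adaptation feature vectors.
    posteriors:  (T, K) occupation probability of each Gaussian per frame.
    tau:         prior weight (hypothetical value).
    """
    occupancy = posteriors.sum(axis=0)        # (K,) soft frame counts
    weighted_sum = posteriors.T @ frames      # (K, D)
    return (tau * prior_means + weighted_sum) / (tau + occupancy)[:, None]

def mllr_transform_means(prior_means, A, b):
    """Apply one shared affine transform mu' = A mu + b to every mean."""
    return prior_means @ A.T + b

prior_means = np.array([[0.0, 0.0], [5.0, 5.0], [-5.0, 5.0]])
frames = np.array([[1.0, 1.0], [1.2, 0.8], [0.8, 1.2]])
posteriors = np.array([[1.0, 0.0, 0.0]] * 3)  # every frame aligned to Gaussian 0

# MAP: only Gaussian 0 has adaptation data, so only its mean moves.
map_means = map_update_means(prior_means, frames, posteriors, tau=3.0)

# MLLR: a single (here hand-picked) transform moves all means at once.
mllr_means = mllr_transform_means(prior_means, np.eye(2), np.array([0.5, 0.5]))
```

With these toy values, MAP leaves the two unobserved Gaussians at their prior means, whereas the MLLR transform shifts all three, which mirrors the coverage gap that VFS is meant to fill for MAP.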

One thing common to both MAP and MLLR is that the quality of adaptation depends on the amount and adequacy of the adaptation samples: the more samples, the better the adaptation quality, which in turn determines the recognition performance. When the adaptation utterances from a new speaker are insufficient, the effect of either MAP or MLLR adaptation becomes questionable: the recognition rate can fall below the baseline, i.e., be worse than with no adaptation at all, as shown by the author’s experiments [8, 9].

Eigenvoice-based adaptation [10-20] is a relatively young member of the speaker adaptation family, first appearing around 2000, and is also known as speaker-clustering-based adaptation. A speaker-dependent (SD) speech model is established for every member of a group of speakers, and feature vectors called eigenvoices are extracted from these models through PCA to build the eigenvoice speech model. Adaptation of the speech model (an eigenvoice vector space) can then be undertaken when adaptation data is available, as shown in Fig. 1.4.

[Fig. 1.4 diagram: SD models for Speaker 1 through Speaker N; PCA; eigenvoices with retained and unused weights; adaptation data; ML estimate; speaker-adapted (SA) model.]

Fig. 1.4. Eigenvoice-based adaptation.
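The pipeline in Fig. 1.4 can be sketched as follows. This is a simplified illustration under stated assumptions and not the procedure of [10-20]: each SD model is represented by a single "supervector" (for instance, its concatenated Gaussian means), PCA is computed with an SVD, and the ML estimate of the weights is replaced by a least-squares projection, which matches the ML solution only under identity covariances and fully observed supervectors. All function names and the synthetic data are hypothetical.

```python
import numpy as np

def build_eigenvoices(sd_supervectors, num_eigenvoices):
    """PCA over SD supervectors (one row per training speaker).

    Returns the mean supervector and the leading principal directions.
    """
    mean = sd_supervectors.mean(axis=0)
    centered = sd_supervectors - mean
    # Rows of vt are orthonormal principal directions, sorted by variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean, vt[:num_eigenvoices]

def adapt_speaker(mean, eigenvoices, observed_supervector):
    """Estimate eigenvoice weights and rebuild the adapted supervector.

    Simplification: least-squares projection stands in for the ML estimate.
    """
    weights = eigenvoices @ (observed_supervector - mean)
    return weights, mean + weights @ eigenvoices

# Synthetic SD models: 10 speakers whose 6-dim supervectors vary in 2 directions.
rng = np.random.default_rng(0)
basis = rng.normal(size=(2, 6))
sd_supervectors = rng.normal(size=(10, 2)) @ basis + 1.0

mean, eigenvoices = build_eigenvoices(sd_supervectors, num_eigenvoices=2)

# A new speaker is described by just two weights instead of six parameters.
new_speaker = np.array([0.3, -0.7]) @ basis + 1.0
weights, adapted = adapt_speaker(mean, eigenvoices, new_speaker)
```

The point of the design is that only a handful of weights must be estimated from the adaptation data, which is why the approach suits situations where little data is available.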

To summarize, speaker adaptation is a process that turns speaker-independent (SI) speech models into speaker-adapted (SA) ones, as shown in Fig. 1.5.

[Fig. 1.5 diagram: a speech recognition system with SI models is shipped from the manufacturer to the end user; adaptation data from end users feed the adaptation process, which yields a speech recognition system with SA models.]

Fig. 1.5. Speaker adaptation scheme.

To ensure the quality of adaptation when adaptation samples are scarce, the author proposes a general framework for enhancing MAP, VFS and MLLR adaptation. The resulting implementations are named FCMAP, FLC-VFS and FLC-MLLR, respectively, where FLC stands for fuzzy logic control and indicates the underlying fuzzy mechanism incorporated in the general system architecture.
