SPEECH FEATURE EXTRACTION - Exploring Low-Dimensional Structures of Modulation Spectrum Features for Robust Speech Recognition

Speech feature extraction converts the input audio signal into a sequence of speech feature vectors while incorporating biological intuitions or auditory characteristics so as to capture more linguistic information. The most widely used speech features to date are Mel-frequency cepstral coefficients (MFCC) and perceptual linear prediction (PLP) coefficients [20]. In this thesis, all experiments are conducted mainly on MFCC features.

In ASR systems, auditory characteristics and biological intuitions are brought into the front-end feature extraction stage to derive a more robust speech feature representation. MFCCs take into account a well-known property of human speech perception: human ears are more sensitive to frequency changes in low-pitched sounds than in high-pitched sounds. Thus, MFCCs can concisely render the acoustic spectral characteristics within a short period of time. Many studies have shown that MFCCs offer high discriminating capability for acoustic units and achieve excellent recognition accuracy [1]. In noisy environments, however, the opposite holds: MFCCs are vulnerable to noise and interference and often require compensation before being used in real-world scenarios. Therefore, we chose MFCCs as the basis for our robustness experiments.

The extraction of MFCCs contains three major steps [21]:

1. Spectral shaping: This step converts the analog speech signal into its digital counterpart, and pre-processes the digital signal to make it more suitable for speech recognition applications.

2. Spectral analysis: This step converts the digital signal into a spectrogram, and analyzes the spectrogram to extract useful information for speech recognition.

3. Coefficient transformation: This step further converts the spectral information into cepstral feature vectors, in order to work around the assumptions of acoustic models and to reduce the dimensionality.

Figure 2-1 The detailed flowchart of MFCC extraction

These steps are described in more detail in the following three subsections, and the overall flowchart is shown in Figure 2-1.

2.1.1 Spectral Shaping

When a speech utterance is received as a time-domain signal, it is first pre-emphasized. This step boosts the magnitude of the high-frequency bands while leaving the low-frequency components essentially unchanged. The purpose of pre-emphasis is to compensate for the distortion caused by the human speech production process. A commonly used pre-emphasis filter is a first-order high-pass filter whose Z-domain transfer function is:

𝐻(𝑧) = 1 − 𝛼 ∙ 𝑧⁻¹ (2.1)

where 𝛼 is the pre-emphasis factor, usually set between 0.9 and 1. Applying the inverse Z-transform, the filter can be restated in the time domain as:

𝐬̃[𝑛] = 𝐬[𝑛] − 𝛼𝐬[𝑛 − 1] (2.2)

where 𝐬[𝑛] is the 𝑛-th sample of the time-domain speech signal and 𝐬̃[𝑛] is the corresponding sample after pre-emphasis. After pre-emphasis, a framing operation turns the signal into a sequence of speech frames.
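As an illustration of Equation (2.2), the following is a minimal sketch of pre-emphasis using only the Python standard library. The function name `pre_emphasize`, the default `alpha=0.97`, and the choice to pass the first sample through unchanged are assumptions made for this example; the text only states that 𝛼 lies between 0.9 and 1.

```python
def pre_emphasize(signal, alpha=0.97):
    """Apply s~[n] = s[n] - alpha * s[n-1] to a list of samples.

    Assumption: the first sample has no predecessor, so it is copied
    unchanged. Other boundary conventions are possible.
    """
    if not signal:
        return []
    out = [signal[0]]
    for n in range(1, len(signal)):
        out.append(signal[n] - alpha * signal[n - 1])
    return out


if __name__ == "__main__":
    # A constant (low-frequency) signal is attenuated after the first sample,
    # while an alternating (high-frequency) signal is amplified.
    print(pre_emphasize([1.0, 1.0, 1.0], alpha=0.5))
    print(pre_emphasize([1.0, -1.0, 1.0], alpha=0.5))
```

The two calls in the usage section show the high-pass behavior: slowly varying input is reduced in magnitude, rapidly varying input is increased.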

Normally a speech signal is non-stationary and changes quite rapidly over time, but over a short time span it behaves differently. Because the glottal system cannot change instantaneously, the speech signal is approximately stationary in the short term. Thus, we can analyze a speech signal by dividing it into short time windows.

A “frame” is a short window of the signal that segments the original speech signal and partially overlaps with its preceding and succeeding frames. Typically, a frame in ASR applications spans 20 ms to 30 ms, over which the signal is treated as stationary, and overlaps with its neighboring frames by about 5 ms to 15 ms.
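The framing operation can be sketched as follows. The function name `frame_signal`, the 25 ms frame length, and the 10 ms frame shift (which yields a 15 ms overlap) are illustrative assumptions chosen to fall inside the ranges quoted above; dropping an incomplete trailing frame is also an assumption of this sketch.

```python
def frame_signal(signal, sample_rate, frame_ms=25.0, shift_ms=10.0):
    """Split a list of samples into overlapping frames.

    frame_ms and shift_ms are hypothetical defaults: consecutive frames
    overlap by frame_ms - shift_ms milliseconds.
    Trailing samples that do not fill a whole frame are dropped.
    """
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    shift = int(round(sample_rate * shift_ms / 1000.0))
    if frame_len <= 0 or shift <= 0:
        raise ValueError("frame length and shift must be at least one sample")
    frames = []
    start = 0
    while start + frame_len <= len(signal):
        frames.append(signal[start:start + frame_len])
        start += shift
    return frames


if __name__ == "__main__":
    # Toy example: 10 samples at 1 kHz, 4 ms frames shifted by 2 ms.
    for frame in frame_signal(list(range(10)), 1000, frame_ms=4, shift_ms=2):
        print(frame)
```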

The next step, spectral analysis, uses the discrete Fourier transform (DFT). Because the DFT implicitly treats each frame as one period of a periodic signal, the discontinuities at the frame boundaries produce spurious high-frequency components. To mitigate this problem, a Hamming window 𝐰[𝑛] is applied to each speech frame:

$$\tilde{\mathbf{s}}[n] = \mathbf{s}[n] \cdot \mathbf{w}[n], \qquad \mathbf{w}[n] = \begin{cases} 0.54 - 0.46 \cos\left(\dfrac{2\pi n}{N-1}\right), & 0 \le n \le N-1 \\ 0, & \text{otherwise} \end{cases} \tag{2.3}$$

where 𝑁 is the number of sample points in each frame and 𝐬̃[𝑛] is the speech frame after the Hamming window has been applied. After windowing, the processed frame is approximately continuous when repeated periodically, which effectively suppresses the high-frequency noise problem.
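A minimal sketch of Equation (2.3) is given below. The function names `hamming_window` and `apply_window` are hypothetical, and returning a window of `[1.0]` for a one-sample frame is an assumption made to avoid a division by zero.

```python
import math


def hamming_window(num_points):
    """Return w[n] = 0.54 - 0.46 * cos(2*pi*n / (N - 1)) for n = 0..N-1."""
    if num_points == 1:
        # Assumption: a single-sample window is taken to be 1.0.
        return [1.0]
    return [
        0.54 - 0.46 * math.cos(2.0 * math.pi * n / (num_points - 1))
        for n in range(num_points)
    ]


def apply_window(frame):
    """Multiply each sample of a frame by the matching window value."""
    window = hamming_window(len(frame))
    return [x * w for x, w in zip(frame, window)]


if __name__ == "__main__":
    # The window tapers both ends of the frame towards 0.08 and
    # leaves the centre sample at full weight.
    print(apply_window([1.0, 1.0, 1.0, 1.0, 1.0]))
```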

2.1.2 Spectral Analysis

As it is difficult to analyze a speech signal in the time domain, we convert the signal inside each frame into a frequency-domain spectrum. This is done with the discrete Fourier transform (DFT):

$$X[k] = \sum_{n=0}^{N-1} \mathbf{x}[n] \, e^{-\frac{2 i \pi k}{N} n}, \qquad k = 0, \ldots, N-1 \tag{2.4}$$

where 𝐱[𝑛] represents the time-domain signal in the processed frame, 𝑁 the number of sample points in this frame, and 𝑋[𝑘] the spectrum of this frame.
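To make the transform concrete, the sketch below evaluates the DFT sum directly, with the sample index running over all 𝑁 points of the frame. It is an O(N²) illustration rather than a practical implementation, since real systems use a fast Fourier transform. The function names `dft` and `power_spectrum` are hypothetical, and the use of the squared magnitude (rather than the magnitude) as input to the later filter bank is an assumption not stated in the text.

```python
import cmath


def dft(frame):
    """Direct evaluation of X[k] = sum_n x[n] * exp(-2j*pi*k*n/N)."""
    n_points = len(frame)
    spectrum = []
    for k in range(n_points):
        acc = 0j
        for n in range(n_points):
            acc += frame[n] * cmath.exp(-2j * cmath.pi * k * n / n_points)
        spectrum.append(acc)
    return spectrum


def power_spectrum(frame):
    """Squared magnitude of bins 0..N/2.

    For a real-valued frame the remaining bins mirror these, so they
    carry no additional information.
    """
    spectrum = dft(frame)
    return [abs(c) ** 2 for c in spectrum[: len(frame) // 2 + 1]]


if __name__ == "__main__":
    # A constant frame has all of its energy in the zero-frequency bin.
    print(power_spectrum([1.0, 1.0, 1.0, 1.0]))
```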

The frequency perceived by the human ear is related non-linearly, in fact approximately logarithmically, to the physical frequency of the speech signal. In short, the human ear is more sensitive to changes at low frequencies. To simulate this behavior, a bank of triangular band-pass filters is applied to the spectrum. These filters partially overlap and are evenly distributed along the Mel-scale frequency axis [22]. The Mel scale is a scale of pitches adjusted according to how human listeners judge the distances between frequencies. Its relationship to the linear frequency scale can be expressed as:

$$\mathrm{Mel}(f) = 1127 \ln\left(1 + \frac{f}{700}\right) \tag{2.5}$$

Besides simulating the behavior of the human ear, the filter bank also reduces the dimensionality of the data while keeping the information most useful for speech recognition.

Furthermore, we take the logarithm of the filter-bank outputs, so that these coefficients are less sensitive to drastic frequency changes.
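The Mel mapping of Equation (2.5), the triangular filter bank, and the logarithm can be sketched together as follows. All function names are hypothetical. The sketch further assumes that the filters cover the range from 0 Hz to half the sampling rate, that each triangle peaks at 1.0, and that a small `floor` value guards against taking the logarithm of zero; none of these details are specified in the text.

```python
import math


def hz_to_mel(freq_hz):
    """Mel(f) = 1127 * ln(1 + f / 700)."""
    return 1127.0 * math.log(1.0 + freq_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (math.exp(mel / 1127.0) - 1.0)


def mel_filterbank(num_filters, num_bins, sample_rate):
    """Triangular filters with centres evenly spaced on the Mel axis.

    num_bins (>= 2) is the number of spectrum bins covering
    0 .. sample_rate / 2. Returns num_filters rows of num_bins weights.
    """
    nyquist = sample_rate / 2.0
    max_mel = hz_to_mel(nyquist)
    # num_filters + 2 edge points; filter m spans points m-1, m, m+1.
    edges_hz = [
        mel_to_hz(max_mel * i / (num_filters + 1))
        for i in range(num_filters + 2)
    ]
    bin_hz = [b * nyquist / (num_bins - 1) for b in range(num_bins)]
    bank = []
    for m in range(1, num_filters + 1):
        left, centre, right = edges_hz[m - 1], edges_hz[m], edges_hz[m + 1]
        row = []
        for f in bin_hz:
            if left < f <= centre:
                row.append((f - left) / (centre - left))  # rising slope
            elif centre < f < right:
                row.append((right - f) / (right - centre))  # falling slope
            else:
                row.append(0.0)
        bank.append(row)
    return bank


def log_filterbank_energies(power_spec, bank, floor=1e-10):
    """Weighted sum per filter, followed by a natural logarithm."""
    energies = []
    for row in bank:
        energy = sum(w * p for w, p in zip(row, power_spec))
        energies.append(math.log(max(energy, floor)))
    return energies


if __name__ == "__main__":
    bank = mel_filterbank(num_filters=4, num_bins=65, sample_rate=8000)
    print(log_filterbank_energies([1.0] * 65, bank))
```

Because the filter edges are equally spaced in Mel rather than in Hz, the low-frequency filters are narrow and the high-frequency filters are wide, which mirrors the finer low-frequency resolution of human hearing described above.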

2.1.3 Coefficient Transformation

While the log-filter-bank coefficients already retain most of the useful information at a low dimensionality, they are strongly correlated with each other, which makes them a poor fit for the modeling assumptions of HMM-based recognizers. Therefore, the type-II discrete cosine transform (DCT) is used to decorrelate these coefficients and transform them into the cepstral domain, yielding the MFCCs. Although there are typically more than 20 filter-bank coefficients, only 12 or 13 MFCCs are kept, a choice that is largely empirical. Although neighboring frames overlap to some extent, each MFCC feature vector still describes only the information obtained from a single frame. Thus, it makes sense to append delta coefficients to the original MFCC features in order to incorporate some contextual information.
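The DCT step described above can be sketched as follows. The function names are hypothetical, the transform is left unnormalized (scaling conventions differ between toolkits), and keeping the first 13 coefficients, including the zeroth, is one possible reading of the "12 or 13" mentioned in the text.

```python
import math


def dct_ii(values):
    """Unnormalized type-II DCT.

    C[k] = sum_m x[m] * cos(pi * k * (m + 0.5) / M), for k = 0..M-1.
    """
    m_total = len(values)
    return [
        sum(
            x * math.cos(math.pi * k * (m + 0.5) / m_total)
            for m, x in enumerate(values)
        )
        for k in range(m_total)
    ]


def mfcc_from_log_energies(log_energies, num_ceps=13):
    """Keep only the first num_ceps cepstral coefficients (an assumption)."""
    return dct_ii(log_energies)[:num_ceps]


if __name__ == "__main__":
    # 24 hypothetical log-filter-bank values reduced to 13 coefficients.
    log_energies = [0.1 * i for i in range(24)]
    print(mfcc_from_log_energies(log_energies))
```

Truncating the DCT output discards the higher-order coefficients, which describe fine detail across the filter bank, and keeps the smooth spectral envelope.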

To calculate the delta coefficients, we can use the following equation:

$$\mathbf{d}_t = \frac{\sum_{n=1}^{N} n \left( \mathbf{c}_{t+n} - \mathbf{c}_{t-n} \right)}{2 \sum_{n=1}^{N} n^2} \tag{2.6}$$

where 𝐝_𝑡 is the delta coefficient at time 𝑡, 𝐜_𝑡 the corresponding MFCC coefficient, and 𝑁 the step size of this delta operation. The acceleration, that is, the delta of the delta, can be calculated in the same manner by applying Equation (2.6) to the delta features again.
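A minimal sketch of the delta computation is shown below. The function name `delta`, the default `step=2`, and the handling of the utterance boundaries (the first and last frames are repeated where 𝑡 − 𝑛 or 𝑡 + 𝑛 falls outside the utterance) are assumptions of this example; the text does not specify them.

```python
def delta(features, step=2):
    """Regression-based delta over a list of feature vectors.

    features: list of frames, each a list of coefficients.
    Assumption: indices outside the utterance are clamped to the first
    or last frame, i.e. edge frames are repeated.
    """
    if step < 1:
        raise ValueError("step must be at least 1")
    num_frames = len(features)
    denom = 2.0 * sum(n * n for n in range(1, step + 1))
    deltas = []
    for t in range(num_frames):
        row = []
        for j in range(len(features[t])):
            acc = 0.0
            for n in range(1, step + 1):
                ahead = features[min(t + n, num_frames - 1)][j]
                behind = features[max(t - n, 0)][j]
                acc += n * (ahead - behind)
            row.append(acc / denom)
        deltas.append(row)
    return deltas


if __name__ == "__main__":
    # One coefficient per frame, rising by 1.0 each frame.
    ramp = [[float(t)] for t in range(7)]
    d = delta(ramp)       # delta features
    dd = delta(d)         # acceleration (delta of delta)
    print(d[3], dd[3])
```

Applied to a feature that rises linearly over time, the delta in the interior of the utterance equals the slope and the acceleration is zero, which is the behavior expected of a first and second derivative estimate.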
