
3.1 LPC Cepstrum Features

In the following, we briefly describe the LPC cepstrum features used as input in our approach.

In speech recognition, feature extraction plays an important role: its quality significantly affects recognition performance. Some of the most significant feature extraction methods include linear predictive coding (LPC) [3], the cepstrum transform [33], and the LPC cepstrum transform.

3.1.1 Linear Predictive Coding (LPC)

Linear predictive coding [3] is an important signal representation in speech processing due to its effective representation of speech and its fast computation. The basic concept is that each speech sample can be represented by a linear combination of the previous P samples.

We define $\{\alpha_k\}$ to be the LPC prediction coefficients. The notations $S(n)$ and $\tilde{S}(n)$ stand for the speech signal and the predicted speech signal, respectively. A linear predictor with prediction coefficients $\{\alpha_k\}$ is defined as

$\tilde{S}(n) = \sum_{k=1}^{P} \alpha_k S(n-k). \qquad (121)$

The short-time average prediction error $E$ of the predicted speech is defined as

$E = \sum_{n=1}^{N} e(n)^2, \qquad (122)$

$e(n) = \Bigl| S(n) - \sum_{k=1}^{P} \alpha_k S(n-k) \Bigr|. \qquad (123)$

To obtain the LPC parameters, we minimize the prediction error. Various formulations of linear prediction analysis have been proposed, such as the autocorrelation method, the covariance method, the lattice method, and so on.
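As an illustration of the autocorrelation formulation, the following sketch estimates the prediction coefficients $\{\alpha_k\}$ of a single frame with the Levinson-Durbin recursion. It is not taken from the original system; the function name, the NumPy usage, and the random placeholder frame are our own assumptions.

```python
import numpy as np

def lpc_autocorrelation(frame, order):
    """Estimate LPC prediction coefficients alpha_1..alpha_P for one frame
    using the autocorrelation method and the Levinson-Durbin recursion."""
    frame = np.asarray(frame, dtype=float)
    n = len(frame)
    # Autocorrelation values r(0)..r(P)
    r = np.array([np.dot(frame[:n - k], frame[k:]) for k in range(order + 1)])
    if r[0] == 0.0:                       # silent frame: no prediction possible
        return np.zeros(order)

    a = np.zeros(order + 1)               # a[k] will hold alpha_k (a[0] unused)
    error = r[0]                          # prediction error at each step
    for i in range(1, order + 1):
        acc = r[i] - np.dot(a[1:i], r[i - 1:0:-1])
        k = acc / error                   # reflection coefficient
        a_new = a.copy()
        a_new[i] = k
        a_new[1:i] = a[1:i] - k * a[i - 1:0:-1]
        a = a_new
        error *= (1.0 - k * k)
    return a[1:]                          # alpha_1 .. alpha_P

# Example: order-10 LPC on one 256-sample frame (the settings used in Section 3.4)
frame = np.random.randn(256)              # placeholder for a real speech frame
alpha = lpc_autocorrelation(frame, order=10)
```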

3.1.2 Cepstrum

The cepstrum [33] is defined as the power spectrum of the logarithm of the power spectrum. It can represent the spectral envelope and the properties of small speech variations.

The cepstrum is defined as follows:

$C(n) = \frac{1}{N} \sum_{k=0}^{N-1} \log |X(k)| \, e^{j 2\pi k n / N}, \qquad 0 \le n \le N-1. \qquad (124)$

The $\{X(k)\}$ are the Fourier coefficients of the speech signal. The computational complexity of this method is very high. Instead, the cepstrum coefficients can be computed recursively from the LPC parameters presented previously. This recursion is defined as

$C(n) = \alpha(n) + \sum_{k=1}^{n-1} \frac{k}{n} \, C(k) \, \alpha(n-k), \qquad n \ge 1. \qquad (125)$

The $\{\alpha(n)\}$ are the LPC parameters and the $\{C(n)\}$ are the LPC cepstrum features.
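As an illustrative sketch (not part of the original text), the recursion of Eq. (125) can be implemented as follows. Treating $\alpha(n)$ as zero for $n$ beyond the LPC order is our assumption, and the example reuses the hypothetical lpc_autocorrelation() sketch given earlier.

```python
import numpy as np

def lpc_to_cepstrum(alpha, n_ceps):
    """LPC cepstrum features C(1..n_ceps) from LPC parameters alpha(1..P),
    following the recursion of Eq. (125)."""
    p = len(alpha)
    a = lambda n: alpha[n - 1] if 1 <= n <= p else 0.0   # alpha(n), zero outside 1..P
    c = np.zeros(n_ceps + 1)                             # c[n] = C(n), c[0] unused
    for n in range(1, n_ceps + 1):
        c[n] = a(n) + sum((k / n) * c[k] * a(n - k) for k in range(1, n))
    return c[1:]

# Example: 10 LPC cepstrum features from the order-10 LPC parameters of one frame
ceps = lpc_to_cepstrum(alpha, n_ceps=10)   # `alpha` from the earlier sketch
```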

3.2 Dynamic Time Warping Algorithm

Even if the same speaker utters the same word, the duration changes every time with nonlinear contraction and expansion. DTW was proposed to solve this problem. It can nonlinearly expand or contract the time axis to match the same phonemes between reference patterns and input patterns.

This nonlinear matching can be efficiently accomplished by using the dynamic programming (DP) technique [42, 2]. DTW assigns an input pattern to the class of the reference pattern that has the minimal difference from it. The method is similar to searching for shortest paths in weighted graphs.

The local minimal difference over the first $m$ input frames is defined as

$D(m, n) = d(m, n) + \min\!\left(\begin{array}{c} D(m-1,\, n) \\ D(m-1,\, n-1) \\ D(m-1,\, n-2) \end{array}\right), \qquad 1 \le m \le M,\ 1 \le n \le N. \qquad (126)$

The notation $D(m, n)$ denotes the local minimal difference between the first $m$ input frames and the first $n$ reference frames, and $d(m, n)$ denotes the frame difference between the $m$th input frame and the $n$th reference frame.

The nonlinear matching algorithm will stop when we obtain the global minimal difference D(M, N ).

Figure 13 shows the matching process. Next, we select the winning reference pattern with the minimal $D(M, N)$:

$\mathrm{winner} = \arg\min_{p} D_p(M, N), \qquad 1 \le p \le P. \qquad (127)$


Figure 13: The matching process of DTW.

Figure 14: Framework of our speech recognition system.

The input pattern is classified to the class of the winning reference pattern.
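A minimal sketch of the DTW matching and winner selection described above (Eqs. 126 and 127) is given below. It is not the authors' implementation; the Euclidean frame difference, the boundary handling, and the function names are our own assumptions.

```python
import numpy as np

def frame_difference(x, y):
    """d(m, n): difference between an input frame and a reference frame.
    The text does not specify the measure; Euclidean distance is assumed."""
    return float(np.linalg.norm(np.asarray(x) - np.asarray(y)))

def dtw_distance(input_frames, ref_frames):
    """Accumulate D(m, n) of Eq. (126) and return the global difference D(M, N).
    Boundary handling (D(0, 0) = 0, everything else infinite) is our assumption."""
    M, N = len(input_frames), len(ref_frames)
    # Column index (n + 1) holds frame n, so the n-2 step never wraps around.
    D = np.full((M + 1, N + 2), np.inf)
    D[0, 1] = 0.0                                     # D(0, 0) = 0
    for m in range(1, M + 1):
        for n in range(1, N + 1):
            d = frame_difference(input_frames[m - 1], ref_frames[n - 1])
            D[m, n + 1] = d + min(D[m - 1, n + 1],    # D(m-1, n)
                                  D[m - 1, n],        # D(m-1, n-1)
                                  D[m - 1, n - 1])    # D(m-1, n-2)
    return D[M, N + 1]

def classify(input_frames, reference_patterns):
    """Eq. (127): the winner is the reference pattern p with minimal D_p(M, N)."""
    return int(np.argmin([dtw_distance(input_frames, ref)
                          for ref in reference_patterns]))
```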

3.3 Our Speech Recognition System

The framework of our speech recognition system is mainly based on LPC cepstrum features and ITRFN, as shown in Figure 14. The network architecture and operations are revised to make them more suitable for speech recognition. The details of our system are described as follows.

3.3.1 Data Preprocessing

For each training pattern X with class “C”, we divide it into n frames. Assume the order of LPC cepstrum features is N and the training pattern X is defined as

$X = (X(1), X(2), \ldots, X(n)),$
$X(t) = (x_1(t), x_2(t), \ldots, x_N(t)), \qquad 1 \le t \le n. \qquad (128)$

These frames are fed to the network sequentially. At the $t$th time period, we assume the input frame $X(t)$ is presented at the input layer and the training phase of this input frame begins.
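A brief sketch of this preprocessing step, under the settings of Section 3.4 (non-overlapping 256-sample frames, order-10 LPC cepstrum features), is given below; it reuses the hypothetical helper functions from the earlier sketches.

```python
import numpy as np

def preprocess(signal, frame_width=256, lpc_order=10):
    """Split a speech signal into non-overlapping frames and compute the
    LPC cepstrum feature vector X(t) of each frame (Eq. 128)."""
    n_frames = len(signal) // frame_width          # n: number of frames
    pattern = []
    for t in range(n_frames):
        frame = signal[t * frame_width:(t + 1) * frame_width]
        alpha = lpc_autocorrelation(frame, order=lpc_order)
        pattern.append(lpc_to_cepstrum(alpha, n_ceps=lpc_order))
    return np.array(pattern)                       # shape (n, N): frames x feature order
```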

3.3.2 Architecture

The architecture of the ITRFN-based speech recognition system is similar to the one shown in Figure 10. There are five layers in the network, and the operation of each layer is described as follows; a sketch of the complete forward pass is given after the layer descriptions.

1. Layer 1: Input layer. Layer 1 contains $N$ nodes. Node $i$ of this layer produces output $o^{(1)}_i(t)$ by transmitting its input signal $x_i(t)$ directly to layer 2, i.e.,

$o^{(1)}_i(t) = x_i(t). \qquad (129)$

2. Layer 2: Frame layer. Two types of nodes are included in this layer. For the first type of nodes, which accept inputs from the first layer, there are $J$ groups and each group contains $N \times F_j$ nodes. Note that $J$ represents the number of reference patterns stored in the network and $F_j$ represents the number of frames included in reference pattern $j$. The $i$th node in the $f$th frame of reference pattern $j$ produces its output, $E.o^{(2)}_{ifj}(t)$, by computing the value of the corresponding Gaussian function $g$ with mean $m_{ifj}$ and deviation $\sigma_{ifj}$, i.e.,

$E.o^{(2)}_{ifj}(t) = g(o^{(1)}_i(t); m_{ifj}, \sigma_{ifj}) = \exp\!\left[-\left(\frac{x_i(t) - m_{ifj}}{\sigma_{ifj}}\right)^{2}\right]. \qquad (130)$

For the other type of nodes, which accept inputs from the internal variable $h_j(t)$, the number of nodes is $J$. The $j$th node produces its output, $I.o^{(2)}_j(t)$, by computing the value of the corresponding sigmoid function with two parameters, $b_j$ and $c_j$, i.e.,

$I.o^{(2)}_j(t) = s(h_j(t); b_j, c_j) = \frac{1}{1 + e^{-b_j h_j(t) - c_j}}. \qquad (131)$

3. Layer 3: Pattern layer. Layer 3 contains $J$ nodes and each node represents one reference pattern. Moreover, each pattern node and its corresponding frame nodes compose a pattern subnet. Node $j$'s output in this layer, $o^{(3)}_j(t)$, is the product of all its inputs from layer 2, i.e.,

$o^{(3)}_j(t) = \prod_{f=1}^{F_j} \prod_{i=1}^{N} E.o^{(2)}_{ifj}(t) \times I.o^{(2)}_j(t). \qquad (132)$

4. Layer 4: Recurrent layer. Layer 4 contains $J$ nodes. Node $j$ in this layer computes the centroid defuzzification result of the internal variable $h_j(t+1)$. Therefore, its output, $o^{(4)}_j(t)$, is calculated by the following equation:

$h_j(t+1) = o^{(4)}_j(t) = \frac{\sum_{k=1}^{J} o^{(3)}_k(t) \times v_{jk}(t)}{\sum_{k=1}^{J} o^{(3)}_k(t)}, \qquad (133)$

where

$v_{jk}(t) = w_{jk0} + \sum_{p=1}^{P_2} w_{jkp}\, x_1(t-p+1) + \sum_{q=1}^{Q_2} w_{jk(P_2+q)}\, h(t-q+1). \qquad (134)$

5. Layer 5. Layer 5 contains $M$ nodes and each node represents one class. Note that $M$ is the total number of pattern classes. The output of the $m$th node, $o^{(5)}_m(t)$, represents the centroid defuzzification result, i.e.,

$o^{(5)}_m(t) = \frac{\sum_{j=1}^{J} o^{(3)}_j(t) \times u_j(t)}{\sum_{j=1}^{J} o^{(3)}_j(t)}. \qquad (135)$

Apparently, $m_{ifj}$, $\sigma_{ifj}$, $b_j$, $c_j$, $a_{ij}$, and $u_j$ are the parameters that can be tuned to improve the performance of the network. We call these the adaptive parameters of the network.

Application                English alphabet A-Z
Sampling rate              11025 Hz
The longest pattern        6849 samples
The shortest pattern       3009 samples
Frame width                256 samples/frame
Speech frames              Non-overlapped
Feature extraction         LPC cepstrum features
Feature order              10
No. of training patterns   130 (26 × 5)
No. of test patterns       390 (26 × 15)
No. of training loops      7

Table 29: The specification of the speech recognition experiments.
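To summarize how Eqs. (129)-(135) fit together, the following is a simplified forward-pass sketch for one time step. It is not the authors' implementation: we assume all reference patterns are padded to a common number of frames, reduce Eq. (134) to the special case $P_2 = Q_2 = 1$ with $x_1(t)$ read as the first feature component and $h(t)$ read as $h_j(t)$, and read the weight $u_j(t)$ of Eq. (135) as a per-class weight $u_{mj}$ so that the $M$ outputs can differ; these readings are assumptions on our part.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def itrfn_forward(x, h, params):
    """One time step of the five-layer forward pass (Eqs. 129-135), simplified.

    Assumed shapes (ours, not from the paper):
      x          : (N,)       input frame; layer 1 passes it through (Eq. 129)
      h          : (J,)       internal variables h_j(t)
      m, sigma   : (J, F, N)  Gaussian means/deviations of the frame nodes,
                              with every pattern padded to a common F frames
      b, c       : (J,)       sigmoid parameters of the internal-variable nodes
      w0, w1, w2 : (J, J)     Eq. (134) weights, simplified to P2 = Q2 = 1
      u          : (M, J)     layer-5 weights, read as per-class weights u_mj
    Returns (o5, h_next): the class outputs o^(5)_m(t) and the updated h_j(t+1).
    """
    m, sigma = params["m"], params["sigma"]
    b, c = params["b"], params["c"]
    w0, w1, w2 = params["w0"], params["w1"], params["w2"]
    u = params["u"]

    # Layer 2: Gaussian frame nodes (Eq. 130) and sigmoid internal-variable nodes (Eq. 131)
    E = np.exp(-((x[None, None, :] - m) / sigma) ** 2)        # (J, F, N)
    I = sigmoid(b * h + c)                                    # (J,)

    # Layer 3: each pattern node multiplies all of its layer-2 inputs (Eq. 132)
    o3 = E.reshape(E.shape[0], -1).prod(axis=1) * I           # (J,)

    # Layer 4: centroid defuzzification giving the next internal variables (Eqs. 133-134)
    v = w0 + w1 * x[0] + w2 * h[:, None]                      # v_jk(t), (J, J)
    h_next = (v @ o3) / o3.sum()

    # Layer 5: centroid defuzzification giving the class outputs (Eq. 135)
    o5 = (u @ o3) / o3.sum()
    return o5, h_next
```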

3.4 Experimental Results and Discussion

We perform experiments on the recognition of the English letters A-Z. The sampling rate is 11025 Hz and the frame width is 256 samples; therefore, each frame is in a quasi-stationary state of speech. The frames do not overlap. We adopt LPC cepstrum features of order 10 as the input vectors of both systems. The number of training patterns is 130 (5 per letter) and the number of test patterns is 390 (15 per letter). The patterns of any one letter are uttered by different speakers. The specification of the experiments is listed in Table 29.

Table 30 shows the number of frames of the longest and the shortest training patterns. The results of the experiments are listed in Table 31. From this table, we see that the storage space used by DTW is 2.8 times larger than that used by our approach. Furthermore, our approach achieves a better recognition rate than DTW.

We give an example to illustrate our experiments.

Letter   Longest speech (frames)   Shortest speech (frames)

A 17 17

B 16 17

C 22 25

D 15 17

E 16 19

F 15 18

G 17 17

H 18 22

I 17 18

J 19 20

K 19 21

L 16 18

M 17 18

N 17 19

O 14 19

P 18 20

Q 21 22

R 17 19

S 19 21

T 18 21

U 18 20

V 20 22

W 22 24

X 21 23

Y 17 20

Z 18 22

Table 30: The number of frames of the longest and the shortest training patterns.

                 Recognition rate (%)   Storage size (no. of reference patterns)
DTW                      80                              130
Our approach             87                               47

Table 31: A comparison between DTW and our neural approach.

Figure 15: The waveform of a training pattern.

Figure 16: The waveform of a test pattern.

The waveforms of a training pattern and an unknown test pattern are shown in Figure 15 and Figure 16, respectively. The training pattern is much shorter than the test pattern, and both patterns belong to the same class, "F". The LPC cepstrum features of these patterns are listed in Table 32 and Table 33, respectively.

The length of the training pattern is 3712 samples, and the span between two samples is about 91 µs (at the 11025 Hz sampling rate). The length of the test pattern is 5120 samples. Nevertheless, the test pattern is correctly classified to the class of the training pattern by our method.

The main reason why the recognition rate of our approach is better than that of DTW lies in the similarly uttered words (e.g., B, C, D, E, G, P, T, V, and Z).

           LPC cepstrum features
Frame 1:  -1.06 -2.55  0.02 -0.98  2.86 -0.98 -0.33 -1.89 -1.30 -0.60
Frame 2:   0.64 -3.36  4.04 -3.61  5.31 -3.08  1.57 -1.97 -0.42 -0.33
Frame 3:   0.91 -4.24  4.75 -5.29  6.43 -4.43  2.54 -2.47 -0.21 -0.43
Frame 4:   1.34 -4.62  5.74 -6.94  7.38 -6.26  3.62 -3.05  0.19 -0.45
Frame 5:   1.37 -4.28  5.29 -7.03  6.43 -6.18  3.49 -2.71  0.29 -0.46
Frame 6:   1.96 -4.45  5.31 -6.98  5.39 -5.45  2.48 -1.79  0.10 -0.38
Frame 7:   1.72 -1.81  1.21 -2.41  1.59 -2.49  0.10 -0.06 -0.25 -0.31
Frame 8:   2.78 -1.99  1.02 -1.49  0.38 -0.45 -0.65  1.07 -0.72 -0.01
Frame 9:  -0.13 -1.31  0.87 -0.06  0.66 -0.22  0.97  0.81 -0.02 -0.08
Frame 10: -1.45 -2.90 -0.76 -0.68  0.08 -1.47 -1.15 -0.70 -0.34 -0.10
Frame 11: -1.74 -3.25 -2.66 -3.63 -2.34 -0.92 -0.94 -0.53 -0.70 -0.09
Frame 12: -2.52 -4.58 -4.47 -3.69 -2.42 -2.33 -2.02 -1.13 -0.62 -0.10
Frame 13: -2.78 -4.77 -4.99 -4.20 -2.60 -2.08 -1.22 -0.39 -0.12 -0.02
Frame 14: -1.38 -2.59 -1.27 -1.23 -0.28 -0.51 -0.21  0.71 -0.01  0.09

Table 32: The LPC cepstrum features of the training pattern shown in Figure 15.

           LPC cepstrum features
Frame 1:  -1.78 -2.67 -0.07  1.05  4.25  1.53  0.31 -0.98 -0.80 -0.49
Frame 2:  -1.66 -2.75 -1.27 -0.67  1.41 -0.11 -0.99 -1.63 -1.19 -0.41
Frame 3:   0.00 -1.53 -0.20  1.84  1.28  1.06 -0.06 -0.61 -0.54 -0.28
Frame 4:   2.14 -1.87  1.71 -0.31  1.10 -0.36 -0.58 -0.13 -0.85  0.23
Frame 5:   2.92 -3.44  2.68 -1.92  2.24 -1.94  0.42  0.06 -1.03  0.21
Frame 6:   1.96 -2.40  1.32 -1.43  1.08 -1.41 -0.29  0.33 -0.69 -0.16
Frame 7:   2.78 -3.97  2.96 -3.06  1.76 -2.02 -0.03  0.77 -1.11  0.00
Frame 8:   2.76 -3.88  3.45 -4.09  2.22 -2.55  0.31  0.75 -0.98 -0.05
Frame 9:   3.48 -3.25  1.21 -1.40  1.06 -1.89 -0.05  2.55 -2.32  0.31
Frame 10:  2.34 -1.47  0.81 -1.69  0.66 -1.14 -1.13  2.27 -1.36  0.01
Frame 11:  1.50 -1.14  1.37 -0.65  0.38 -0.77 -0.84  1.41 -0.48 -0.04
Frame 12: -1.63 -3.43 -1.43 -1.60 -0.77 -1.75 -1.73 -0.99 -0.61 -0.03
Frame 13: -1.98 -3.06 -1.75 -1.05 -0.79 -0.78 -0.91 -0.63 -0.37 -0.13
Frame 14: -2.63 -4.54 -3.68 -2.79 -1.90 -1.47 -0.61 -0.77 -0.11 -0.01
Frame 15: -3.04 -4.83 -2.71 -1.51 -0.44 -0.82 -0.12 -0.19 -0.17 -0.09
Frame 16: -3.23 -5.98 -4.86 -3.11 -2.30 -2.12 -2.10 -0.90 -0.36 -0.10
Frame 17: -3.25 -7.22 -6.11 -4.81 -1.61 -1.79 -1.52 -1.20 -0.51 -0.04
Frame 18: -2.82 -6.48 -5.09 -4.52 -2.47 -2.41 -1.83 -0.53 -0.20  0.03
Frame 19: -0.87 -3.37 -1.34 -2.73 -0.61 -1.17  0.05  0.12 -0.30 -0.03
Frame 20:  0.68  0.03  0.79 -1.08 -1.35  0.10 -1.16  1.09 -0.30  0.07

Table 33: The LPC cepstrum features of the test pattern shown in Figure 16.

Our approach deals with these words by using the importance weights. With this mechanism, each frame of a spoken word has its own importance factor, and two similarly uttered words are separated by using the importance weights to enlarge the difference between them. As a result, our approach can correctly classify an unknown word for which DTW fails, and its recognition rate is higher than that of DTW (87% vs. 80%). Also, due to its learning capability, our approach can automatically combine similar patterns, so less memory space is needed to store training patterns. Moreover, training patterns do not need to be prefiltered by human experts.

The memory space used by our approach is less than that used by DTW. This is because words whose utterances differ only slightly from each other are combined to a high degree: similar training patterns of a word can be merged without interfering with other words, so fewer reference patterns need to be stored.

Some words even require only a single reference pattern.

Furthermore, for DTW it is necessary to preselect the training patterns in order to obtain a smaller system with the same recognition rate. In practice, such prefiltering is difficult for human experts. Our neural approach does not have this disadvantage.

Therefore, our approach is better than DTW.
