Chapter 2. Methods and Materials
2.3. Speech recognition
2.3.1. Feature extraction
The process of feature extraction is as follows:
Segment speech into a series of frames
Apply pre-emphasis and a Hamming window to each frame
Compute the auto-correlation coefficients
Compute the linear predictive coefficients
Compute the cepstrum
Compute the delta-cepstrum
Speech features obtained (cepstrum + delta-cepstrum)
Figure 2.5 Feature extraction steps
1. Frame blocking:
The time duration of each frame is about 20-30 ms. If the frame duration is too large, we cannot capture the time-varying characteristics of the speech signal, and frame blocking loses its purpose; if the frame duration is too small, we cannot extract valid speech features, and the computation time grows. Usually the overlap is 1/2 to 2/3 of the frame [9][10]; the more overlap, the more computation is needed. In our example, the speech sampling rate is 10000 samples/sec and the frame size is 240 sample points, so the frame duration is 240/10000 = 24 ms, within the 20-30 ms range. Consecutive frames are shifted by 80 sample points, giving an overlap of 160 points, i.e., 2/3 of the frame size, within the 1/2-2/3 range.
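As an illustration of this step, here is a minimal framing sketch in Python/NumPy (our own code, not the thesis's LabVIEW implementation; the 240-point frame and 80-point shift follow the numbers above):

```python
import numpy as np

def frame_blocking(signal, frame_size=240, hop=80):
    """Split a 1-D speech signal into overlapping frames.

    At 10000 samples/sec, 240-point frames last 24 ms; an 80-point
    shift between frames leaves a 160-point (2/3) overlap.
    """
    n_frames = 1 + (len(signal) - frame_size) // hop
    return np.stack([signal[i * hop: i * hop + frame_size]
                     for i in range(n_frames)])

frames = frame_blocking(np.random.randn(10000))  # -> shape (123, 240)
```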
2. Pre-emphasis:
The goal of pre-emphasis is to compensate for the suppression of high frequencies by the human sound-production mechanism, by boosting the higher frequencies. We use a pre-emphasis coefficient of 0.95. The formula of pre-emphasis is:

S(n) = X(n), for n = 0
S(n) = X(n) − 0.95 X(n−1), for n ≥ 1   (2-2)
3. Hamming window:
In order to concentrate the energy of each frame, the frame is multiplied by a Hamming window. The formula of the Hamming window is:

w(n) = 0.54 − 0.46 cos(2πn / (N−1)), 0 ≤ n ≤ N−1   (2-3)

We show a sine wave, and the same sine wave multiplied by a Hamming window, in Fig. 2.6.
Figure 2.6 Sine wave and sine wave after Hamming window
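A short sketch of steps 2 and 3 together (our own illustration; the coefficient 0.95 follows formula (2-2) above):

```python
import numpy as np

def pre_emphasis(frame, a=0.95):
    """S(n) = X(n) - a*X(n-1) for n >= 1, S(0) = X(0)  (Eq. 2-2)."""
    out = frame.astype(float).copy()
    out[1:] -= a * frame[:-1]
    return out

def hamming_window(N):
    """w(n) = 0.54 - 0.46*cos(2*pi*n/(N-1)), 0 <= n <= N-1  (Eq. 2-3)."""
    n = np.arange(N)
    return 0.54 - 0.46 * np.cos(2 * np.pi * n / (N - 1))

frame = np.random.randn(240)              # one 24 ms frame
windowed = pre_emphasis(frame) * hamming_window(240)
```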
4. Linear predictive coefficient (LPC)
LPC is a very important speech feature and the first feature we obtain. The main idea of LPC is that a speech sample can be predicted by a linear combination of the previous p samples:

S'(n) = A1 S(n−1) + A2 S(n−2) + … + Ap S(n−p)   (2-4)

where S'(n) is the predicted sample, S(n−1), …, S(n−p) are the p samples preceding S(n), and A1, …, Ap are the coefficients of the linear combination. The difference between the real signal and the predicted signal is the prediction error; the coefficients that minimize this error are the linear predictive coefficients. The total squared error is:
E = ∑n e²(n) = ∑n [ y(n) − ∑k=1..p ak y(n−k) ]²   (2-5)
First, we compute the auto-correlation; R(k) is the auto-correlation coefficient. We set the order of the linear predictive coefficients to 12; speech recognition usually uses orders 8 to 14. The auto-correlation is:
R(k) = ∑n=0..N−1−k X(n) X(n+k)
     = X(0)X(k) + X(1)X(k+1) + … + X(N−1−k)X(N−1), 0 ≤ k ≤ N−1   (2-6)
Since we use 12 linear predictive coefficients, we compute R(0) to R(12).
After obtaining the auto-correlation coefficients, we use Durbin's algorithm to solve the resulting inverse-matrix problem. After the following five steps, we obtain the LPC:

Step 1: E(0) = R(0)
Step 2: ki = [ R(i) − ∑j=1..i−1 aj(i−1) R(i−j) ] / E(i−1)
Step 3: ai(i) = ki
Step 4: aj(i) = aj(i−1) − ki ai−j(i−1), 1 ≤ j ≤ i−1
Step 5: E(i) = (1 − ki²) E(i−1)

Steps 2-5 are repeated for i = 1, …, p; the final coefficients are aj = aj(p).
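The following sketch implements the auto-correlation of (2-6) and the five steps of Durbin's algorithm (a Python illustration, not the thesis's LabVIEW program):

```python
import numpy as np

def autocorr(x, order=12):
    """R(k) = sum_n x(n) x(n+k) for k = 0..order  (Eq. 2-6)."""
    N = len(x)
    return np.array([np.dot(x[:N - k], x[k:]) for k in range(order + 1)])

def levinson_durbin(R, order=12):
    """Durbin's five-step recursion: LPC a_1..a_p from R(0)..R(p)."""
    a = np.zeros(order + 1)            # a[j] holds a_j; a[0] unused
    E = R[0]                           # Step 1: E(0) = R(0)
    for i in range(1, order + 1):
        # Step 2: k_i = (R(i) - sum_j a_j R(i-j)) / E(i-1)
        k = (R[i] - np.dot(a[1:i], R[i - 1:0:-1])) / E
        a_new = a.copy()
        a_new[i] = k                   # Step 3: a_i = k_i
        for j in range(1, i):          # Step 4: a_j -= k_i * a_{i-j}
            a_new[j] = a[j] - k * a[i - j]
        a = a_new
        E = (1 - k * k) * E            # Step 5: E(i) = (1 - k_i^2) E(i-1)
    return a[1:], E

lpc, err = levinson_durbin(autocorr(np.random.randn(240)))
```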
5. Cepstrum
After obtaining the LPC, we use a recursion on the LPC to obtain the cepstrum; this avoids more complex computation. Formula (2-7) shows how to obtain the cepstral features:

C1 = a1
Cn = an + ∑m=1..n−1 (m/n) Cm an−m, 1 < n ≤ p
Cn = ∑m=n−p..n−1 (m/n) Cm an−m, n > p   (2-7)

The main advantages of the cepstrum are that it represents the spectral peaks and varies less across speech features. A cepstrum is the result of taking the Fourier transform of the decibel spectrum as if it were a signal. The cepstrum is a better feature for speech recognition than the LPC.
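A sketch of the recursion (2-7), again as our own Python illustration:

```python
import numpy as np

def lpc_to_cepstrum(a, n_ceps=14):
    """Cepstral coefficients C_1..C_{n_ceps} from LPC a_1..a_p (Eq. 2-7)."""
    p = len(a)
    c = np.zeros(n_ceps + 1)           # c[n] holds C_n; c[0] unused
    for n in range(1, n_ceps + 1):
        acc = sum((m / n) * c[m] * a[n - m - 1]
                  for m in range(max(1, n - p), n))
        c[n] = acc if n > p else a[n - 1] + acc
    return c[1:]

ceps = lpc_to_cepstrum(np.random.randn(12))
```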
6. Delta-cepstrum
The delta-cepstrum is obtained by taking the partial difference of the cepstrum over time; it has the ability to resist noise. In our program, we chose 14th-order cepstrum and delta-cepstrum as the feature vector. In the implementation, we used the following formula to obtain the delta-cepstrum parameters, where the frames are numbered 0 to L−1 and C(m, n) is the nth cepstral coefficient of the mth frame:

For frame No. 0: …
For the other frames: …   (2-8)
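Since the body of formula (2-8) is not reproduced above, the sketch below assumes a simple first-difference form with a special case for frame No. 0, matching the two cases in the text; treat it as an assumption, not the thesis's exact formula:

```python
import numpy as np

def delta_cepstrum(C):
    """First-difference delta-cepstrum (an assumed form of Eq. 2-8).

    C: array of shape (L, n_ceps), C[m, n] = nth cepstrum of frame m.
    Frame No. 0 uses the forward difference; the other frames use
    the backward difference, per the two cases in the text.
    """
    D = np.empty_like(C)
    D[0] = C[1] - C[0]      # for frame No. 0
    D[1:] = C[1:] - C[:-1]  # for the other frames
    return D
```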
2.3.2. Training model (constructing the speech model)
After obtaining the speech features, we construct the speech models. First, we discuss the nature of states. A state captures a distinct phase of the mouth's articulation. For example, the word forward in Mandarin is pronounced 'ch'/'yi'/'an', so some frames must belong to 'ch', some to 'yi' and the remaining frames to 'an'. If we construct three states, the first state holds the 'ch' frames, the second the 'yi' frames and the third the 'an' frames. We could also construct two, six or some other number of states. Figure 2.7 shows the signal of forward in Mandarin, composed of 'ch', 'yi' and 'an'.
Figure 2.7 Signal intensity graph in time-domain of forward in Mandarin
When we start to construct a model, we have to initialize the relation between states and frames. Mean cut is a simple way to do this: the frames are divided evenly among the states. We illustrate mean cut with three states in Fig. 2.8.
Figure 2.8 Mean cut frames to states
Hidden Markov Models (HMM) are recognition models based on probability and statistics. The most salient characteristic of an HMM is that it uses two probability distributions to describe the variation of the speech signal: the state transition probability and the state observation probability. The HMM is the most important component of model training. The steps of training a model with an HMM are as follows:
Mean cut to distribute frames to states
Construct the speech model (mean and covariance matrix)
Refresh the relation of states and frames
Does the total probability converge? If no, return to the previous step; if yes:
Final speech model
Figure 2.9 Training model steps
The restrictions on the relation of frames and states are listed below:
1. Each state has at least one frame.
2. The assignment of frames to states cannot go backward in time. For example, if frame No. 10 belongs to state 2, frame No. 15 cannot belong to state 1.
At the step of computing the mean and covariance matrix, we need to know how similar each frame is to each state. We use the Gaussian probability density to compute the degree of similarity between a frame and a state:

D = (2π)^(−N/2) |Ri|^(−1/2) exp( −(1/2) (τT − τR)ᵀ Ri⁻¹ (τT − τR) )   (2-9)

N: feature vector dimension
τT: feature vector of the training speech data
τR: the ith mixture expectation (mean) of a state in the model
Ri: the ith mixture covariance matrix in the model
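A direct transcription of (2-9) as a sketch (NumPy; the function name is ours):

```python
import numpy as np

def gaussian_similarity(t_T, t_R, R_i):
    """Gaussian density of Eq. (2-9): similarity between the frame
    feature vector t_T and a state with mean t_R, covariance R_i."""
    N = len(t_T)
    d = t_T - t_R
    norm = (2 * np.pi) ** (-N / 2) / np.sqrt(np.linalg.det(R_i))
    return norm * np.exp(-0.5 * d @ np.linalg.solve(R_i, d))

score = gaussian_similarity(np.zeros(14), np.zeros(14), np.eye(14))
```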
At the step of refreshing the relations between states and frames, we use the Viterbi algorithm, a well-known algorithm implemented by dynamic programming (DP). Its purpose is to distribute frames to states optimally so that the total probability converges to its maximum value. The path constraint of the Viterbi algorithm is illustrated in Fig. 2.10, and its solution is the recursion in formula (2-10).
Figure 2.10 The path constraint of the Viterbi algorithm: node D(i, j) can be reached only from D(i−1, j) or D(i−1, j−1)
D(i, j) = B(Oi, j) · max[ D(i−1, j) A(j, j), D(i−1, j−1) A(j−1, j) ]   (2-10)

After the above processing, the relation of frames and states is no longer the mean cut. It may look as follows:
Figure 2.11 Final relations of frames and states
The first state has frames No. 1 to 3, the second state has frames No. 4 to 9, and the third state has the remaining frames.
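A sketch of the recursion (2-10) under the path constraint of Fig. 2.10, computed in the log domain to avoid underflow (a common implementation choice, not necessarily the thesis's):

```python
import numpy as np

def viterbi_left_to_right(B, A):
    """Left-to-right Viterbi per Eq. (2-10).

    B: (T, S) log observation probabilities B(O_i, j)
    A: (S, S) log transition probabilities; only A[j, j] and
       A[j-1, j] are used under the path constraint of Fig. 2.10.
    Returns the state assigned to each frame.
    """
    T, S = B.shape
    D = np.full((T, S), -np.inf)
    back = np.zeros((T, S), dtype=int)
    D[0, 0] = B[0, 0]                      # first frame starts in state 0
    for i in range(1, T):
        for j in range(S):
            stay = D[i - 1, j] + A[j, j]
            move = D[i - 1, j - 1] + A[j - 1, j] if j > 0 else -np.inf
            D[i, j] = B[i, j] + max(stay, move)
            back[i, j] = j if stay >= move else j - 1
    path = [S - 1]                         # last frame ends in last state
    for i in range(T - 1, 0, -1):
        path.append(back[i, path[-1]])
    return path[::-1]
```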
The HMM is a useful tool; we have described it only roughly here, and more detail can be found in the literature.
2.3.3. Data comparison
After the above processing (feature extraction and model training), the data comparison operation follows. The process of data comparison is as follows:
Command an instruction
Extract the speech features
Match the features against the speech models
Recognition outcome (the maximum score)
Figure 2.12 Data comparison steps
A user speaks an instruction to our controller, and the controller performs feature extraction. The instruction's feature vector is then scored against the speech models. In our example, we constructed five models: forward, backward, left, right and stop. Scoring against the five models produces five scores, and the maximum score determines the outcome.
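A minimal sketch of this comparison step (the per-model scoring functions are hypothetical stand-ins for the trained HMMs, e.g. wrappers around the Viterbi score):

```python
def recognize(features, models):
    """Pick the instruction whose model gives the maximum score."""
    scores = {name: score(features) for name, score in models.items()}
    return max(scores, key=scores.get)

# Usage with five hypothetical scoring functions:
# result = recognize(feat_seq, {"forward": m1, "backward": m2,
#                               "left": m3, "right": m4, "stop": m5})
```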
2.4. System architecture
The architecture of this system includes the wheelchair module, joystick control module, system control module, speech analysis module and bio-signal (vocal cords) control module; it is presented in Fig. 2.13. When we say forward, the vocal cords vibrate and the signal is transmitted to the cRIO, which holds our data acquisition and speech recognition programs. After the cRIO analyzes the data, the outcome is generated and the wheelchair is instructed where to go or to stop.
Wheelchair module
Joystick module
System control module
Speech analysis module
Bio-signal control module
Figure 2.13 System architecture
When we say an instruction like forward in Mandarin, or operate the joystick, the signal is transmitted to the cRIO, which receives it and analyzes it with the speech program. When the analysis is finished, the cRIO outputs the outcome and the control signal to the wheelchair. Figure 2.14 shows the whole operating process.
Figure 2.14 Operating processes of our system
Focusing on the controller, we present a diagram to explain how the controller operates internally, as follows.
Figure 2.15 Operating process of controller
System control module: 400 MHz RT processor, rugged, customized I/O; analog input NI 9215
Joystick module: the primary control interface; P.G. (England) control system
Speech analysis module: data acquisition, speech recognition, control signal input and output, multivariate analysis, analysis outcome
Bio-signal control module: easy to use; vocal cord vibration
Chapter 3: Experiment and Result
3.1. Experiment background
We used LabVIEW software as our development tool. The software used in our research is listed in Table 3.1.

Table 3.1 Experiment software and versions
Software
NI LabVIEW Core Software 8.6
NI LabVIEW FPGA Module 8.6
NI LabVIEW Real-Time Module 8.6
NI-RIO 3.0.0
CompactRIO 3.0.1
The hardware equipment is listed in Table 3.2.

Table 3.2 Experiment hardware equipment and model numbers
Hardware
NI cRIO-9014
NI cRIO-9103
NI 9474
NI 9215
Nita EXB Standup Wheelchair
Carol Handheld Cardioid Dynamic Microphone GS-55
Our speech recognition settings are as follows [10]:

Table 3.3 Speech recognition settings (the listed training sets are used to train the models, with the remaining 1 set as testing data)

We recorded the instructions as speech in Mandarin, English and Fukienese, and as vocal cord vibration in Mandarin. Signal intensity graphs of the recordings are shown in the following figures.
Figure 3.9 Signal intensity graph of vocal cords vibration: right in Mandarin
Figure 3.10 Signal intensity graph of vocal cords vibration: stop in Mandarin
Figure 3.11 Signal intensity graph of speech: forward in English
Figure 3.12 Signal intensity graph of speech: backward in English
Figure 3.13 Signal intensity graph of speech: left in English
Figure 3.14 Signal intensity graph of speech: right in English
Figure 3.15 Signal intensity graph of speech: stop in English
Figure 3.16 Signal intensity graph of speech: forward in Fukienese
Figure 3.17 Signal intensity graph of speech: backward in Fukienese
Figure 3.18 Signal intensity graph of speech: left in Fukienese
Figure 3.19 Signal intensity graph of speech: right in Fukienese
Figure 3.20 Signal intensity graph of speech: stop in Fukienese
3.2. Speech recognition result
At first, our speech recognition focused on Mandarin instructions. We used 2, 3 and 6 states to train the model. The correct rates were as follows:
Figure 3.21 Speech in Mandarin recognition correct rate
The correct rate in this situation is good: at least 80%. The figure shows that with 3 training sets, the correct rate is at least 95% whatever number of states is chosen. The more training sets and states, the more time it takes to train the speech model. Considering program execution time and correct rate together, we think 3 training sets is the better choice: the execution time is short and the correct rate is acceptable to users.
After finishing Mandarin speech recognition, we became interested in recognizing vocal cord vibration in Mandarin, so we recorded 25 instruction sets for experiments like those described above. The correct rate results are shown in Fig. 3.22:
Figure 3.22 Correct rates of vocal cord vibration recognition
The vocal cord recognition rate was not good enough for us: whatever training sets and states were chosen, the correct rate stayed under 90%, worse than Mandarin speech recognition.
We also recorded speech in English and Fukienese (see Figs. 3.23 and 3.24).
Figure 3.23 Correct rates of speech in English recognition
Figure 3.24 Correct rates of speech in Fukienese recognition
Speech in English is interesting: 2 states and 3 states perform better than 6 states. Overall, though, the correct rate is not very good. There are many possible reasons: recording error, pronunciation error, and even the possibility that the recognition algorithm is unsuitable for English speech; answering this will require future experiments.
Speech in Fukienese also shows a good correct rate between 3 and 20 training sets. In particular, the 3-state case stands out in the results obtained.
We then focused on vocal cord vibration recognition, collecting vocal cord vibration data from 10 users, half male and half female, and recognizing it. The average age of the males was 24 years (standard deviation 0.4); the average age of the females was also 24 (standard deviation 1.7). The number of training sets ranged from 1 to 5, with 2 or 3 states. The correct rates are shown below:
Figure 3.25 Recognition correct rates of five boys (Ba-Be) with 2-state training (vocal cords, Mandarin; correct rate vs. number of training sets)
Figure 3.26 Recognition correct rates of five girls (Ga-Ge) with 2-state training (vocal cords, Mandarin; correct rate vs. number of training sets)
Figure 3.27 Recognition correct rates of five boys (Ba-Be) with 3-state training (vocal cords, Mandarin; correct rate vs. number of training sets)
Figure 3.28 Recognition correct rates of five girls (Ga-Ge) with 3-state training (vocal cords, Mandarin; correct rate vs. number of training sets)
Some users obtained worse recognition rates with this system. There are several possible reasons: they were using the system for the first time and were not yet familiar with it, and the algorithm may not be robust enough, requiring stronger vocal cord vibration.
3.3. Mechanical and electrical integration
Attaching the controller to the wheelchair is a fairly arduous task requiring knowledge of how to drive a motor from a control signal. Our controller can carry at most 1 A of current, while the current through the motor is about 2-3 A. At first we routed the motor current through the controller; this did not work, and the controller shut itself off in protection. We solved the problem with relays: four relays to control the motors and four to control the brakes. To rotate a motor, we must first power on the brake relay; only when the brake is released can the wheel rotate. Motor control uses four relays: one for right wheel motor rotation, one for right wheel motor reverse rotation, one for left wheel motor rotation and one for left wheel motor reverse rotation. Brake control mirrors motor control. When the controller commands the wheelchair to operate, e.g. forward, it outputs the corresponding signals. Table 3.4 shows the relations between instruction and motor/brake control.
Table 3.4 Relations between instruction and motor/brake control

Instruction | Motor and brake control
Forward     | Right and left wheel motors rotate; both brakes power on
Backward    | Right and left wheel motors reverse rotate; both brakes power on
Left        | Right wheel motor rotates, left wheel motor reverse rotates; both brakes power on
Right       | Right wheel motor reverse rotates, left wheel motor rotates; both brakes power on
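A sketch of Table 3.4 as relay on/off states (the relay names and the pairing of each brake relay with its motor relay are our assumptions about the wiring, not the thesis's actual channel assignment on the NI 9474):

```python
MOTOR_RELAYS = ("R_fwd", "R_rev", "L_fwd", "L_rev")
BRAKE_RELAYS = ("bR_fwd", "bR_rev", "bL_fwd", "bL_rev")

COMMANDS = {                          # energized motor relays per instruction
    "forward":  {"R_fwd", "L_fwd"},
    "backward": {"R_rev", "L_rev"},
    "left":     {"R_fwd", "L_rev"},   # right wheel rotates, left reverses
    "right":    {"R_rev", "L_fwd"},
    "stop":     set(),                # all relays off: brakes engage
}

def outputs(instruction):
    """On/off state for all eight relays; each brake relay is powered
    together with its motor relay so the brake releases before the
    wheel turns."""
    on = COMMANDS[instruction]
    state = {r: r in on for r in MOTOR_RELAYS}
    state.update({b: m in on for b, m in zip(BRAKE_RELAYS, MOTOR_RELAYS)})
    return state
```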
Figure 3.29 shows the physical, mechanical and electrical integration of the controller and the wheelchair.
Figure 3.29: Physical, mechanical and electrical integration
In Figure 3.29, the number 1 (marked in red and underlined) is the NI CompactRIO controller, and the number 2 (also marked in red and underlined) is the relays, which act as switches between the controller and the motors.
Figure 3.30 shows the overall physical system architecture, including the wheelchair, microphone, notebook and controller.
Figure 3.30: Overall physical system architecture
3.4. Control rule
When we command an instruction like left, how should the wheelchair run? At what angle should it turn, or for how long should the motor run? Anything concerning user safety could lead to serious problems, so at first we ran a conservative test, as Table 3.5 shows:
Table 3.5 Conservative control rule
Instruction Motor rotation time
Forward 1 sec
Backward 0.2 sec
Left 0.3 sec
Right 0.3 sec
This rule ensured safety, although it was not practical for the user. Still, for us it was inspiring to drive a wheelchair by speech. To better match real-world use, we designed another method.
Table 3.6 Suitable control rules for present situation
Instruction Motor rotation time
Forward Rotate until any signal input
Backward Rotate until any signal input
Left 0.3 sec
Right 0.3 sec
Many research teams have done in-depth research on control rules aimed at user satisfaction. The latter method is more practical than the former: thanks to the controller's FPGA module, data acquisition is rapid, and the wheelchair moves or stops depending on the signal input, as sketched below.
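A sketch of the rule in Table 3.6 (the timing values come from the table; the I/O callbacks are hypothetical stand-ins for the cRIO FPGA interface):

```python
import time

TURN_TIME = 0.3  # seconds for left/right, per Table 3.6

def run_instruction(instruction, drive, stop, signal_arrived):
    """drive(instruction) starts the motors, stop() halts them, and
    signal_arrived() polls for any new input (all hypothetical)."""
    drive(instruction)
    if instruction in ("forward", "backward"):
        while not signal_arrived():   # rotate until any signal input
            time.sleep(0.01)
    else:                             # left / right
        time.sleep(TURN_TIME)
    stop()
```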
Chapter 4: Discussion and Conclusion
4.1. Discussion
We have employed speech and vocal cord vibration to implement a type of mouth command control wheelchair. Below, we discuss some interesting problems.
4.1.1. Parameter setting in speech recognition
How can we improve the correct rate of speech recognition? Our goal is a correct rate of 95% with 3 training sets, whether for speech signals or vocal cord vibration signals. We discuss the speech recognition parameters as follows:
1. Frame size:
A frame usually occupies 20~30 ms; we set 240 samples as a frame (sampling rate: 10000 sample/sec), occupying 24 ms; we tried 200 or 300 samples as a frame, and observed the correct rate.
2. Frame overlap:
Usually, the frame overlap occupies 1/2 to 2/3 of the frame size; we set the frame overlap at 160 samples (frame size: 240 points), i.e., 2/3 of the frame size. We tried setting the overlap to 1/2 of the frame size, or to less than 2/3, and observed the correct rate.
4.1.2. Customized bio-signal acquiring module
In our present research, we have developed speech and vocal cord vibration recognition to control a wheelchair, although vocal cord vibration recognition is not as accurate as speech recognition. A bio-signal acquisition module should satisfy users' needs and the situations they face. If possible, we will try to develop a new method: breathing vibration as a bio-signal acquisition method. We list bio-signal acquisition methods and tools in Table 4.1.
Table 4.1 Bio-signal control interfaces and tools

Bio-signal            | Tool
Speech recognition    | Dynamic microphone, condenser microphone
Vocal cords vibration | Dynamic microphone, condenser microphone, accelerometer
Breathing vibration   | Piezoelectric materials, accelerometer
We have developed speech recognition and vocal cord vibration detection and transmission using a dynamic microphone, a possibly suitable tool for each of these bio-signal methods.
4.1.3. Human system
A mature speech recognition system is one in which the user can train the model with any spoken sentence, since it is difficult for some users to pronounce particular words or sentences. If the system could achieve this standard, it would be very helpful to users. Also, if a user commands an instruction that does not belong to any speech model, the system should simply reject it. Some recent systems have devoted efforts to achieving these objectives.
As mentioned in Chapter 3, we employed three languages, separately, to test speech recognition. There were good recognition rates in Mandarin and Fukienese, but that for English was not as good.
4.2. Conclusion
We began by looking at real cases of paraplegia patients, which gave us the idea of a vocal cord controller. In realizing this idea, we met many problems.
First, we developed a PC-based speech recognition system, after which we embedded the program into the NI cRIO. The controller's FPGA module handled the user input signal and the output signal to the wheelchair, while its real-time processor ran the speech recognition algorithm. Once the program was embedded in the cRIO, what remained to be done was the mechanical and electrical integration.
We integrated the controller into the wheelchair; the wheelchair's battery supplied power to the controller, and the motors responded to the control signals.
With our background in computer science, perfecting the system required cooperation with electrical integration engineers and with physiotherapists who could explain users' actual needs.
If you want to know more about our system, you can visit our website, where we have uploaded a demo video. The hyperlink is as follows: