Chapter 2. Methods and Materials
2.3. Speech recognition
2.3.1. Feature extraction
The process of feature extraction is as follows:
Segment speech into a series of frames
Apply pre-emphasis and a Hamming window to each frame
Compute the auto-correlation coefficients
Compute the linear predictive coefficients
Compute the cepstrum
Compute the delta-cepstrum
Speech features obtained (cepstrum + delta-cepstrum)
Figure 2.5 Feature extraction steps
1. Frame blocking:
The time duration of each frame is about 20-30 ms. If the frame duration is too large, we cannot capture the time-varying characteristics of the speech signal, and frame blocking loses its purpose; if the frame duration is too small, we cannot extract valid speech features, and the computation time grows. Usually the overlap is 1/2 to 2/3 of the frame [9][10]; the more overlap, the more computation is needed. In our example, the speech sampling rate is 10000 samples/sec and the frame size is 240 sample points, so the frame duration is 240/10000 = 24 ms, within the 20-30 ms range. Consecutive frames are shifted by 80 sample points, giving an overlap of 160 points, i.e., 2/3 of the frame size, within the 1/2-2/3 range.
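As an illustration of this step, here is a minimal framing sketch in Python/NumPy (our own code, not the thesis's LabVIEW implementation; the 240-point frame and 80-point shift follow the numbers above):

```python
import numpy as np

def frame_blocking(signal, frame_size=240, hop=80):
    """Split a 1-D speech signal into overlapping frames.

    At 10000 samples/sec, 240-point frames last 24 ms; an 80-point
    shift between frames leaves a 160-point (2/3) overlap.
    """
    n_frames = 1 + (len(signal) - frame_size) // hop
    return np.stack([signal[i * hop: i * hop + frame_size]
                     for i in range(n_frames)])

frames = frame_blocking(np.random.randn(10000))  # -> shape (123, 240)
```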
2. Pre-emphasis:
The goal of pre-emphasis is to compensate for the suppression of high frequencies by the human sound-production mechanism, by boosting the higher frequencies. We use a pre-emphasis coefficient of 0.95. The formula of pre-emphasis is:

S(n) = X(n), for n = 0
S(n) = X(n) − 0.95 X(n−1), for n ≥ 1   (2-2)
3. Hamming window:
In order to concentrate the energy of each frame, the frame is multiplied by a Hamming window. The formula of the Hamming window is:

w(n) = 0.54 − 0.46 cos(2πn / (N−1)), 0 ≤ n ≤ N−1   (2-3)

We show a sine wave, and the same sine wave multiplied by a Hamming window, in Fig. 2.6.
Figure 2.6 Sine wave and sine wave after Hamming window
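A short sketch of steps 2 and 3 together (our own illustration; the coefficient 0.95 follows formula (2-2) above):

```python
import numpy as np

def pre_emphasis(frame, a=0.95):
    """S(n) = X(n) - a*X(n-1) for n >= 1, S(0) = X(0)  (Eq. 2-2)."""
    out = frame.astype(float).copy()
    out[1:] -= a * frame[:-1]
    return out

def hamming_window(N):
    """w(n) = 0.54 - 0.46*cos(2*pi*n/(N-1)), 0 <= n <= N-1  (Eq. 2-3)."""
    n = np.arange(N)
    return 0.54 - 0.46 * np.cos(2 * np.pi * n / (N - 1))

frame = np.random.randn(240)              # one 24 ms frame
windowed = pre_emphasis(frame) * hamming_window(240)
```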
4. Linear predictive coefficient (LPC)
LPC is a very important speech feature and the first feature we obtain. The main idea of LPC is that a speech sample can be predicted by a linear combination of the previous p samples:

S'(n) = A1 S(n−1) + A2 S(n−2) + … + Ap S(n−p)   (2-4)

where S'(n) is the predicted sample, S(n−1), …, S(n−p) are the p samples preceding S(n), and A1, …, Ap are the coefficients of the linear combination. The difference between the real signal and the predicted signal is the prediction error; the coefficients that minimize this error are the linear predictive coefficients. The total squared error is:
E = ∑n e²(n) = ∑n [ y(n) − ∑k=1..p ak y(n−k) ]²   (2-5)
First, we compute the auto-correlation; R(k) is the auto-correlation coefficient. We set the order of the linear predictive coefficients to 12; speech recognition usually uses orders 8 to 14. The auto-correlation is:
R(k) = ∑n=0..N−1−k X(n) X(n+k)
     = X(0)X(k) + X(1)X(k+1) + … + X(N−1−k)X(N−1), 0 ≤ k ≤ N−1   (2-6)
Since we use 12 linear predictive coefficients, we compute R(0) to R(12).
After obtaining the auto-correlation coefficients, we use Durbin's algorithm to solve the resulting inverse-matrix problem. After the following five steps, we obtain the LPC:

Step 1: E(0) = R(0)
Step 2: ki = [ R(i) − ∑j=1..i−1 aj(i−1) R(i−j) ] / E(i−1)
Step 3: ai(i) = ki
Step 4: aj(i) = aj(i−1) − ki ai−j(i−1), 1 ≤ j ≤ i−1
Step 5: E(i) = (1 − ki²) E(i−1)

Steps 2-5 are repeated for i = 1, …, p; the final coefficients are aj = aj(p).
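The following sketch implements the auto-correlation of (2-6) and the five steps of Durbin's algorithm (a Python illustration, not the thesis's LabVIEW program):

```python
import numpy as np

def autocorr(x, order=12):
    """R(k) = sum_n x(n) x(n+k) for k = 0..order  (Eq. 2-6)."""
    N = len(x)
    return np.array([np.dot(x[:N - k], x[k:]) for k in range(order + 1)])

def levinson_durbin(R, order=12):
    """Durbin's five-step recursion: LPC a_1..a_p from R(0)..R(p)."""
    a = np.zeros(order + 1)            # a[j] holds a_j; a[0] unused
    E = R[0]                           # Step 1: E(0) = R(0)
    for i in range(1, order + 1):
        # Step 2: k_i = (R(i) - sum_j a_j R(i-j)) / E(i-1)
        k = (R[i] - np.dot(a[1:i], R[i - 1:0:-1])) / E
        a_new = a.copy()
        a_new[i] = k                   # Step 3: a_i = k_i
        for j in range(1, i):          # Step 4: a_j -= k_i * a_{i-j}
            a_new[j] = a[j] - k * a[i - j]
        a = a_new
        E = (1 - k * k) * E            # Step 5: E(i) = (1 - k_i^2) E(i-1)
    return a[1:], E

lpc, err = levinson_durbin(autocorr(np.random.randn(240)))
```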
5. Cepstrum
After obtaining the LPC, we use a recursion on the LPC to obtain the cepstrum; this avoids more complex computation. Formula (2-7) shows how to obtain the cepstral features:

C1 = a1
Cn = an + ∑m=1..n−1 (m/n) Cm an−m, 1 < n ≤ p
Cn = ∑m=n−p..n−1 (m/n) Cm an−m, n > p   (2-7)

The main advantages of the cepstrum are that it represents the spectral peaks and varies less across speech features. A cepstrum is the result of taking the Fourier transform of the decibel spectrum as if it were a signal. The cepstrum is a better feature for speech recognition than the LPC.
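A sketch of the recursion (2-7), again as our own Python illustration:

```python
import numpy as np

def lpc_to_cepstrum(a, n_ceps=14):
    """Cepstral coefficients C_1..C_{n_ceps} from LPC a_1..a_p (Eq. 2-7)."""
    p = len(a)
    c = np.zeros(n_ceps + 1)           # c[n] holds C_n; c[0] unused
    for n in range(1, n_ceps + 1):
        acc = sum((m / n) * c[m] * a[n - m - 1]
                  for m in range(max(1, n - p), n))
        c[n] = acc if n > p else a[n - 1] + acc
    return c[1:]

ceps = lpc_to_cepstrum(np.random.randn(12))
```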
6. Delta-cepstrum
The delta-cepstrum is obtained by taking the partial difference of the cepstrum over time; it has the ability to resist noise. In our program, we chose 14th-order cepstrum and delta-cepstrum as the feature vector. In the implementation, we used the following formula to obtain the delta-cepstrum parameters, where the frames are numbered 0 to L−1 and C(m, n) is the nth cepstral coefficient of the mth frame:

For frame No. 0: …
For the other frames: …   (2-8)
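Since the body of formula (2-8) is not reproduced above, the sketch below assumes a simple first-difference form with a special case for frame No. 0, matching the two cases in the text; treat it as an assumption, not the thesis's exact formula:

```python
import numpy as np

def delta_cepstrum(C):
    """First-difference delta-cepstrum (an assumed form of Eq. 2-8).

    C: array of shape (L, n_ceps), C[m, n] = nth cepstrum of frame m.
    Frame No. 0 uses the forward difference; the other frames use
    the backward difference, per the two cases in the text.
    """
    D = np.empty_like(C)
    D[0] = C[1] - C[0]      # for frame No. 0
    D[1:] = C[1:] - C[:-1]  # for the other frames
    return D
```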
2.3.2. Training model (constructing the speech model)
After obtaining the speech features, we construct the speech models. First, we discuss the nature of states. A state captures a distinct phase of the mouth's articulation. For example, the word forward in Mandarin is pronounced 'ch'/'yi'/'an', so some frames must belong to 'ch', some to 'yi' and the remaining frames to 'an'. If we construct three states, the first state holds the 'ch' frames, the second the 'yi' frames and the third the 'an' frames. We could also construct two, six or some other number of states. Figure 2.7 shows the signal of forward in Mandarin, composed of 'ch', 'yi' and 'an'.
Figure 2.7 Signal intensity graph in time-domain of forward in Mandarin
When we start to construct a model, we have to initialize the relation between states and frames. Mean cut is a simple way to do this: the frames are divided evenly among the states. We illustrate mean cut with three states in Fig. 2.8.
Figure 2.8 Mean cut frames to states
Hidden Markov Models (HMM) are recognition models based on probability and statistics. The most salient characteristic of an HMM is that it uses two probability distributions to describe the variation of the speech signal: the state transition probability and the state observation probability. The HMM is the most important component of model training. The steps of training a model with an HMM are as follows:
Mean cut to distribute frames to states
Construct the speech model (mean and covariance matrix)
Refresh the relation of states and frames
Does the total probability converge? If no, return to the previous step; if yes:
Final speech model
Figure 2.9 Training model steps
The restrictions on the relation of frames and states are listed below:
1. Each state has at least one frame.
2. The assignment of frames to states cannot go backward in time. For example, if frame No. 10 belongs to state 2, frame No. 15 cannot belong to state 1.
At the step of computing the mean and covariance matrix, we need to know how similar each frame is to each state. We use the Gaussian probability density to compute the degree of similarity between a frame and a state:

D = (2π)^(−N/2) |Ri|^(−1/2) exp( −(1/2) (τT − τR)ᵀ Ri⁻¹ (τT − τR) )   (2-9)

N: feature vector dimension
τT: feature vector of the training speech data
τR: the ith mixture expectation (mean) of a state in the model
Ri: the ith mixture covariance matrix in the model
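A direct transcription of (2-9) as a sketch (NumPy; the function name is ours):

```python
import numpy as np

def gaussian_similarity(t_T, t_R, R_i):
    """Gaussian density of Eq. (2-9): similarity between the frame
    feature vector t_T and a state with mean t_R, covariance R_i."""
    N = len(t_T)
    d = t_T - t_R
    norm = (2 * np.pi) ** (-N / 2) / np.sqrt(np.linalg.det(R_i))
    return norm * np.exp(-0.5 * d @ np.linalg.solve(R_i, d))

score = gaussian_similarity(np.zeros(14), np.zeros(14), np.eye(14))
```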
At the step of refreshing the relations between states and frames, we use the Viterbi algorithm, a well-known algorithm implemented by dynamic programming (DP). Its purpose is to distribute frames to states optimally so that the total probability converges to its maximum value. The path constraint of the Viterbi algorithm is illustrated in Fig. 2.10, and its solution is the recursion in formula (2-10).
Figure 2.10 The path constraint of the Viterbi algorithm: node D(i, j) can be reached only from D(i−1, j) or D(i−1, j−1)
D(i, j) = B(Oi, j) · max[ D(i−1, j) A(j, j), D(i−1, j−1) A(j−1, j) ]   (2-10)

After the above processing, the relation of frames and states is no longer the mean cut. It may look as follows:
Figure 2.11 Final relations of frames and states
The first state has frames No. 1 to 3, the second state has frames No. 4 to 9, and the third state has the remaining frames.
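A sketch of the recursion (2-10) under the path constraint of Fig. 2.10, computed in the log domain to avoid underflow (a common implementation choice, not necessarily the thesis's):

```python
import numpy as np

def viterbi_left_to_right(B, A):
    """Left-to-right Viterbi per Eq. (2-10).

    B: (T, S) log observation probabilities B(O_i, j)
    A: (S, S) log transition probabilities; only A[j, j] and
       A[j-1, j] are used under the path constraint of Fig. 2.10.
    Returns the state assigned to each frame.
    """
    T, S = B.shape
    D = np.full((T, S), -np.inf)
    back = np.zeros((T, S), dtype=int)
    D[0, 0] = B[0, 0]                      # first frame starts in state 0
    for i in range(1, T):
        for j in range(S):
            stay = D[i - 1, j] + A[j, j]
            move = D[i - 1, j - 1] + A[j - 1, j] if j > 0 else -np.inf
            D[i, j] = B[i, j] + max(stay, move)
            back[i, j] = j if stay >= move else j - 1
    path = [S - 1]                         # last frame ends in last state
    for i in range(T - 1, 0, -1):
        path.append(back[i, path[-1]])
    return path[::-1]
```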
The HMM is a useful tool; we have described it only roughly here, and more detail can be found in the literature.
2.3.3. Data comparison
After the above processing (feature extraction and model training), the data comparison operation follows. The process of data comparison is as follows:
Command an instruction
Extract the speech features
Match the features against the speech models
Recognition outcome (the maximum score)
Figure 2.12 Data comparison steps
A user speaks an instruction to our controller, and the controller performs feature extraction. The instruction's feature vector is then scored against the speech models. In our example, we constructed five models: forward, backward, left, right and stop. Scoring against the five models produces five scores, and the maximum score determines the outcome.
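A minimal sketch of this comparison step (the per-model scoring functions are hypothetical stand-ins for the trained HMMs, e.g. wrappers around the Viterbi score):

```python
def recognize(features, models):
    """Pick the instruction whose model gives the maximum score."""
    scores = {name: score(features) for name, score in models.items()}
    return max(scores, key=scores.get)

# Usage with five hypothetical scoring functions:
# result = recognize(feat_seq, {"forward": m1, "backward": m2,
#                               "left": m3, "right": m4, "stop": m5})
```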
2.4. System architecture
The architecture of this system includes the wheelchair module, joystick control module, system control module, speech analysis module and bio-signal (vocal cords) control module; it is presented in Fig. 2.13. When we say forward, the vocal cords vibrate and the signal is transmitted to the cRIO, which holds our data acquisition and speech recognition programs. After the cRIO analyzes the data, the outcome is generated and the wheelchair is instructed where to go or to stop.
Wheelchair module
Joystick module
System control module
Speech analysis module
Bio-signal control module
Figure 2.13 System architecture
When we say an instruction like forward in Mandarin, or operate the joystick, the signal is transmitted to the cRIO, which receives it and analyzes it with the speech program. When the analysis is finished, the cRIO outputs the outcome and the control signal to the wheelchair. Figure 2.14 shows the whole operating process.
Figure 2.14 Operating processes of our system
Focusing on the controller, we present a diagram to explain how the controller operates internally, as follows.
Figure 2.15 Operating process of controller
System control module: 400 MHz RT processor, rugged, customized I/O; analog input NI 9215
Joystick module: the primary control interface; P.G. (England) control system
Speech analysis module: data acquisition, speech recognition, control signal input and output, multivariate analysis, analysis outcome
Bio-signal control module: easy to use; vocal cord vibration
Chapter 3: Experiment and Result
3.1. Experiment background
We used LabVIEW software as our development tool. The software used in our research is listed in Table 3.1.

Table 3.1 Experiment software and versions
Software
NI LabVIEW Core Software 8.6
NI LabVIEW FPGA Module 8.6
NI LabVIEW Real-Time Module 8.6
NI-RIO 3.0.0
CompactRIO 3.0.1
The hardware equipment is listed in Table 3.2.

Table 3.2 Experiment hardware equipment and model numbers
Hardware
NI cRIO-9014
NI cRIO-9103
NI 9474
NI 9215
Nita EXB Standup Wheelchair
Carol Handheld Cardioid Dynamic Microphone GS-55
Our speech recognition settings are as follows [10]:

Table 3.3 Speech recognition settings (the listed training sets are used to train the models, with the remaining 1 set as testing data)

We recorded the instructions as speech in Mandarin, English and Fukienese, and as vocal cord vibration in Mandarin. Signal intensity graphs of the recordings are shown in the following figures.
Figure 3.9 Signal intensity graph of vocal cords vibration: right in Mandarin
Figure 3.10 Signal intensity graph of vocal cords vibration: stop in Mandarin
Figure 3.11 Signal intensity graph of speech: forward in English
Figure 3.12 Signal intensity graph of speech: backward in English
Figure 3.13 Signal intensity graph of speech: left in English
Figure 3.14 Signal intensity graph of speech: right in English
Figure 3.15 Signal intensity graph of speech: stop in English
Figure 3.16 Signal intensity graph of speech: forward in Fukienese
Figure 3.17 Signal intensity graph of speech: backward in Fukienese
Figure 3.18 Signal intensity graph of speech: left in Fukienese
Figure 3.19 Signal intensity graph of speech: right in Fukienese
Figure 3.20 Signal intensity graph of speech: stop in Fukienese
3.2. Speech recognition result
At first, our speech recognition focused on Mandarin instructions. We used 2, 3 and 6 states to train the model. The correct rates were as follows:
Figure 3.21 Speech in Mandarin recognition correct rate
The correct rate in this situation is good: at least 80%. The figure shows that with 3 training sets, the correct rate is at least 95% whatever number of states is chosen. The more training sets and states, the more time it takes to train the speech model. Considering program execution time and correct rate together, we think 3 training sets is the better choice: the execution time is short and the correct rate is acceptable to users.
After finishing Mandarin speech recognition, we became interested in recognizing vocal cord vibration in Mandarin, so we recorded 25 instruction sets for experiments like those described above. The correct rate results are shown in Fig. 3.22:
Figure 3.22 Correct rates of vocal cord vibration recognition
The vocal cord recognition rate was not good enough for us: whatever training sets and states were chosen, the correct rate stayed under 90%, worse than Mandarin speech recognition.
We also recorded speech in English and Fukienese (see Figs. 3.23 and 3.24).
Figure 3.23 Correct rates of speech in English recognition
Figure 3.24 Correct rates of speech in Fukienese recognition
Speech in English is interesting: 2 states and 3 states perform better than 6 states. Overall, though, the correct rate is not very good. There are many possible reasons: recording error, pronunciation error, and even the possibility that the recognition algorithm is unsuitable for English speech; answering this will require future experiments.
Speech in Fukienese also shows a good correct rate between 3 and 20 training sets. In particular, the 3-state case stands out in the results obtained.
We then focused on vocal cord vibration recognition, collecting vocal cord vibration data from 10 users, half male and half female, and recognizing it. The average age of the males was 24 years (standard deviation 0.4); the average age of the females was also 24 (standard deviation 1.7). The number of training sets ranged from 1 to 5, with 2 or 3 states. The correct rates are shown below:
Figure 3.25 Recognition correct rates of five boys (Ba-Be) with 2-state training (vocal cords, Mandarin; correct rate vs. number of training sets)
Figure 3.26 Recognition correct rates of five girls (Ga-Ge) with 2-state training (vocal cords, Mandarin; correct rate vs. number of training sets)
Figure 3.27 Recognition correct rates of five boys (Ba-Be) with 3-state training (vocal cords, Mandarin; correct rate vs. number of training sets)
Figure 3.28 Recognition correct rates of five girls (Ga-Ge) with 3-state training (vocal cords, Mandarin; correct rate vs. number of training sets)
Some users obtained worse recognition rates with this system. There are several possible reasons: they were using the system for the first time and were not yet familiar with it, and the algorithm may not be robust enough, requiring stronger vocal cord vibration.
3.3. Mechanical and electrical integration
Attaching the controller to the wheelchair is a fairly arduous task requiring knowledge of how to drive a motor from a control signal. Our controller can carry at most 1 A of current, while the current through the motor is about 2-3 A. At first we routed the motor current through the controller; this did not work, and the controller shut itself off in protection. We solved the problem with relays: four relays to control the motors and four to control the brakes. To rotate a motor, we must first power on the brake relay; only when the brake is released can the wheel rotate. Motor control uses four relays: one for right wheel motor rotation, one for right wheel motor reverse rotation, one for left wheel motor rotation and one for left wheel motor reverse rotation. Brake control mirrors motor control. When the controller commands the wheelchair to operate, e.g. forward, it outputs the corresponding signals. Table 3.4 shows the relations between instruction and motor/brake control.
Table 3.4 Relations between instruction and motor/brake control

Instruction | Motor and brake control
Forward     | Right and left wheel motors rotate; both brakes power on
Backward    | Right and left wheel motors reverse rotate; both brakes power on
Left        | Right wheel motor rotates, left wheel motor reverse rotates; both brakes power on
Right       | Right wheel motor reverse rotates, left wheel motor rotates; both brakes power on
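A sketch of Table 3.4 as relay on/off states (the relay names and the pairing of each brake relay with its motor relay are our assumptions about the wiring, not the thesis's actual channel assignment on the NI 9474):

```python
MOTOR_RELAYS = ("R_fwd", "R_rev", "L_fwd", "L_rev")
BRAKE_RELAYS = ("bR_fwd", "bR_rev", "bL_fwd", "bL_rev")

COMMANDS = {                          # energized motor relays per instruction
    "forward":  {"R_fwd", "L_fwd"},
    "backward": {"R_rev", "L_rev"},
    "left":     {"R_fwd", "L_rev"},   # right wheel rotates, left reverses
    "right":    {"R_rev", "L_fwd"},
    "stop":     set(),                # all relays off: brakes engage
}

def outputs(instruction):
    """On/off state for all eight relays; each brake relay is powered
    together with its motor relay so the brake releases before the
    wheel turns."""
    on = COMMANDS[instruction]
    state = {r: r in on for r in MOTOR_RELAYS}
    state.update({b: m in on for b, m in zip(BRAKE_RELAYS, MOTOR_RELAYS)})
    return state
```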
Figure 3.29 shows the physical, mechanical and electrical integration of the controller and the wheelchair.
Figure 3.29: Physical, mechanical and electrical integration
In Figure 3.29, the number 1 (marked in red and underlined) is the NI CompactRIO controller, and the number 2 (also marked in red and underlined) is the relays, which act as switches between the controller and the motors.
Figure 3.30 shows the overall physical system architecture, including the wheelchair, microphone, notebook and controller.
Figure 3.30: Overall physical system architecture
3.4. Control rule
When we command an instruction like left, how should the wheelchair run? At what angle should it turn, or for how long should the motor run? Anything concerning user safety could lead to serious problems, so at first we ran a conservative test, as Table 3.5 shows:
Table 3.5 Conservative control rule
Instruction Motor rotation time
Forward 1 sec
Backward 0.2 sec
Left 0.3 sec
Right 0.3 sec
This rule ensured safety, although it was not practical for the user. Still, for us it was inspiring to drive a wheelchair by speech. To better match real-world use, we designed another method.
Table 3.6 Suitable control rules for present situation
Instruction Motor rotation time
Forward Rotate until any signal input
Backward Rotate until any signal input
Left 0.3 sec
Right 0.3 sec
Many research teams have done in-depth research on control rules aimed at user satisfaction. The latter method is more practical than the former: thanks to the controller's FPGA module, data acquisition is rapid, and the wheelchair moves or stops depending on the signal input, as sketched below.
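A sketch of the rule in Table 3.6 (the timing values come from the table; the I/O callbacks are hypothetical stand-ins for the cRIO FPGA interface):

```python
import time

TURN_TIME = 0.3  # seconds for left/right, per Table 3.6

def run_instruction(instruction, drive, stop, signal_arrived):
    """drive(instruction) starts the motors, stop() halts them, and
    signal_arrived() polls for any new input (all hypothetical)."""
    drive(instruction)
    if instruction in ("forward", "backward"):
        while not signal_arrived():   # rotate until any signal input
            time.sleep(0.01)
    else:                             # left / right
        time.sleep(TURN_TIME)
    stop()
```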
Chapter 4: Discussion and Conclusion
4.1. Discussion
We have employed speech and vocal cord vibration to implement a type of mouth command control wheelchair. Below, we discuss some interesting problems.
4.1.1. Parameter setting in speech recognition
How can we improve the correct rate of speech recognition? Our goal is a correct rate of 95% with 3 training sets, whether for speech signals or vocal cord vibration signals. We discuss the speech recognition parameters as follows:
1. Frame size:
A frame usually occupies 20~30 ms; we set 240 samples as a frame (sampling rate: 10000 sample/sec), occupying 24 ms; we tried 200 or 300 samples as a frame, and observed the correct rate.
2. Frame overlap:
Usually, the frame overlap occupies 1/2 to 2/3 of the frame size; we set the frame overlap at 160 samples (frame size: 240 points), i.e., 2/3 of the frame size. We tried setting the overlap to 1/2 of the frame size, or to less than 2/3, and observed the correct rate.
4.1.2. Customized bio-signal acquiring module
In our present research, we have developed speech and vocal cord vibration recognition to control a wheelchair, although vocal cord vibration recognition is not as accurate as speech recognition. A bio-signal acquisition module should satisfy users' needs and the situations they face. If possible, we will try to develop a new method: breathing vibration as a bio-signal acquisition method. We list bio-signal acquisition methods and tools in Table 4.1.
Table 4.1 Bio-signal control interfaces and tools

Bio-signal            | Tool
Speech recognition    | Dynamic microphone, condenser microphone
Vocal cords vibration | Dynamic microphone, condenser microphone, accelerometer
Breathing vibration   | Piezoelectric materials, accelerometer
We have developed speech recognition and vocal cord vibration detection and transmission using a dynamic microphone, a possibly suitable tool for each of these bio-signal methods.
4.1.3. Human system
A mature speech recognition system is one in which the user can train the model with any spoken sentence, since it is difficult for some users to pronounce particular words or sentences. If the system could achieve this standard, it would be very helpful to users. Also, if a user commands an instruction that does not belong to any speech model, the system should simply reject it. Some recent systems have devoted efforts to achieving these objectives.
As mentioned in Chapter 3, we employed three languages, separately, to test speech recognition. There were good recognition rates in Mandarin and Fukienese, but that for English was not as good.
4.2. Conclusion
We began by looking at real cases of paraplegia patients, which gave us the idea of a vocal cord controller. In realizing this idea, we met many problems.
First, we developed a PC-based speech recognition system, after which we embedded the program into the NI cRIO. The controller's FPGA module handled the user input signal and the output signal to the wheelchair, while its real-time processor ran the speech recognition algorithm. Once the program was embedded in the cRIO, what remained to be done was the mechanical and electrical integration.
We integrated the controller into the wheelchair; the wheelchair's battery supplied power to the controller, and the motors responded to the control signals.
With our background in computer science, perfecting the system required cooperation with electrical integration engineers and with physiotherapists who could explain users' actual needs.
If you want to know more about our system, you can visit our website, where we have uploaded a demo video. The hyperlink is as follows: