The present research focuses on the effectiveness of employing an ASR system in pronunciation diagnosis and on learners’ perceptions of such innovative technology as an alternative means of evaluating and facilitating their pronunciation performance. Accordingly, the literature review in this chapter weaves together several threads from previous studies, including how ASR has developed, how it works and benefits language learners, and how users perceive and evaluate its performance. After six sections of detailed exploration, the research questions are proposed at the end of the chapter.
2.1 Automatic Speech Recognition Technology
2.1.1 A Revolutionary Character in CAPT
During the surge of technological advancement in the 1980s, teachers and researchers alike expected computers to assume a greater role as facilitators of language learning. Among the skills addressed by computer-assisted language learning programs, one has proved particularly difficult to handle: speech and pronunciation (Cordier, 2009). With the emergence of automatic speech recognition (ASR) technology, it was therefore hoped that the tool could initiate changes in computer-assisted pronunciation training (CAPT) and transform CAPT into an intelligent, user-adaptive learning companion (Ehsani & Knodt, 1998; Kim, 2006). In essence, automatic speech recognition is an independent, machine-based process of decoding incoming speech and transcribing it on the basis of a predetermined model and algorithm (Levi & Suvorov, 2012). Of the several kinds of speech engines embedded in ASR programs, the Hidden Markov Model (HMM) has been widely used and shown to be an effective method of handling a large vocabulary through sophisticated statistical computation (Ehsani & Knodt, 1998; Hinks, 2003; Kim, 2006; Levi & Suvorov, 2012). The intricate technical details of HMMs are beyond the scope of the present study. In brief, when receiving spoken input, an HMM-based program compares the incoming speech with its built-in model and determines their degree of similarity on the basis of probability theory. More recently, other engines have come into existence, such as the Kaldi Speech Recognition Toolkit, which recent studies have found efficient and easy to adapt (Gaida et al., 2014). The rapid development of speech engines is likely to provoke another wave of growth in automatic speech recognition and open a new page in the fields of CAPT and ASA (Chen & Chen, 2018).
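To make the idea of probability-based similarity concrete, the following is a minimal sketch of how a discrete HMM can assign a likelihood to an incoming sequence using the forward algorithm. It is an illustration only, not the engine of any program reviewed here: the two-state model, its probabilities, and the three-symbol "acoustic" alphabet are all hypothetical, and real recognizers work with continuous acoustic features and far larger models.

```python
import math
from itertools import product


def forward_log_likelihood(obs, start, trans, emit):
    """Return log P(obs | model) for a discrete HMM (scaled forward algorithm)."""
    n_states = len(start)
    # Initialise with the start distribution weighted by the first emission.
    alpha = [start[s] * emit[s][obs[0]] for s in range(n_states)]
    log_prob = 0.0
    for t in range(len(obs)):
        if t > 0:
            # Propagate through the transitions, then weight by the emission.
            alpha = [
                sum(alpha[p] * trans[p][s] for p in range(n_states)) * emit[s][obs[t]]
                for s in range(n_states)
            ]
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")  # the model cannot produce this sequence
        log_prob += math.log(scale)
        alpha = [a / scale for a in alpha]  # rescale to avoid underflow
    return log_prob


# Hypothetical model of one word: two states, three quantised sound symbols.
START = [1.0, 0.0]
TRANS = [[0.6, 0.4], [0.0, 1.0]]
EMIT = [[0.8, 0.1, 0.1], [0.1, 0.1, 0.8]]

# An utterance resembling the model versus one that does not.
close_ll = forward_log_likelihood([0, 0, 2, 2], START, TRANS, EMIT)
far_ll = forward_log_likelihood([1, 1, 1, 1], START, TRANS, EMIT)
print("closer to the model:", "first" if close_ll > far_ll else "second")
```

The higher log-likelihood marks the utterance that the model regards as more similar to its stored pattern, which is the sense in which similarity is "based on probability theory" above.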
2.1.2 The Discrepancy Issue between ASR Systems and Human Raters
Typically, the full process by which automatic speech recognition assists learning can be simplified as follows: the acoustic signals uttered by the speaker are received via a microphone and temporarily stored in a recording space. To pinpoint the learner’s deviations from the native model, those signals are compared with the built-in model so as to separate acceptable from unacceptable utterances.
Within seconds, the ASR system generates automatic feedback on local pronunciation problems and on the overall degree of nonnativeness displayed in the learner’s speech (Precoda & Bratt, 2008). Ideally, the moment learners receive the feedback, they can grasp roughly how far they deviate from the model pronunciation and how they can fine-tune their utterances to achieve higher accuracy. The effectiveness of the feedback, however, relies heavily on the form in which it is presented; this issue is addressed in the section below.
In addition, whether automatic speech recognition is reliable enough to provide consistent assessment of similar input, and whether it maintains a high degree of agreement with the evaluations of human raters, have always been major concerns for teachers (Ashwell & Elam, 2017; Neri, Cucchiarini & Strik, 2003). Certainly, there is little possibility that humans and computers will render exactly the same evaluation every time; as Bernstein and Franco (1996) noted, “Humans and machines process speech in fundamentally different ways.” Ehsani and Knodt (1998) also specified the differences between humans and machines in recognizing and rating speech:
Complex cognitive processes account for the human ability to associate acoustic signals with meanings and intentions. For a computer, on the other hand, speech is essentially a series of digital values. However, despite these differences, the core problem of speech recognition is the same for both humans and machines: namely, of finding the best match between a given speech sound and its corresponding word string. Automatic speech recognition technology attempts to simulate and optimize this process computationally. (Ehsani & Knodt, 1998)
Despite the distinct processing mechanisms of the two evaluators, computers and human beings, both undoubtedly share the same goal: correctly diagnosing speech and offering effective feedback on learners’ problematic sounds.
Therefore, if ASR technology is to serve as an assistant to, or even a substitute for, human instructors in pronunciation teaching and diagnosis, it must be consistent in the quality of feedback it produces and sensitive enough to decode learners’ digitized speech and locate inappropriate pronunciation.
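One common way to quantify the agreement discussed above is to correlate the scores given by human raters with those produced by the system for the same utterances. The sketch below computes a Pearson correlation coefficient as one possible measure; the studies cited here may have used other statistics, and the score lists are hypothetical values invented for illustration, not data from any study.

```python
import math


def pearson(xs, ys):
    """Return the Pearson correlation between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least two scores")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("scores must vary for a correlation to be defined")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical scores for five learners (illustration only).
human_scores = [62, 70, 75, 81, 90]
asr_scores = [58, 72, 71, 85, 88]

r = pearson(human_scores, asr_scores)
print(f"human-machine agreement (Pearson r): {r:.2f}")
```

A coefficient close to 1 would indicate that the system ranks learners much as the human raters do, although it would not by itself show that the two assign the same absolute scores.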
2.2 Automatic Speech Recognition for Pronunciation Instruction
2.2.1 Limitations in the Past
Before the birth of ASR technology, computers had already taken part in pronunciation training in a limited, though ASR-oriented, fashion. Pronunciation training at that time suffered from limitations such as a narrow range of exercise types, an insufficient supply of teachers, and a lack of real-time feedback (Neri, Cucchiarini & Strik, 2001). In CAPT systems without ASR technology, most exercises aimed to train receptive competence, while few urged learners to produce their own utterances. Moreover, learners using some systems were responsible for comparing their own recorded voices with native utterances entirely by themselves. This practice has been seriously questioned, since studies show that learners on their own have a hard time interpreting the phonetic discrepancies between their mother tongue and the target language, let alone extracting useful information about the mistakes that need adjustment (Chen, 2012; Luo, 2016). Such a requirement is too demanding and too specialized for learners to meet.
On the other hand, in CAPT systems that enlist teachers to evaluate recorded practice, learners encounter a problem similar to that of regular language classes at school: the teacher-to-student ratio is unfavorable, so students receive unequal amounts of instruction time, and some receive feedback a few minutes later than their peers. Real-time or immediate intervention is believed to be effective in preventing learners from repeating errors. Delayed feedback, by contrast, makes learners wait, and even a short wait can lessen their motivation to keep practicing or leave them with a false impression of certain pronunciations (Eskenazi, 1999).
In light of these problems, ASR technology appears to be the most promising candidate for creating an optimal environment for pronunciation practice. Each learner is unique; likewise, the errors learners produce vary and deserve a one-on-one tutor who listens and attends to every nuance that deviates from the model pronunciation. With ASR technology at hand, learners in effect have private tutors who can concentrate on one learner at a time and guide them through the fog of pronunciation exercises (Chen, 2011; Chiu et al., 2007; Eskenazi, 1999). As Eskenazi (1999) and many other researchers have noted, language learning appears to be most efficient when the teacher can constantly monitor learners’ progress and employ proper remediation measures.
In reality, however, barriers such as the teacher-student ratio, scheduling, and exam pressure make it impossible for teachers to keep track of each student’s progress. In the realm of pronunciation training, ASR therefore shoulders the mission of guiding learners at all times without fatigue. Its constant availability and patience naturally make the ASR system a suitable learning tool (Eskenazi, 1999).
2.2.2 An Affectionate and Lovely Beginning
Despite its great benefits for language learning, ASR was initially designed for other purposes. In fact, it was not until the late 1970s that ASR took root in the educational field. Behind the very beginning of ASR technology for language learning lies a touching episode: when Destombes at IBM France learned that his daughter was deaf, he set out to invent a system to help her practice speaking intelligibly (Eskenazi, 2009). His ASR software displayed pitch and intensity against time, and its interface was designed to be game-like. Although this stage offered merely simple pitch displays and individual phone detection, it marked a giant leap forward. Since then, slowly but steadily, ASR technology has been applied to pronunciation learning across different age groups and integrated into several kinds of speaking activities, making pronunciation learning a relatively realistic and engaging process (Elimat & AbuSeileek, 2014).
2.2.3 Deeper Investigation into the Principles for ASR-based CAPT
To develop a favorable ASR-based pronunciation training system, several principles should be observed, which Eskenazi (1999) listed on the basis of Kenworthy’s (1987) principles of successful pronunciation learning. The five principles are outlined and briefly discussed below:
a) Learners must produce large quantities of sentences on their own.
This remains an ideal scenario, since most ASR-based programs so far have not allowed learners to produce their own sentences; instead, learners play a rather passive role while interacting with ASR and can only practice ready-made sentences or dialogues. Owing to technical limitations, there is little opportunity to practice sentences constructed by the users themselves, so their creativity in producing sentences is, in a sense, not encouraged. For feedback to remain highly reliable, the underlying speech recognizer must be able to deal with unexpected utterances. This is therefore a niche on which researchers and programmers can work.
b) Learners must receive pertinent corrective feedback.
The keyword here is “pertinent.” What concerns teachers the most is the quality of error detection and the provision of appropriate feedback. If these two requirements are met, the worries of both teachers and learners are eased.
c) Learners must hear many different native models.
It is advised that learners be exposed to native teachers from different backgrounds (Celce-Murcia & Goodwin, 1991). In this way, learners collect more speech data and expand their speech repertoire. They may also feel less threatened when faced with English spoken in different accents.
d) Prosody (amplitude, duration and pitch) must be emphasized.
Prosody is like decoration that helps convey the emotion of an utterance. Through prosody, including intonation, pitch and so forth, speakers’ intended meaning can be fully expressed. Chun (1998) therefore argues that prosody should be taught from the beginning of language learning.
e) Learners should feel at ease in the language learning situation.
This issue has already been mentioned in the previous sections, but Eskenazi (1999) raised two further points. First, if the amount of interruption, namely timely feedback, is adapted to the level learners can tolerate, the discouraging effects of correction can be avoided. Second, “false negatives” should be prevented or kept to a minimum, which has long been an unsolved challenge in ASR-based pronunciation training programs (Eskenazi, 2009). Sometimes learners speak correctly but are judged to be wrong, which in turn extinguishes their passion for constant practice. In fact, studies have shown that learners who are told they are wrong when they are right feel strong negative emotions, whereas learners who are told they are right when they are in fact wrong (i.e., a false positive) feel less negative emotion. It is thus important to keep false negatives from occurring frequently.
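The two kinds of misjudgment described under principle e) can be counted once each utterance has both a trusted judgment (for instance, from a human rater) and a system verdict. The following is a minimal sketch that follows this section's usage of the terms, where a false negative is correct speech judged wrong and a false positive is incorrect speech judged right. The function name and the eight judgments are hypothetical and serve only as an illustration.

```python
def error_rates(truly_correct, judged_correct):
    """Return (false_negative_rate, false_positive_rate) as defined in this section."""
    if len(truly_correct) != len(judged_correct):
        raise ValueError("both lists must have the same length")
    pairs = list(zip(truly_correct, judged_correct))
    # False negative: the learner was right but the system said wrong.
    fn = sum(1 for truth, verdict in pairs if truth and not verdict)
    # False positive: the learner was wrong but the system said right.
    fp = sum(1 for truth, verdict in pairs if not truth and verdict)
    n_correct = sum(1 for truth, _ in pairs if truth)
    n_incorrect = len(pairs) - n_correct
    fn_rate = fn / n_correct if n_correct else 0.0
    fp_rate = fp / n_incorrect if n_incorrect else 0.0
    return fn_rate, fp_rate


# Hypothetical judgments for eight utterances (not data from any study).
human = [True, True, True, True, False, False, False, False]
system = [True, False, True, True, True, False, False, False]

fn_rate, fp_rate = error_rates(human, system)
print(f"false negative rate: {fn_rate:.2f}, false positive rate: {fp_rate:.2f}")
```

Because false negatives are reported to be the more discouraging of the two, a system designer might accept a somewhat higher false positive rate in exchange for a lower false negative rate.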
In general, the first four principles describe the physical environment of the learning system, while the last portrays learners’ mental environment, such as their emotional state or capacity for taking in information. Once the outer environment blends harmoniously with the inner one, it may foster a far more amiable atmosphere for speaking practice, one that breeds successful pronunciation improvement. Thanks to ASR technologies, what teachers could hardly attend to previously can now be carefully treated.
2.2.4 Diving into Procedures in ASR-based CAPT
Now that the principles that ASR-based pronunciation learning should ideally abide by have been reviewed, it is equally important to examine its mechanisms more closely. This section covers how automatic speech recognition works from the moment it receives learners’ oral production to the instant it returns feedback with detailed evaluation marks. As Neri, Cucchiarini and Strik (2003) proposed, there is an ideal sequence of five phases for pronunciation training programs powered by automatic speech technologies: 1) speech recognition, 2) scoring, 3) error detection, 4) error diagnosis and 5) feedback presentation. Each is described in turn below:
1) Speech recognition: This is the first and most significant step. The ASR engine decodes the incoming speech data and converts them into a string of words on the basis of their phonetic and syntactic features; the subsequent steps depend heavily on the accuracy of this conversion. As Neri et al. (2003) stated, the pedagogical contribution of ASR-based CAPT lies precisely in evaluating learners’ pronunciation quality. Hence, whether the ASR engine is reliable is a core question.
2) Scoring: In this step, learners are informed of the overall quality of their spoken utterances in numerical form. By comparing learners’ spoken output with reference native data, the ASR system performs an immediate evaluation and reports a definite score. The closer the learners’ utterances are to those of the native models, the higher the expected score. Through the score, learners obtain a basic understanding of their overall pronunciation quality.
3) Error detection: In this phase, the ASR system filters learners’ oral production and highlights the problematic utterances. By locating specific sound errors within a word, it draws learners’ attention to the erroneous areas that require further practice. It is believed that learners’ awareness is particularly raised when errors are accurately detected.
4) Error diagnosis: Merely showing learners which parts of their pronunciation are erroneous is not enough; it is like offering students fishing rods without teaching them how to fish. After learners’ awareness has been raised, a professional ASR system identifies the specific type of error made by the student and, moreover, suggests a way to improve. This mechanism is essential in that it is difficult for learners to identify the exact nature of their errors by resorting to their background knowledge alone.
5) Feedback presentation: This phase is the final and generalized one, which summarizes the information secured and displayed in the previous phases. Whether the overall score is presented as a graded bar or as a number on a given scale, it should be interpretable to every learner. Therefore, the design matters since learners will only be able to make full use of the information supplied by the ASR system if it is shown in a meaningful fashion.
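The five phases above can be sketched as a chain of small functions, each feeding the next. This is a toy illustration under stated assumptions rather than the design of any reviewed program: the recognition phase is a stub that returns canned output instead of decoding audio, and the phone labels, the 0 to 1 score scale, the 0.5 threshold, and the hint table are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Phone:
    expected: str  # phone in the reference (model) pronunciation
    heard: str     # phone the recognizer decoded (canned in this sketch)
    score: float   # similarity to the reference, 0.0 to 1.0 (hypothetical scale)


def recognize(utterance_id):
    """Phase 1 (stub): a real system would decode audio; this returns canned output."""
    canned = {
        "think": [Phone("TH", "S", 0.35), Phone("IH", "IH", 0.90),
                  Phone("NG", "NG", 0.85), Phone("K", "K", 0.95)],
    }
    return canned[utterance_id]


def overall_score(phones):
    """Phase 2: average the phone scores into a 0-100 utterance score."""
    return round(100 * sum(p.score for p in phones) / len(phones))


def detect_errors(phones, threshold=0.5):
    """Phase 3: flag the positions whose score falls below the threshold."""
    return [i for i, p in enumerate(phones) if p.score < threshold]


def diagnose(phone):
    """Phase 4: name the error type and suggest a remedy (hypothetical hint table)."""
    hints = {("TH", "S"): "Place the tongue tip between the teeth for TH."}
    if phone.expected != phone.heard:
        hint = hints.get((phone.expected, phone.heard), "Listen to the model and retry.")
        return f"substituted {phone.heard} for {phone.expected}. {hint}"
    return f"{phone.expected} was unclear. Listen to the model and retry."


def present_feedback(utterance_id):
    """Phase 5: combine the earlier phases into learner-facing text."""
    phones = recognize(utterance_id)
    lines = [f"Score: {overall_score(phones)}/100"]
    for i in detect_errors(phones):
        lines.append(f"Sound {i + 1}: {diagnose(phones[i])}")
    return "\n".join(lines)


report = present_feedback("think")
print(report)
```

Keeping the phases separate mirrors the dependency noted under phase 1: if recognition returns the wrong phones, every later function inherits the error.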
2.3 ASR-based English Learning Programs and Their Exercise Design
There are various pronunciation training programs supported by ASR engines, such as TRACI Talk, Candle Talk, Tell Me More, MyET and many other emerging language learning programs that focus on developing speaking competence in different languages. With the advance of technology, most of them can be readily accessed through personal computers or mobile devices. As for content design, learners will find oral exercises ranging from rudimentary vowel and consonant pronunciation practice, through daily conversation activities based on role-play in different genres, to a task-based crime scenario in which the learner investigates a mystery as a detective. Once exposed to the ASR-enhanced environment, each learner starts down the path toward improved oral skills.
Chen (2001) conducted a detailed review of five commercial ASR software programs: CNN Interactive English, Syracuse English Comprehensive Learning Series, TeLL Me More Pro, TRACI Talk and Encarta Interactive English Learning. Even though these programs do not quite fulfill the ideal learning conditions proposed by Carol Chapelle (1998), Chen found that their far-reaching influence should not be underestimated, since they encourage learners to generate far more spoken output than they usually do in a regular lecture classroom.
As for the learning content and exercise design of these five programs, the material of CNN Interactive English is drawn from a CBS TV comedy show, Caroline in the City. Learners can choose a particular role and say the lines out loud, and the ASR engine does its part. The Syracuse English Comprehensive Learning Series offers similar activities, such as role-play accompanied by videos. Another type of activity is choosing the best response by saying it out loud: the screen shows three options in response to a question raised by a virtual partner. TeLL Me More Pro includes two types of activities. One is to repeat selected words and sentences, a kind of training that lays the building blocks of pronunciation. The other is to choose the most appropriate response to a question by saying it out loud. This kind of practice on the one hand evaluates students’ speaking accuracy and on the other tests their background knowledge of how to respond appropriately in a conversation.
The fourth program is TRACI Talk, an acronym for Teacher Ranging Across the Computer Interface, which indicates a hidden teacher within the computer who is ready to assist at any moment. Playing the role of a sleuth, learners engage in a string of task-based conversations to gather the necessary clues from four suspects and unravel the mystery. In the process, all they have to do is keep speaking, whether making inquiries or answering questions clearly, which naturally immerses them in a gripping English-speaking environment. Once players feel confident enough to identify the culprit, they can advance to the next stage and answer eight questions randomly chosen from a pool of 60. If they succeed in the eight challenges, they may have the chance to collect more favorable evidence. If they fail, there will be more opportunities to engage in additional oral practice. In fact, the virtual partners they converse with are their teachers in the game. In addition, when learners get lost and want instructions for the next step, they can say “Traci” into the microphone, and a box showing the task content will appear. TRACI Talk thus combines speaking improvement with game elements, making oral practice enjoyable.
The fifth is Encarta Interactive English Learning, which combines video, real-time 3D animation and ASR engines. It consists of 10 units in total and aims at enhancing listening, speaking and grammar ability and enlarging learners’ vocabulary. Initially, learners need to create their own profiles, which help keep track of the progress they make. Then more than 360 activities and 10 formative assessments, called Virtual Challenge, await them. Oral output is elicited and examined when learners participate in the