An Evaluation Study on Using an Automatic Speech Recognition System to Identify EFL Students' Pronunciation Problems


Master's Thesis
Department of English, National Taiwan Normal University

An Evaluation Study on Using an Automatic Speech Recognition System to Identify EFL Students' Pronunciation Problems

Advisor: Dr. Hao-Jan Chen (陳浩然)
Student: Yi-Ting Yeh (葉伊婷)

August 2019

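ABSTRACT (translated from the Chinese)

For learners of English as a foreign language (EFL), speaking ability has always been an important skill. Owing to many constraints, however, the cultivation of speaking ability is often neglected and seldom taught. With the development of automatic speech recognition (ASR) technology, teachers now have more opportunities to train students' oral skills. The purpose of this study is to explore how learners interact with LearnMode, a free website equipped with ASR technology. The website not only lets teachers design their own exercises but also helps students gradually become accustomed to speaking English in a low-anxiety environment. While learners practice with the ASR system, the system diagnoses their problematic pronunciation and provides immediate corrective feedback in the form of color highlighting or comments. A total of 66 senior high school students participated in this study and completed 20 speaking tasks designed around easily confused vowels and consonants. The participants also filled out a questionnaire after completing the tasks, and 10 of them took part in one-on-one interviews so that their perceptions of and attitudes toward speech recognition technology could be understood in greater depth. The results show a high degree of agreement in error detection between the ASR tool and two teachers: of the five units randomly sampled from the twenty, four showed more than 80% similarity between the ASR tool and the teachers. That is, 85% of the mispronounced words identified by the ASR tool and by the teachers were the same. In addition, the questionnaire and interview results show that students enjoy having the assistance of the ASR system when practicing speaking and are more willing to speak because they are free of pressure from peers and teachers. Nevertheless, they still wish to have a teacher at hand to help them improve problematic pronunciation in real time. Regarding the immediate feedback mechanism, participants considered the color highlighting helpful but also hoped for further guidance on how to improve. It is hoped that these findings make a modest contribution for teachers who wish to help students improve their speaking ability and for researchers who plan to develop ASR technology for language learning.

Keywords: speech recognition, EFL learners, speaking, difficult pronunciation, comparison of human and machine scoring, perceptions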

ABSTRACT

English speaking ability has been widely recognized as an important skill for EFL learners. Due to many constraints, though, the cultivation of speaking skills is often neglected and seldom taught at school. With the development of automatic speech recognition (ASR) technologies, teachers can provide students with more opportunities to train their oral abilities. The current study investigates how learners interact with a website named LearnMode. Enhanced by automatic speech technology, the website allows teachers to create their own speaking exercises and helps EFL learners get accustomed to speaking English in a low-anxiety environment. At the same time, the ASR system can diagnose their problematic pronunciation and offer immediate corrective feedback in the form of color highlighting and comments. In total, 66 senior high school students were invited to complete 20 tasks on difficult pairs of vowels and consonants. A questionnaire and one-on-one interviews were also administered to probe into the learners' perceptions of and attitudes toward the speech technologies. The results indicate a high degree of agreement in error detection between the automatic speech recognition system and human raters. In five units randomly selected from the twenty, four showed that over eighty-five percent of the mispronounced words located by the ASR technology and by the teachers were the same. The results also show that learners enjoy ASR assistance and are more willing to speak English, but they still want teachers to help them refine their problematic sounds. With regard to the immediate feedback mechanism, participants consider the color highlighting helpful, but they would like further instructions on how to make adjustments. These findings can serve as useful information for teachers who would like to incorporate speaking enhancement into their teaching and for researchers who intend to develop better ASR technologies for language teaching and learning.

Keywords: Automatic speech recognition, EFL learners, Speaking, Difficult pronunciation, ASR and human ratings, Perceptions toward ASR

ACKNOWLEDGEMENT

On my arduous journey to the completion of the present research, I would like to extend my sincerest gratitude to my mentor Prof. Howard, who has always kindly and patiently encouraged me to conquer the difficulties that life throws at me. Also, thanks to Prof. Howard's relentless efforts on the English learning website Cool English, which aims at helping Taiwanese learners improve their English abilities, I was honored with the chance to participate in this meaningful project and was thus exposed to the issue of automatic speech recognition technology. Had it not been for the Professor's guidance, I would not have been inspired to investigate this interesting research topic or derived so much pleasure from the interaction with the participants of the study. Thank you so much, Professor Howard!

My great thankfulness also goes to the committee members, Prof. Berlin and Prof. Hsueh-Ying. My thesis would not be complete without their valuable suggestions and genuine encouragement.

In addition to the professors' kind assistance in both academic and spiritual aspects, I did not fight the battle with this thesis alone, since many friends and family members accompanied me all the time. I want to express my deep gratitude to my graduate school friends, Double, Donna and Lisa, for listening to me and easing my anxiety whenever I felt frustrated. My heartfelt gratitude also goes to my dearest SGI comrades, Sabrina C., Jessie, Gina, Yen, Pei-Hua, Jia-Yi, Ru-Yi, Shu-Min, Wen-Wen, Sae, Daisaku Ikeda Sensei and many other friends who unfailingly support me with their warm hearts and words of wisdom. I truly could not have made it without their persistent trust and confidence in me.

Most importantly, I would love to show my profound gratitude and deepest love to my dearest mom, who brought me up with great patience and courageously conquers all kinds of hardships in life with her remarkable resilience. The role model she sets has established my optimistic perspective on life in the face of any adversity.

Thanks to all of the important people I have met and the obstacles I have encountered, I have been able to become a better person and eventually accomplish this thesis despite many unexpected ups and downs on the path to the destination. Thank you all, professors, mom and my friends, and I'll work even harder in return for your kindness and great love.

Last but not least, I want to quote a passage written by Daisaku Ikeda that inspires me to a great extent whenever I feel disappointed.

“Whether we regard difficulties in life as misfortunes or whether we view them as good fortune depends entirely on how much we have forged our inner determination. It all depends on our attitude or inner state of life. With a dauntless spirit, we can lead a cheerful and thoroughly enjoyable life. We can develop a “self” of such fortitude that we can look forward to life’s trials and tribulations with a sense of profound elation and joy: “Come on obstacles! I’ve been expecting you! This is the chance that I’ve been waiting for!”

Table of Contents

CHAPTER ONE Introduction
  1.1 Research Background
  1.2 Problem Statement
  1.3 Purpose of Study
  1.4 Significance of Study
  1.5 Definition of Terms

CHAPTER TWO Literature Review
  2.1 Automatic Speech Recognition Technology
    2.1.1 A Revolutionary Character in CAPT
    2.1.2 The Discrepancy Issue between ASR Systems and Human Raters
  2.2 Automatic Speech Recognition for Pronunciation Instruction
    2.2.1 Limitations in the Past
    2.2.2 An Affectionate and Lovely Beginning
    2.2.3 Deeper Investigation into the Principles for ASR-based CAPT
    2.2.4 Diving into Procedures in ASR-based CAPT
  2.3 ASR-based English Learning Programs and Its Exercise Design
  2.4 Different Types of Feedback in ASR Technologies
    2.4.1 Summary of the Feedback Types in ASR Programs
  2.5 Automatic Speech Scoring and Its Accuracy
  2.6 Learners’ Perceptions of Using Automatic Speech Technologies
  2.7 The Summary of Literature Review
  2.8 Research Questions

CHAPTER THREE Methodology
  3.1 Participants
  3.2 Procedure of Study
    3.3.1 The Oral Practice Task: 10 pairs of difficult pronunciation to EFL learners
    3.3.2 LearnMode, a free learning website benefiting both teacher & students
    3.3.3 Feedback form, Questionnaire and Interview
    3.3.4 Human Raters and the Scoring
  3.4 Data analysis procedure
    3.4.1 Similarities and differences between the result of ASR and human raters
    3.4.2 The feedback form and responses to the questionnaire and interviews

CHAPTER FOUR Results and Discussion
  4.1 Results Evaluated by Human Raters and the ASR Technology
  4.2 The Problematic Words Ranked within 100th Detected by the ASR System
    4.2.1 The 100th Problematic Words vs. the Basic 2000 Vocabulary of Junior High
  4.3 The Relation between Participants’ Performance and the Text Nature
  4.4 The participants’ responses to the questionnaires
    4.4.1 Background information
    4.4.2 Participants’ perceptions of four English skills
    4.4.3 Participants’ emotions/feelings while dealing with English speaking
    4.4.4 Participants’ perceptions of feedback types and overall comments
    4.4.5 Participants’ perceptions of the ASR system (open-ended question)
    4.4.6 One-on-One Interviews

CHAPTER FIVE Conclusion
  5.1 Brief Summary and Pedagogical Implications
  5.2 The Limitations and Future Research Directions

REFERENCES

APPENDICES
  Appendix A. The Oral Practice Materials
  Appendix B. Feedback Form
  Appendix C. Questionnaire
  Appendix D. One-on-one Open-ended Interview Questions
  Appendix E. Mispronounced Wordlist and Its Error Rate

CHAPTER ONE
Introduction

1.1 Research Background

For citizens of the globalized 21st century, learning English as a foreign or second language is unquestionably a requisite capability, and Asia is no exception to this trend (Wang & Young, 2015). In Taiwan, however, learning English in a “silent mode” is familiar to students in the classroom from the third grade of primary school onward. Students often study the language quietly through uni-directional lectures, rote memorization or paper-and-pencil tests, while applying what they have learned through interactive oral activities remains challenging for both students and teachers. As a tool of communication, the role of English in the classroom should be reassessed. In addition to students, scholars and professionals from different disciplines may have chances to attend international meetings in academic or field-related circles, so there is also a strong need for them to improve their English oral abilities (Chen, 2011). Accordingly, the biggest and most urgent problem for Taiwanese students and adults alike is that, diligently as they may have studied English for a decade, they can hardly utter a sentence properly.

As one of the four skills in English learning, speaking competence is highly valued and recognized as important globally, yet it is taught less during regular school classes. Besides fluency, speaking with appropriate pronunciation has been found to play a vital role in effective communication (Pennington, 1999; Precoda & Bratt, 2008; Teng, 2002). The sound pattern of a language, its phonology, is, as Pennington (1999) indicated, the “surface of form in spoken mode.” This suggests the irreplaceable status pronunciation holds in language learning, because it is what the listener forms a first impression of in a conversation. According to Pennington (1999), phonological competence has a crucial effect on both the production and the reception of spoken language. Thus, speakers with a good command of pronunciation are likely to demonstrate better oral proficiency, make themselves understood effectively, offer precise responses, and in turn prevent most misunderstandings from arising (Celce-Murcia, Brinton & Goodwin, 1996). They will also have less difficulty decoding and comprehending spoken output. On the contrary, even if grammar and word use are correct, pronunciation below a certain level can impede the communication of the message and its intended meaning (Chen, 2011; Eskenazi, 1999). Learners who find it difficult to communicate successfully may feel less willing to practice the target language, which drives them into a vicious cycle. Despite the increasing attention paid to communicative needs and their importance, instruction and practice activities for speaking or pronunciation are still frequently neglected.

1.2 Problem Statement

Several factors could explain why speaking is seldom taught at school. These reasons are grouped into environmental and human factors and addressed in turn in this thesis. With respect to environmental factors, school teachers are restricted by tight schedules and the limited amount of time for each class, which makes it nearly impossible to devote a judicious proportion of time to speaking practice, not to mention having the leeway to pay attention to individual utterances or pronunciation accuracy (Ali, 2016; Chen, 2004; Derwing & Munro, 2015; Hincks, 2001; Wang & Young, 2014). Providing individualized feedback therefore turns out to be the least feasible choice under the current setting of large class sizes. Besides, since English is taught as a foreign language in Taiwan, students are rarely exposed to the language and have few opportunities to practice speaking outside the regular class. The scanty exposure in an EFL context in a certain way lowers learners’ motivation because there is no pressing need to speak English in their daily lives. Also, to conform to the long-standing paper-and-pencil exam framework, teachers tend to train students in reading and writing skills instead of oral production. Under these circumstances, both teachers and students are mired in the bind of not being able to use the language without inhibition. As for human factors, the two important parties involved are teachers and learners.

First, even if teachers try hard to squeeze in time for speaking activities, whether they have been trained well enough to have the expertise to correct students’ nuanced pronunciation errors and help them adjust is another problem (Ali, 2016; Derwing & Munro, 2015; Hincks, 2001; Saito, 2007). Moreover, teachers are seldom heard to assign “speaking practice homework.” For one thing, there are insufficient teaching materials designed for and suited to spoken English training at school. If teachers would like to assign speaking exercises, they have to spend extra time organizing the content by themselves, which might put great stress on them (Ali, 2016). For another, speaking practice homework is difficult both to submit and to evaluate. This is why speaking practice is often ignored outside of the classroom. Second, previous research has found that students have difficulty communicating or expressing their opinions in English publicly (Ali, 2016; Kimura, 2013; Neri, Cucchiarini & Strik, 2001; Wang & Young, 2014). This is because most of them are introverted and feel stressed when speaking in front of their classmates, afraid of making any tiny mistake that might make them lose face (Chiu, Liou & Yeh, 2007). What is more, the heterogeneous nature of each class makes it even more difficult for teachers to attend to everyone’s learning motivation and current competence. Therefore, the affective variables, namely performance anxiety and lack of confidence, to some extent become critical barriers to speaking improvement as well.

Fortunately, with the advancement and evolution of diverse multimedia technologies, learners now have far more opportunities to study English with the help of computers. Once they gain access to learning technology and programs tailored to their needs, the learning process need not be interrupted after school. Computer-assisted language learning (CALL) systems have indeed helped language learners in various respects in recent decades, and among them an emerging technology called automatic speech recognition (ASR) has proven to be the most effective one for helping learners improve their speaking (Chiu et al., 2007). After reviewing over 350 previous studies on technology use in foreign language teaching and learning, Golonka, Bowles, Frank, Richardson, and Freynik (2014) suggested that, in spite of an array of captivating technologies aimed at enhancing language skills, research shows only limited efficacy for most of them. However, computer-assisted pronunciation training (CAPT) and the use of chat to increase language production are different. These two innovative technologies stand out and have made a significant contribution to, and had a measurable impact on, language learning. ASR turns out to be the promising star that casts light on the difficulties teachers and students encounter, while also offering abundant pedagogical possibilities for both teaching and research.

1.3 Purpose of Study

Why has ASR become so popular as an alternative for speaking or pronunciation training? Why have most researchers held a positive attitude toward this technology-enhanced learning system as it has developed? As the technology progresses day by day, computers and related devices become ever more accessible and affordable to users. This convenience is welcome news for language learners who want to improve their language skills and for teachers who want to help them, particularly for skills that require repeated exercises and immediate individual feedback. Speaking, or pronunciation, is exactly one of the skills that demand a great deal of practice accompanied by prompt corrective feedback, which is an indispensable element in pronunciation instruction and learning (Ehsani & Knodt, 1998; Neri, Cucchiarini, Strik, & Boves, 2002). As a result, ASR technology seems helpful in bridging the gap between teachers and students who desire to improve speaking but who always despair when trying to put it into practice.

In an EFL context like Taiwan, most learners suffer from this distress. They are confident that they possess abundant syntactic and lexical knowledge comparable to that of speakers of the language. Yet when speaking on authentic occasions, they are still uncertain about using the language for successful communication, owing to insufficient practice as well as the interference of their mother tongue and cultural backgrounds (Chiu et al., 2007). Now, as long as teachers are willing to be the bridge builders connecting students to ASR technology, it not only saves teachers from the effortful process of teaching and correcting and takes care of learners’ affective concerns, but also flips the classroom by stimulating students to learn on their own. This new mode of learning to speak is seemingly a win-win situation that benefits both sides.

The literature reports a number of advantages of applying ASR technology to speaking practice, and they are grouped here into four dimensions: (1) who, (2) when, (3) where and (4) what. More details are provided in the following passages.

First, for dimension (1) who, learners using ASR to practice speaking have independent control over their learning schedule (Elimat & AbuSeileek, 2014; Purushotma, 2005; Wachowicz & Scott, 1999). Thanks to the instant corrective feedback, the ASR system naturally becomes their study partner and individual tutor. There is no need to face classmates and teachers, and learners do not have to worry about the invisible competition between peers, which might greatly reduce the fear of a threatening atmosphere and lower their anxiety and embarrassment accordingly (Elimat & AbuSeileek, 2014; Eskenazi, 1999). All they have to do is click the button, speak, and wait for the evaluation from the ASR system. The affective aspect and their privacy are both taken care of through self-study. Teachers, for their part, can incorporate ASR technology into the originally time-consuming drilling practice. The appeal of using computers in class and obtaining immediate feedback from an ASR system could greatly arouse learners’ motivation, as past studies suggest.

For dimension (2) when, learners are allowed to practice at their own pace, with sufficient and flexible time that is not restricted by school hours. Whether they would like to review what they learned yesterday or move on to the next challenge in advance, they can make the decision by themselves. In this way, the tight schedule or time pressure at school is no longer an issue for speaking practice. They can practice speaking at literally any time, with the ASR checking their accuracy. Furthermore, unlimited trial and error is permitted without causing impatience or fatigue in others (Chen, 2004). If learners are not satisfied with their performance, they can keep trying until they fully grasp how to pronounce the item.

Third, for dimension (3) where, learners are released from the shackles of practicing speaking “only” in the classroom. With a laptop or cellphone equipped with adequate ASR programs at hand, any quiet place can be a good environment for speaking exercises.

Lastly, for dimension (4) what, one learner, one computer with a microphone, and an ASR-based CALL system are all that is required to make up a speaking practice scenario. It does not take tremendous effort to find a one-on-one tutor to listen for and point out learners’ pronunciation errors. Supported by the corrective feedback system, ASR technology is expected to examine learners’ speech input and pinpoint errors highly similar to those diagnosed by human raters (Chen, 2004; Neri, Cucchiarini, & Strik, 2003). To enhance their speaking competence, learners simply sit in front of the computer and adjust their pronunciation deviations based on the ASR feedback.

Since learners have to rely heavily on the ASR feedback to make adjustments, two issues deserve particular attention: whether the errors highlighted by the ASR system are accurate and clear enough, and whether the recognition result is presented to learners in a meaningful and informative way (Chen, 2012; Luo, 2016). The former has to do with the built-in design of the ASR system. If the system is constructed solely on native speakers’ models, it might be more suitable for users whose L1 is English. If the system incorporates both native and nonnative speakers’ models, it will be more suitable for ESL or EFL users, and the quality of recognition could be better because the variety of language backgrounds has been taken into account. As for the latter, the form of corrective feedback may affect learners. That is, the way corrective feedback is presented, such as spectrograms, waveforms, a list of accurate and inaccurate words, or a score, strongly influences learners’ speaking improvement (Neri et al., 2003). The pronunciation errors produced by learners are often traces of L1 interference, or so-called negative transfer (Neri et al., 2002; Pennington, 1999). The influence of the L1 can become an obstruction that impedes the development of the target language, with pronunciation being the major victim. In fact, negative transfer has been claimed to be the fundamental factor leading to mispronunciation (Liu, 2011). Consequently, it is imperative that the ASR system provide learners with a pertinent diagnosis of their mispronunciation. Otherwise, undetected errors could escape learners’ attention and be stored permanently in their speaking repertoire, leading to the fossilization of those erroneous sounds (Eskenazi, 1999; Neri, Mich, Gerosa & Giuliani, 2008). Therefore, this study addresses both of the above issues.

1.4 Significance of Study

As speaking competence is recognized as an indispensable ability in this globalized era, there is no excuse for not helping our students improve their oral skills, whether at school or after school. The advent and application of ASR-based CALL systems hence plays an ever more important role in helping us attain this objective. However, most ASR software and programs designed for improving speaking are for commercial or academic use. In other words, learners are charged for them or the system is not available outside of the classroom, which might keep away a number of students who cannot afford the price. Fortunately, a nationwide learning website named LearnMode, jointly developed by programmers in the technology industry and people devoted to education, was created to assist students in all sorts of subjects from elementary school to senior high school. The website is free of charge and open to anyone who registers. The website has a section specifically designed for training speaking competence, supported by an automatic speech recognition system that diagnoses students’ utterances. Students who gain access to this learning website are able to enjoy all the conveniences and advantages of ASR technology.

The aim of this study is to examine the effectiveness and impact of the speaking assessment section of the LearnMode website. In addition, the recognition results from the ASR system will be analyzed and compared with the diagnostic reports of human raters. When practicing speaking on the platform, learners are in an individual and private environment, and what they can rely on is the result offered by the ASR system. As a result, the ASR system deserves further inspection to determine the extent to which learners can benefit from its feedback and make progress in pronunciation, and to identify problematic parts to be refined. Besides, learners’ perceptions of the user interface and of the feedback provided by automatic speech recognition are a vital issue to address and will serve as helpful references for future studies. If this study yields positive results, it will bring good news to learners, teachers and researchers who are interested in deploying ASR technology in their language laboratories.

1.5 Definition of Terms

The terms used throughout this research are briefly introduced below.
1) ASR: Automatic Speech Recognition
2) ASA: Automatic Speech Assessment
3) CALL: Computer-Assisted Language Learning
4) CAPT: Computer-Assisted Pronunciation Training

CHAPTER TWO
Literature Review

The current research focuses on the effectiveness of employing an ASR system in pronunciation diagnosis and on learners’ perceptions of such innovative technology as an alternative for evaluating and facilitating their pronunciation performance. Therefore, in this chapter the literature review draws on previous studies and is woven through several threads, from how ASR has developed and how it works and benefits language learners to how users perceive and evaluate its performance. After six sections of detailed exploration, the research questions are proposed at the end of the chapter.

2.1 Automatic Speech Recognition Technology

2.1.1 A Revolutionary Character in CAPT

In the surging wave of technological advancement during the 1980s, computers were expected by teachers and researchers alike to assume a greater role as language learning facilitators. Among the skills targeted by a variety of computer-assisted language learning programs, one is particularly difficult to address: speech and pronunciation (Cordier, 2009). Therefore, with the emergence of automatic speech recognition (ASR) technology, it was hoped that the tool would be able to initiate changes in computer-assisted pronunciation training (CAPT) and transform CAPT into an intelligent and user-adaptive learning companion (Ehsani & Knodt, 1998; Kim, 2006).

In essence, automatic speech recognition is an independent, machine-based process of decoding incoming speech and transcribing it based on a predetermined model and algorithm (Levi & Suvorov, 2012). Of the several kinds of speech engines embedded in ASR programs, the Hidden Markov Model (HMM) has been widely used and shown to be an effective method of handling a large vocabulary through sophisticated statistical computations (Ehsani & Knodt, 1998; Hincks, 2003; Kim, 2006; Levi & Suvorov, 2012). The intricate technical details of HMMs are beyond the scope of the present study, though. In brief, when receiving spoken input, an HMM-based program compares the incoming speech with its built-in model and then determines the degree of their similarity based on probability theory. Recently, many other engines have come into existence, such as the Kaldi Speech Recognition Toolkit, which recent studies recognize as efficient and easy to adapt (Gaida et al., 2014). The rapid development of speech engines is likely to provoke another wave of growth in automatic speech recognition and open a whole new page in the field of CAPT and ASA (Chen & Chen, 2018).
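To make the probability-based comparison described above more concrete, the following is a minimal illustrative sketch, not taken from the thesis or from any particular ASR engine. It uses the forward algorithm to compute how likely a short sequence of observed sound labels is under two hypothetical word models and picks the more likely word. The state names, the observation labels ("sh", "short", "long", "p") and all probabilities are invented for illustration; real systems work on acoustic feature vectors rather than symbolic labels.

```python
def forward_likelihood(obs, start_p, trans_p, emit_p):
    """Return P(obs | model) for a discrete HMM using the forward algorithm."""
    states = list(start_p)
    # Initialisation: probability of starting in each state and emitting obs[0].
    alpha = {s: start_p[s] * emit_p[s].get(obs[0], 0.0) for s in states}
    # Recursion: sum over all predecessor states, then emit the next symbol.
    for symbol in obs[1:]:
        alpha = {
            s: sum(alpha[prev] * trans_p[prev].get(s, 0.0) for prev in states)
            * emit_p[s].get(symbol, 0.0)
            for s in states
        }
    # Termination: total probability over all final states.
    return sum(alpha.values())


# Hypothetical left-to-right topology shared by both toy word models.
START = {"onset": 1.0, "vowel": 0.0, "coda": 0.0}
TRANS = {
    "onset": {"onset": 0.2, "vowel": 0.8},
    "vowel": {"vowel": 0.3, "coda": 0.7},
    "coda": {"coda": 1.0},
}

# The two toy models differ only in how the vowel state emits
# "short" versus "long" vowel frames (invented numbers).
EMIT_SHIP = {
    "onset": {"sh": 1.0},
    "vowel": {"short": 0.9, "long": 0.1},
    "coda": {"p": 1.0},
}
EMIT_SHEEP = {
    "onset": {"sh": 1.0},
    "vowel": {"short": 0.2, "long": 0.8},
    "coda": {"p": 1.0},
}

# A learner's utterance, already reduced to symbolic frame labels.
observed = ["sh", "long", "long", "p"]

scores = {
    "ship": forward_likelihood(observed, START, TRANS, EMIT_SHIP),
    "sheep": forward_likelihood(observed, START, TRANS, EMIT_SHEEP),
}
best_word = max(scores, key=scores.get)
print(scores, best_word)
```

In this toy setting the utterance with two long vowel frames scores higher under the "sheep" model than under the "ship" model, which mirrors, in a very simplified way, how a recognizer decides which stored model an incoming utterance most resembles.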

2.1.2 The Discrepancy Issue between ASR Systems and Human Raters

Typically, the full process by which automatic speech recognition assists learning can be simplified as follows. The acoustic signals uttered by the speaker are received via a microphone and temporarily stored in a recording buffer. To pinpoint the learner’s deviations from the native model, those signals are compared with the built-in model so as to tease apart acceptable and unacceptable utterances. Within seconds, the ASR system generates automatic feedback on local pronunciation problems and on the overall degree of nonnativeness displayed in the learner’s speech (Precoda & Bratt, 2008). Ideally, the moment learners receive the feedback, they can instantly grasp roughly how far they deviate from the model pronunciation and how they can fine-tune their utterances to achieve higher accuracy. Its effectiveness, however, relies heavily on the form in which the feedback is shown; this issue is tackled in a later section.

In addition, whether automatic speech recognition is reliable enough to provide consistent assessment for similar input, and whether it maintains a high degree of agreement with the evaluation of human raters, have always been major concerns for teachers (Ashwell & Elam, 2017; Neri, Cucchiarini & Strik, 2003). Certainly, there is little possibility that humans and computers will render exactly the same evaluation result each time. As Bernstein & Franco (1996) mentioned in their study, “Humans and machines process speech in fundamentally different ways.” Ehsani & Knodt (1998) also specified differences between humans and machines when it comes to recognizing and rating speech:

    Complex cognitive processes account for the human ability to associate acoustic signals with meanings and intentions. For a computer, on the other hand, speech is essentially a series of digital values. However, despite these differences, the core problem of speech recognition is the same for both humans and machines: namely, of finding the best match between a given speech sound and its corresponding word string. Automatic speech recognition technology attempts to simulate and optimize this process computationally. (Ehsani & Knodt, 1998)

Despite the distinct processing mechanisms of the two evaluators, computers and human beings, the goal of correctly diagnosing speech and offering effective feedback on learners’ problematic parts is undoubtedly the same. Therefore, as an assistant to or even a substitute for real instructors in pronunciation teaching and diagnosis, ASR technology should be consistent in the quality of feedback it produces and sensitive enough to decode learners’ digitized speech and locate inappropriate pronunciation.
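The degree of agreement between an ASR system and human raters discussed in this section can be quantified in several ways. The sketch below is one simple possibility and is not necessarily the calculation used in this thesis: it computes the share of words on which both evaluators reach the same verdict, and the overlap between the two sets of words flagged as mispronounced. The function names, the word list and the flagged sets are all hypothetical.

```python
def percent_agreement(words, asr_flagged, human_flagged):
    """Share of words on which ASR and human rater give the same verdict
    (both flag the word as mispronounced, or both accept it)."""
    same = sum((w in asr_flagged) == (w in human_flagged) for w in words)
    return same / len(words)


def flagged_overlap(asr_flagged, human_flagged):
    """Share of all flagged words that were flagged by both evaluators
    (intersection over union of the two sets)."""
    union = asr_flagged | human_flagged
    if not union:
        return 1.0  # nothing flagged by either side: trivially identical
    return len(asr_flagged & human_flagged) / len(union)


# Hypothetical practice unit and hypothetical verdicts.
unit_words = ["ship", "sheep", "thin", "sin", "very",
              "berry", "light", "right", "bad", "bed"]
asr_errors = {"thin", "very", "right", "bed"}
teacher_errors = {"thin", "very", "right", "ship"}

print(percent_agreement(unit_words, asr_errors, teacher_errors))  # 8 of 10 verdicts match
print(flagged_overlap(asr_errors, teacher_errors))                # 3 shared of 5 flagged
```

The two measures answer different questions: the first counts words that both evaluators accepted as agreement, while the second looks only at the words at least one evaluator flagged, so it is the stricter of the two when errors are rare.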

(26) 2.2 Automatic Speech Recognition for Pronunciation Instruction 2.2.1 Limitations in the Past Before the birth of ASR technology, computers have already participated in helping pronunciation training in a limited but ASR-oriented fashion. The pronunciation training at that time presents such limitations as exercise types, the insufficient supply of teachers, the lack of real-time feedback, and so forth (Neri, Cucchiarini & Strik, 2001). In CAPT system without ASR technology, most of the exercises intend to train receptive competence while few strives to urge learners to produce their own utterances. What’s more, learners using some systems are responsible for comparing their own recorded voices with the native utterances completely by themselves. This has been seriously questioned since studies show that learners individually have a hard time interpreting the phonetic discrepancy between their mother tongue and the target language, not to speak of being able to extract useful information of mistakes that need adjustment (Chen, 2012; Luo, 2016). This requirement for learners is too rigorous and professional to bear. On the other hand, for CAPT systems that enlist teachers to act as evaluators for the recorded practice voices, learners encounter similar problem compared to that of regular language classes at school. That is, the teacher-to-student percentage is unfavorable; students are given unequal amount of instruction time while some might 17.

receive feedback a few minutes later than their peers. Unlike real-time or immediate intervention, which is believed to effectively keep learners from repeating errors, delayed feedback that makes learners wait even a short period of time can lessen their motivation to keep doing exercises or contribute to a false impression of certain pronunciations (Eskenazi, 1999).

In light of these problems, ASR technology seems the most promising candidate for creating an optimal environment for pronunciation practice. Each learner is unique; likewise, the errors they produce vary and deserve a one-on-one tutor who listens and attends to every nuance deviating from the model pronunciation. With ASR technology at hand, learners seem to have private tutors who can concentrate on one learner at a time and walk them through the mist of pronunciation exercises (Chen, 2011; Chiu et al., 2007; Eskenazi, 1999). As Eskenazi (1999) and many other researchers have mentioned, language learning appears to be most efficient when the teacher can constantly monitor learners’ progress and employ proper remediation measures. In reality, however, such barriers as the teacher-student ratio, time arrangement and exam pressure make it impossible for teachers to keep track of each student’s progress. Therefore, in the realm of pronunciation training, ASR shoulders the mission of guiding learners at all times without fatigue. The features of

constant availability and patience naturally make the ASR system an adequate learning tool (Eskenazi, 1999).

2.2.2 An Affectionate and Lovely Beginning

Despite its great benefits in language learning, ASR was initially designed for other purposes. In fact, it was not until the late 1970s that ASR took root in the educational field. Behind the very beginning of ASR technology for language learning lies a touching episode. When Destombes at IBM France learned that his daughter was deaf, he set out to invent a system to help her practice speaking intelligibly (Eskenazi, 2009). His ASR software displayed pitch and intensity versus time, and its interface was designed to be game-like. Though there were merely simple pitch displays and individual phone detection at this stage, it marked a giant leap forward. From then on, slowly but steadily, ASR technology has been applied to pronunciation learning for different age groups and integrated into several kinds of speaking activities, which makes pronunciation learning a relatively realistic and engaging process (Elimat & AbuSeileek, 2014).

2.2.3 Deeper Investigation into the Principles for ASR-based CAPT

To develop a favorable ASR-based pronunciation training system, several principles should be observed, as Eskenazi (1999) listed on the basis of Kenworthy’s

(1987) principles of successful pronunciation learning. The five principles are outlined and succinctly discussed in the following:

a) Learners must produce large quantities of sentences on their own. This remains an ideal scenario, since so far most ASR-based programs have not enabled learners to produce their own sentences; instead, learners play a rather passive role while interacting with ASR. They can only practice ready-made sentences or dialogues. Owing to technical limitations, there is little opportunity to practice sentences constructed by users, so their creativity in producing sentences is, in a sense, not encouraged. If highly reliable feedback is expected, the underlying speech recognizer must be able to deal with unexpected utterances. This is thus a niche where researchers and programmers can work.

b) Learners must receive pertinent corrective feedback. The keyword here is “pertinent.” What concerns teachers the most is the quality of error detection and the provision of appropriate feedback. If these two requirements are met, the worries of both teachers and learners are eased.

c) Learners must hear many different native models. It is advised that learners have the opportunity to be exposed to native teachers from different backgrounds (Celce-Murcia & Goodwin, 1991). In this way, learners

will collect more speech data, and their speech repertoire will be expanded. Also, they might feel less threatened when facing English spoken with different accents.

d) Prosody (amplitude, duration and pitch) must be emphasized. Prosody is like the decoration that helps show the emotion of an utterance. Through prosody, including intonation, pitch and so forth, speakers’ intended meaning can be fully expressed. Therefore, Chun (1998) holds that prosody should be taught at the beginning of language learning.

e) Learners should feel at ease in the language learning situation. This issue has already been mentioned in the previous sections, but Eskenazi (1999) raised two other points. One is that if the amount of interruption, namely timely feedback, can be adapted to the extent that learners can tolerate, the discouraging effects of correction can be avoided. The other is that “false negatives” should be prevented or minimized, which has long been an unsolved challenge in ASR-based pronunciation training programs (Eskenazi, 2009). Sometimes learners speak correctly but are judged to be wrong, and in turn their passion for constant practice is extinguished. In fact, studies have shown that learners who are told they are wrong when they are right feel strong negative emotions. By contrast, learners who are told they are right when they are in fact wrong (i.e., false positives) feel fewer negative emotions. It is thus important to

deter false negatives from taking place frequently.

In general, the first four principles describe the physical environment of the learning system, while the last portrays learners’ mental environment, such as their emotional state or capacity for taking in information. Once the outer environment blends harmoniously with the inner one, it may foster a far more amiable atmosphere for speaking practice, one that breeds successful pronunciation advancement. Thanks to ASR technologies, what teachers could hardly attend to previously can now be carefully treated.

2.2.4 Diving into Procedures in ASR-based CAPT

After the principles that ASR-based pronunciation learning should ideally abide by have been reviewed, its mechanisms equally deserve a closer examination. This section covers how automatic speech recognition works from the moment it receives learners’ oral production to the instant it returns feedback with detailed evaluation marks. As Neri, Cucchiarini and Strik (2003) proposed, there is an ideal sequence of five phases for pronunciation training programs powered by automatic speech technologies: 1) speech recognition, 2) scoring, 3) error detection, 4) error diagnosis and 5) feedback presentation, each described below:

1) Speech recognition: This is the first and most significant step. The ASR engine decodes the incoming speech data and translates them into a string of words based on their phonetic and syntactic features; the subsequent steps depend heavily on the accuracy of this step. As Neri et al. (2003) stated, the pedagogical contribution of ASR-based CAPT lies precisely in evaluating learners’ pronunciation quality. Hence, whether the ASR engine is reliable is a core question.

2) Scoring: In this step, learners are informed of the overall quality of their spoken utterances in numerical form. By comparing learners’ spoken output with reference native data, the ASR system performs an immediate evaluation and reports a definite score. The closer the learners’ utterances are to those of the native models, the higher the score. Through the score, learners can obtain a basic understanding of their overall pronunciation quality.

3) Error detection: In this phase, learners’ oral production is screened by the ASR system, and the problematic utterances are singled out and highlighted. By locating specific sound errors within a word, the system draws learners’ attention to the erroneous area that needs more practice. It is believed that learners’ awareness is particularly raised if errors are accurately detected.

4) Error diagnosis: Providing learners with their erroneous pronunciation parts alone

is not enough; it is like offering students fishing rods without teaching them how to fish. After learners’ awareness has been raised, a well-developed ASR system would identify the specific type of error made by the student and, moreover, suggest a way to improve. This mechanism is essential in that it is difficult for learners to identify the exact nature of their errors by merely resorting to their background knowledge.

5) Feedback presentation: This is the final phase, which summarizes the information obtained and displayed in the previous phases. Whether the overall score is presented as a graded bar or as a number on a given scale, it should be interpretable to every learner. The design therefore matters, since learners can make full use of the information supplied by the ASR system only if it is shown in a meaningful fashion.

2.3 ASR-based English Learning Programs and Their Exercise Design

There are various kinds of pronunciation training programs supported by ASR engines, such as TRACI Talk, Candle Talk, Tell Me More, MyET and many other emerging language learning programs that focus on developing speaking competence in different languages. With the advance of technology, most of them can be readily accessed through personal computers or mobile devices. With reference to the content

design of them, learners will find oral exercises ranging from rudimentary vowel/consonant pronunciation practice and daily conversation activities through role-play of different genres to a task-based crime scenario in which they look into a mystery as a detective. By exposing themselves to the ASR-enhanced environment, learners each start their journey toward enhanced oral skills.

Chen (2001) conducted a detailed review of five commercial ASR software programs: CNN Interactive English, Syracuse English Comprehensive Learning Series, TeLL Me More Pro, TRACI Talk and Encarta Interactive English Learning. Even though these programs do not quite fulfill the ideal learning conditions proposed by Carol Chapelle (1998), Chen found that their far-reaching influence should not be underestimated, since they encourage learners to generate far more spoken output than they usually do in the regular lecture classroom setting.

As for the learning content and exercise design of these five programs, the material of CNN Interactive English is taken from a CBS TV comedy show, Caroline in the City. Learners can choose a particular role and say the lines out loud, and the ASR engine then processes their speech. The Syracuse English Comprehensive Learning Series offers similar activities, such as role-play along with videos. Another type of activity is choosing the best response by saying it out loud: on the screen, there are three options in response to the question raised by a virtual partner. As for TeLL Me More Pro, two

types of activities are included. One is to repeat selected words and sentences, a kind of training that lays the building blocks of pronunciation. The other is to choose the most appropriate response to a question by saying it out loud. This kind of practice on the one hand evaluates students’ speaking accuracy and on the other tests their background knowledge of how to respond appropriately in a conversation.

The fourth program is TRACI Talk, an acronym for Teacher Ranging Across the Computer Interface, which suggests a hidden teacher within the computer who is ready to assist at any moment. Playing the role of a sleuth, learners are involved in a string of task-based conversations in order to gather the necessary clues from four suspects and unravel the mystery. In the process, all they have to do is keep speaking, whether making inquiries or answering questions clearly, which naturally immerses them in a gripping English-speaking environment. When players feel confident enough to identify the culprit, they can advance to the next stage and answer eight questions randomly chosen from a pool of 60. If they succeed in the eight challenges, they may have the chance to collect more favorable evidence. If they fail, there will be more opportunities to engage in additional oral practice. In fact, the virtual partners they converse with are their teachers in the game. Also, when learners get lost and want instructions for the next step, they can say “Traci” into the microphone, and a box showing the task content will appear. TRACI Talk combines speaking

improvement with game elements, making oral practice enjoyable.

The fifth is Encarta Interactive English Learning, which combines video, real-time 3D animation and ASR engines. This software, which aims at enhancing listening, speaking and grammar ability and enlarging vocabulary size, comprises 10 units in total. Initially, learners need to create their own profiles, which help keep track of the progress they make. Then, over 360 activities and 10 formative assessments, called Virtual Challenges, await them. Oral output is elicited and examined when learners participate in the speaking activities and the Virtual Challenges. In the former, they listen to a native speaker model and then imitate it by playing the role in a video. In the latter, learners are placed in a 3D virtual space, where their tasks include answering questions, finding characters, and other activities that challenge them to keep speaking.

However, Chen (2011) also pointed out that learners might not be able to enjoy this convenience outside the language laboratory or classroom, since the aforementioned five ASR programs are commercial products. The price tag, such as US$250 for each Tell Me More software program, may be too hefty for learners to afford.

In another study, Chen (2011) further reviewed several academic ASR-based English learning programs such as the Subarashii program, Virtual Conversations, the Voice Interactive Language Training System (VILTS) and FLUENCY. In

reality, many researchers and universities have developed their own speech recognition software programs in an attempt to probe the interaction between ASR engines and learners’ speaking improvement. For instance, a web-based conversation program, CandleTalk, was built by scholars at a Taiwanese university to provide an environment for situational dialogues, primarily targeting six speech acts (Chiu, Liou & Yeh, 2009). These speech acts, which include greeting, parting, requesting, complaining, apologizing and complimenting, are the central topics of the four units. Each unit in CandleTalk is designed with a pre-patterned route, which automatically guides the learner through the dialogues. Several options are shown on the screen, and the option that learners select by speaking into the microphone leads them down a different path of dialogue. After making their choice, they need to decide on an appropriate length of recording time, press the recording button and finally speak. As learners proceed by starting or responding to the given conversation, the plot with its pre-patterned route unfolds as if learners were being answered. As the plot develops, learners ideally familiarize themselves with the usage of the speech acts, and at the end of every unit, a review section serves as a supplement and a prompt to raise their awareness of the key learning content.

Another web-based speech technology program developed by an academic institution is National Taiwan Normal University (NTNU) ASR, which is based on

Google Speech API. In total, there are six kinds of exercises, the first three of which are suited to novices: (1) name the object in the picture, (2) listen and repeat on-screen sentences, and (3) listen and respond with the best reply out of four given options. The fourth to sixth exercises are far more challenging and thus tailored to intermediate-level learners. In the fourth, learners are required to role-play different characters in a dialogue in order to obtain the final score. The fifth type incorporates multimedia elements into the exercise: learners must watch a Flash animation and practice every line in the conversation, which makes the self-learning process more engaging. In the sixth type of exercise, “Write and Speak,” learners are granted a considerable degree of freedom, since they can practice speaking whatever they write. Once they type in or paste sentences, the system generates an exercise. Users are therefore able to take control and pave an individualized road toward their current needs. Unlike many other types of activities, this function benefits learners who want to pursue personal goals. Overall, this web-based ASR program provides learners with a comprehensive environment, with exercises varying from fixed drilling tasks to flexible, self-produced practice, and it illustrates the many possibilities that ASR programs can strive for.

Aside from Candle Talk and NTNU ASR, MyET (My English Tutor), one of the web-based speech recognition software programs used by over two hundred million

registered members, was also developed by a team of ASR experts in Taiwan and is available for free online download (albeit with limited learning content). MyET is intended to serve as an individual tutor that helps students enhance their English ability by polishing their oral skills. Implementing the technology of the “Automatic Speech Analysis System,” MyET can analyze learners’ utterances in fine detail. After listening to the native speaker models, learners are required to imitate and practice speaking sentence by sentence. The accents of the native models vary, so learners can choose their preferred accent. Even better, MyET is known for its web-based feature, which allows worldwide online learning communities to be established. Learners engaging in the program have the chance to compete with contestants in different scenes within a virtual community. The connection to people who are also putting effort into improving their oral skills may serve as an incentive for more involvement in learning.

The ASR-based oral skill practice website Cool English was developed with funding from the Ministry of Education, Taiwan. The speaking section of this website adopts Kaldi, an open-source toolkit for constructing speech recognition systems. There are in total eight separate sections for learners to sharpen their oral communication skills. On the basis of the modality of their learning content and exercise types, these eight sections can be grouped into four major categories: (1)

choose a proper response, (2) listen and repeat the sentence, (3) role-play with the video and (4) practice and play with what you produce.

The first category is commonly seen in most commercial ASR software programs: learners first read or listen to a sentence and then choose a proper answer from three on-screen options. Various topics, such as traveling, hobbies, shopping and transportation, are provided for learners to polish their oral abilities.

The second category of exercise is listen and repeat the sentence, and it has three sub-categories classified by learning content. 英語 Fun 城市 (Ying-Yu Fun Cheng-Shih; Let’s have fun in speaking English) and 英語魔法學園 (Ying-Yu Mo-Fa Hsueh-Yuan; English magic school) have 60 topics and 10 main topics respectively, and both focus on useful expressions in daily conversation. In 英式武功短語先鋒 (Ying-Shih Wu-Kung Tuan-Yu Hsien-Feng; kung-fu for English short phrases), the speaking exercises, ranging from 3-word to 7-word phrases, are extracted from corpus data based on their frequency of daily use. Last but not least, the section 我的絕對「英」感 (Wo-De Jiue-Duei Ying-Gan; My absolute judgement for English pronunciation) lays emphasis on pairs of vowels and consonants that native speakers of Chinese often confuse. To help learners become aware of the eleven sets of target confusable sounds, at least 10 related sentences are provided in each exercise.

The third category combines videos with automatic speech recognition systems.

In the 看動畫學英文 (Kan Tung-Hua Hsueh Ying-Wen; Watch animations and learn English) section, there are animations based on the content of previous versions of junior high school textbooks, with eight volumes and 92 lessons in total. In the section 「聲」歷其境 (Sheng Li Chi Ching; Watch VOA videos and learn English), on the other hand, there are twenty videos taken from Voice of America, a well-known American broadcaster. Through these videos, learners are exposed to authentic materials and accents and immerse themselves in more native-like surroundings. After watching the videos, learners are asked to take turns playing each role in the video. Sentence by sentence, they familiarize themselves with useful situational expressions.

The fourth category is relatively uncommon in ASR programs. It incorporates the idea of learner autonomy into speech technologies. In the section named 口說自助吧 (Kou-Shuo Tzu-Chu Ba; Let’s practice your self-made sentences), self-paced learning is emphasized in that learners are given the right to create their own oral skill training materials. After they type in their target sentences, the ASR system automatically produces an exercise. Learners can first listen to the pronunciation and then try saying the sentence out loud into the microphone.

The content and exercise designs of all the ASR programs mentioned, along with their feedback and correction types, are organized in Table 1 below.

1. CNN Interactive English
Content & Exercise Design: Role-play activities (choose a role and read the dialogue).
Feedback & Correction Type: 1) Visual voice spectrum comparison without interpretation. 2) Try again if the output isn’t accepted. 3) Adjustable sensitivity of the ASR engine.
ASR Engine: Microsoft Speech Recognition Engine

2. Syracuse English Comprehensive Learning Series
Content & Exercise Design: 1) Video role-play activities. 2) Choose and say the best response out loud.
Feedback & Correction Type: 1) Learners are asked to “try again” if their pronunciation and intonation are not accepted. 2) The virtual conversation partner shows different expressions to denote the appropriateness of learners’ output.
ASR Engine: IBM Speech Recognition Engine

3. TeLL Me More Pro
Content & Exercise Design: 1) Choose and say the best response out loud. 2) Listen and reproduce the selected word/sentence.
Feedback & Correction Type: 1) The acceptable answer is highlighted; learners must try again if the response is not understandable. 2) The elaborate voice graph and scores are presented without interpretation. 3) Adjustable sensitivity of the recognition system.
ASR Engine: (none listed)

4. TRACI Talk
Content & Exercise Design: Learners are immersed in a crime-solving game, conducting the investigation by asking and answering questions.
Feedback & Correction Type: The corrective feedback takes the form of “Sorry, I don’t catch that” rather than specific pronunciation and intonation guidance.
ASR Engine: IBM VoiceType speech recognition

5. Encarta Interactive English Learning
Content & Exercise Design: 1) Listen to a native speaker model and imitate (video role-play activities). 2) Virtual challenge: learners are challenged to answer questions, talk with characters, find objects, etc., with no on-screen hints.
Feedback & Correction Type: 1) There is no feedback in this part. 2) The judgement of the ASR engine is loose, for most input is acceptable. The ASR engine has to deal with impromptu utterances, which might reduce its accuracy rate. 3) Intervention mechanism: the program automatically replays the video segments when there are more than two to three improper/irrelevant sentences.
ASR Engine: Microsoft Speech Recognition Engine

6. Candle Talk
Content & Exercise Design: Situational dialogues with the six speech acts as foci. Learners are placed in a pre-patterned conversation environment so as to familiarize themselves with the six speech acts.
Feedback & Correction Type: There are 3 criteria to indicate learners’ performance in each unit. (1) “Good Ending” shows the unit is successfully finished. (2) “Bad Ending” means learners have made some mistakes choosing wrong responses during the conversation. (3) “Unknown Ending” suggests the unit is not fully finished.
ASR Engine: Hidden Markov models (HMMs)

7. NTNU ASR
Content & Exercise Design: There are six types of exercises: (1) identify the object, (2) listen and repeat, (3) choose a best response, (4) role-play, (5) conversation in animation, (6) write and speak.
Feedback & Correction Type: There is a pass remark with loud applause if the production is acceptable, while “no recognition” indicates unacceptable spoken input.
ASR Engine: Google Speech API

8. My ET
Content & Exercise Design: Listen to the native models and repeat/record sentence by sentence.
Feedback & Correction Type: There is an average score and 4 other separate marking standards: pronunciation, intonation, fluency and volume.
ASR Engine: (none listed)

9. Cool English website ASR
Content & Exercise Design: (1) choose a proper response, (2) listen and repeat the sentence, (3) role-play with the video and (4) practice and play with what you produce.
Feedback & Correction Type: 1) discrete & overall score, 2) color bar based on overall score, 3) comment (passing remark), 4) color highlight on erroneous word, 5) a round of applause.
ASR Engine: Kaldi, an open-source toolkit for speech recognition written in C++.

Table 1. The exercise design and feedback type of the discussed ASR programs.

2.4 Different Types of Feedback in ASR Technologies

In every speech recognition program, the way corrective feedback is offered is a universal concern for users. Whether the feedback can clearly locate exact pronunciation errors and how it helps specify the correct way of articulation are both significant. The feedback types and related functions of the aforementioned ASR programs are discussed in the following paragraphs.

In CNN Interactive English, learners are allowed to record their utterances so as to compare and contrast them with the native speaker models. The effectiveness of this kind of visual feedback in improving pronunciation, though, is under debate, since it is hard for learners alone, without any professional background in voice analysis, to extract meaningful messages from it. As for the role-play section in Syracuse English Comprehensive Learning Series, the feedback is either “acceptable” or “try again” to indicate the accuracy of the processed speech. In the dialogue part, if the answer is accepted, the virtual partner smiles and continues the conversation; otherwise, the partner flashes a puzzled expression and the learner needs to try again in order to proceed. Next, the dialogue mode in TeLL Me More Pro works similarly: learners are told to “try again” if their response is unacceptable, while the correct response is highlighted. In addition, in the pronunciation practice, an intricate voice spectrum of the native model is aligned with that of the learner. However, the pedagogical implication

of the spectrum will not be intelligible unless the learner is explicitly instructed. In response to improper utterances, TRACI Talk prompts the learner with statements like “Pardon, can you say that again?” or “Sorry, I do not catch that, could you say again?” to encourage more attempts. Learners need to try harder to make themselves understood in English. If the spoken content is adequate, the story continues to develop, which may make learners feel that their utterances have been responded to. In the Virtual Challenge sessions of Encarta Interactive English Learning, learners are asked to speak spontaneously without any on-screen hint, which poses a real challenge to learners and speech recognizers alike. Also, when more than three improper sentences are uttered, the program’s intervention mechanism automatically replays the related video segments in order to refresh learners’ memory of what they have just learned.

As far as Candle Talk goes, the feedback is presented in an implicit way. Three outcomes indicate learners’ performance in each unit: the words “Good Ending” are shown if the unit is successfully finished, “Bad Ending” means learners have made mistakes by choosing wrong responses during the conversation, and “Unknown Ending” suggests the unit is not fully finished. While researchers like Eskenazi (1999) argue that the value of the feedback created by the speech recognition system lies in fine-tuning learners’ pronunciation, the purpose of Candle Talk is to steep advanced learners in an authentic

environment, in which extra interference like explicit pop-up correction might break the communication. Without specific correction of the input, feedback of this kind is intended to stress communicative competence rather than tiny deviations in spoken production.

The program MyET takes a different road by providing learners with detailed corrective feedback on different aspects. There is an average score along with four separate marking criteria: pronunciation, intonation, fluency and volume. On the basis of these scores, learners can tell where they should put in more effort. The feedback in NTNU ASR is quite straightforward: a pass remark with loud applause is given if the production is acceptable, whereas “no recognition” indicates unacceptable spoken input.

Last but not least, the feedback offered by the Cool English website ASR is diverse. In some sections, there are an overall score and discrete scores for every individual word. In addition, a color bar and comments are shown in most sections at the same time. Finally, a word with a major mistake is highlighted in red, and words with minor mistakes are highlighted in orange. However, it would be better if the ASR system could show learners how to improve on those erroneous words.
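The red and orange highlighting described for Cool English can be pictured as a simple thresholding of per-word scores. The following is a minimal sketch only: the source does not report how the website decides between major and minor mistakes, so the cut-offs (`MAJOR_BELOW`, `MINOR_BELOW`), the assumed 0-100 score scale, the use of a plain mean for the overall score, and the function names are all hypothetical.

```python
from statistics import mean

# Hypothetical cut-offs on an assumed 0-100 per-word score scale.
# The source does not report the values actually used by Cool English.
MAJOR_BELOW = 60   # scores below this are treated as major mistakes
MINOR_BELOW = 80   # scores from MAJOR_BELOW up to this are minor mistakes


def highlight_color(score):
    """Return the highlight color for one word score, or None if no highlight."""
    if score < MAJOR_BELOW:
        return "red"
    if score < MINOR_BELOW:
        return "orange"
    return None


def word_feedback(word_scores):
    """Map (word, score) pairs to (word, color) pairs plus an overall score."""
    colored = [(word, highlight_color(score)) for word, score in word_scores]
    # Assumption: the overall score is the plain mean of the word scores.
    overall = mean(score for _, score in word_scores)
    return colored, overall


if __name__ == "__main__":
    scores = [("the", 95), ("ship", 45), ("sails", 72)]
    colored, overall = word_feedback(scores)
    print(colored)
    print(overall)
```

A sketch like this also makes the limitation noted above concrete: the output tells the learner which words are problematic, but nothing in it says how to correct them.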

2.4.1 Summary of the Feedback Types in ASR Programs

Overall, the feedback applied in automatic speech recognition programs can be classified into two types, explicit and implicit feedback, with the adjustable sensitivity of the speech recognizer acting as a variable. As Chen (2001) pointed out, some commercial ASR systems (CNN Interactive, TellMeMore, and Microsoft Encarta) actually allowed learners to adjust the sensitivity settings, i.e., the standard of acceptance. This may significantly influence the elicited result and compromise the validity of the ASR system. Researchers and teachers alike need to pay more attention to this factor.

If the adjustable sensitivity of the recognizer is put aside, corrective feedback can be sorted into explicit and implicit categories. The former includes several subtypes. First, numerical feedback, such as scores on discrete words, on overall performance and on different aspects like intonation, is quite intuitive to learners, for it gives them a rough idea of how well they perform. However, it would be better if supplementary information were provided to help fix the errors. Second, some ASR programs offer textual feedback that presents specific instructions on how to adjust one’s pronunciation. This may benefit learners if they can grasp the key information needed to make progress. The last subtype is visual feedback, such as a speech spectrum or the highlighting of erroneous words. In fact, since these two kinds of feedback are too simplistic because of

the lack of accurate figures, they arguably belong to the implicit type. If there are no guided instructions on how to interpret the result, learners cannot transfer this experience to new practice materials.

Implicit feedback is categorized into four types in this study: textual, visual, audio and interactive. Comments like “Please try again” or “Pardon, I don’t catch that” belong to the textual type. These may encourage learners to make more attempts, but no instruction is given on how to avoid making similar mistakes. Different emoticons, categorized as visual feedback, tell learners whether they should try again or not, which may suit young learners who are unable to read complicated corrective feedback instructions. A speech spectrum without interpretation, however, is difficult for both learners and teachers to decipher if they are not armed with sufficient linguistic knowledge. Thirdly, audio feedback like a round of applause simply serves as a kind of encouragement and renders no real corrective benefit. The last subtype is interactive feedback. If there are no serious errors in learners’ utterances, the story keeps developing. Even though the speech recognizers do not specify the errors made by learners, the plot development to some extent gives learners an impetus to keep on trying. An organization of the explicit and implicit feedback is presented in Table 2.

Variable: adjustable sensitivity of the speech recognizer

Explicit Feedback
  Feedback Type
    Numerical: (1) scores on discrete words / overall performance; (2) scores on different aspects
    Textual: (3) specific instructions on how to adjust to pronounce correctly
    Visual: (4) speech spectrum; (5) highlight of erroneous words
  Strengths and Weaknesses
    1) Discrete-word and other scores give speakers a rough idea of how well they perform on each word and sentence, but it would be better if guidance on how to correct errors were provided.
    2) It is advisable that the speech spectrum and the highlight of erroneous words be accompanied by specific and intelligible instructions.

Implicit Feedback
  Feedback Type
    Textual: (1) comments like “Try again.” “Pardon, I don’t catch that.”
    Visual: (2) different emoticons; (3) speech spectrum without any interpretation/instruction
    Audio: (4) a round of applause for passing
    Interactive: (5) plot development with the learners’ correct oral input
  Strengths and Weaknesses
    In general, the weakness of these types of implicit feedback is that they cannot show learners how to minimize their errors or how to adjust their current pronunciation fully to the accurate one. However, if learners themselves cannot interpret difficult explicit instructions, then these feedback types are good enough to propel them to keep trying.

Table 2. The categorization of feedback types in ASR technologies.

2.5 Automatic Speech Scoring and Its Accuracy

To assist learners in their pronunciation drills, it is advisable to provide immediate corrective feedback targeting their problematic areas, together with scoring, to let them know the quality of their oral production. One major concern, however, is that the process of scoring is considerably labor intensive (Ashwell & Elam, 2017). Therefore, it would be beneficial if automated speech assessment could be

substituted for manual scoring. To tackle this issue, Ashwell & Elam (2017) investigated the automated scoring of elicited imitation (EI) tests. Working with the Google Web Speech API, they examined its speech recognition ability with regard to the oral production of both native speakers and Japanese L2 learners. In addition, human raters were recruited to judge and score the non-native speakers' production so as to examine whether the missing words recognized by the ASR system accorded with the mispronounced words identified by human raters. This examination is important since the result can show whether there is a discrepancy between human and machine, as well as how effective ASR is in scoring. After having 44 non-native speakers read the same 13 elicited imitation items, Ashwell & Elam (2017) found that the word evaluated as most frequently mispronounced in each item was not always consistent with the word most often missing in the output of the Google Web Speech API. Take the word “Sue,” for example. The proper noun “Sue” accounted for 55% of all missing words judged by the ASR system, while it represented only 19% of the total mispronounced words perceived by human raters. Overall, only six of the 13 items showed consistent scores between human raters and the ASR system, which inevitably raised doubts about whether it was proper to apply ASR to learners’ oral production. The authors, though, suggested optimistically that the system does not need to work perfectly

for every single input. Those who want to employ ASR for pedagogical purposes should instead adapt to the ASR’s strengths by restricting the input in certain ways. Learners would accordingly be judged fairly on their authentic ability, without problematic words influencing the results of their performances.

2.6 Learners’ Perceptions of Using Automatic Speech Technologies

With ASR-enhanced technologies for pronunciation drills flourishing, it is important not only to measure learners’ improvement from using this cutting-edge tool but also to attend to their feelings, or more exactly, their perceptions of and attitudes toward it. According to a wide range of previous studies, EFL learners as a whole exhibited positive attitudes toward practicing speaking with ASR. Having developed an ASR-based website (NTNU ASR) on his own, Chen (2011) invited college freshmen and pre-service teachers to use the website for a ten-week period and for two hours, respectively. Both parties were required to submit a questionnaire or an evaluation report. The survey showed that most students found the website beneficial for their speaking and listening skills, and the pre-service teachers found that the ASR-enhanced website could establish a low-anxiety environment for students to practice in. Similarly, to better understand EFL students’ perceptions of the ASR-enhanced website Cool English, Chen (2017) invited 30 junior high learners to use the website over a four-week period. The collected survey indicated that most students expressed their preference for the rich

content and for benefits, such as speaking skills enhancement, that they might gain from it. Wang & Young (2015) conducted a study that adopted different feedback approaches and strategies to enhance learners’ articulation. Learners in the study stated that using the ASR-based technology increased their opportunities to speak English and that they really enjoyed speaking in the process. Besides offering enjoyable and safe learning experiences to users, some ASR-based CAPT programs even provide interactive activities that engage learners in tasks, making practice more inviting and authentic (Alsatuey, 2011). In addition to desktop computers, the ubiquity of mobile devices also presents an alternative for speaking practice. In a study by Ahn & Lee (2016), a total of 302 students used the mobile application Speaking English 60 Junior for two weeks. Afterwards, a questionnaire was administered to solicit learners’ perceptions of its design, convenience and efficacy. The questionnaire results showed that students generally held positive attitudes after experimenting with the ASR application, with 54% of them finding it convenient to use and 57% of them satisfied with its helpfulness in practicing speaking.

2.7 The Summary of Literature Review

With easy access to programs enhanced by automatic speech recognition technologies, EFL/ESL learners are offered an alternative way of training.