This chapter describes the methodology of the research.
It first presents an overview of the study design, the participants, the testing materials, the functionalities of the ASR-supported website, and the data collection procedure. The results of the data analysis, including the assessment outcomes and the user survey, are reported in the following chapters.
The present research explores an automatic speech recognizer that is free for educational purposes. The study consists of two parts: the first examines and compares the mispronounced words identified by the ASR-based system and by human raters, and the second investigates learners’ perceptions and responses after interacting with LearnMode, an ASR-based website for practicing and assessing spoken English.
3.1 Participants
To obtain a more detailed understanding of students’ pronunciation difficulties and their perceptions of improving oral abilities with the assistance of an ASR-based engine, the present study recruits 120 participants from Taipei Municipal Yong-chun Senior High School. (However, only 66 students complete every task, so the others are excluded from the study.) These students are all second-year senior high school students whose English grades in the CAP (Comprehensive Assessment Program for Junior High School Students) range from B to A+. The participants are informed that they are taking part in a project that assesses and diagnoses their current English speaking competence, with a particular focus on pronunciation. The researcher guides three classes of students to register and log onto the LearnMode website and asks them to try one exercise to ensure that all of them know how to submit assignments at home. The students are required to read the assigned sentences and dialogues into the microphone in their usual manner and at a normal speed. It is worth noting that students are allowed to submit their spoken production only once.
The purpose of this one-shot attempt is to avoid a practice effect, so that the results reflect the students’ original way of pronouncing these words.
While the participants read the assigned contents aloud, their voices are simultaneously recorded and saved for later analysis. They can also listen to their own recordings as many times as they wish if they want to know what their pronunciation sounds like. After finishing each unit of the speaking exercise, they submit it and wait briefly for data processing and recognition; they then receive feedback on their overall performance and a detailed account of the problematic words detected by the ASR system. If time permits, the teacher in charge can also give comments and scores to supplement what the ASR system does not offer. After students complete all of the exercises, a questionnaire is administered to investigate their perceptions of using the ASR-enhanced website to improve their oral abilities.
3.2 Procedure of Study
The procedure is straightforward, since the present study explores three issues. First, the study investigates which words the ASR system identifies as problematic and what phonetic features they share. Second, it examines whether the problematic words located by the ASR system are consistent with those judged by human raters. Third, it collects students’ accounts of how they perceive and reflect on their interactions with the ASR website. Accordingly, the procedure consists of orientation, treatment, data collection and interviews, and finally data analysis.
After the orientation session, the speaking practices are assigned as take-home tasks, since a quiet environment free of interference is necessary. The teacher also distributes a feedback form to every student to help them keep a record of their practice. On the form, students note down the date, the time they spend on each unit, the feedback from the ASR system, the words they expect to stumble over, and the actual problems reported by the speech recognizer (see Appendix B). The first three items serve merely as a basic information check, whereas the last two probe participants’ self-expected performance and post-treatment findings. In this way, the instructor hopes the feedback form will remind students to complete the tasks and, at the same time, raise their awareness of the words they find difficult. Students then log onto the LearnMode website on either a computer or a smartphone, provided the device has a microphone.
There are 20 practice units in total, each of which focuses on pairs of vowels or consonants that are commonly confused by EFL learners (Li, 2013). For each unit, students are permitted to submit their oral production only once, so as to avoid a practice effect and keep the results true to their everyday performance. When all of the practices are done, the researcher and another human rater analyze the collected data. In addition, students are invited to fill out the questionnaire and talk about their observations and perceptions during the speaking practice sessions. The procedure of this study is presented in the flowchart below.
3.3 Instruments
3.3.1 The Oral Practice Task: 10 pairs of sounds difficult for EFL learners
In the study, participants log onto the LearnMode website and work on twenty separate tasks consisting of sentences or dialogues that contain vowels or consonants difficult for Mandarin-speaking EFL learners. Ten pairs of difficult sounds are examined in the research: six pairs of vowels and four pairs of consonants. Each unit contains around five sentences plus one or two dialogues, in which the words learners tend to mispronounce are explored. The full sentences and dialogues are provided in Appendix A.
Vowels
[i] and [ɪ]
[e] and []
[ɛ] and [æ]
[u] and [ʊ]
[ɔ] and [o]
[æ] and [a]
Consonants
[] and [ð]
[ʧ] and [ʤ]
[l] and [r]
[s] and []
3.3.2 LearnMode, a free learning website benefiting both teachers and students
The main tool used in the present study is the speaking section of the website LearnMode 學習吧 (Hsueh-Shi Ba). The speaking section was developed by Taiwanese programmers on the basis of a Google API and features automatic speech recognition and AI technologies. The website is free of charge, and anyone can access it after creating an account and signing in. It is highly convenient for teachers because they can offer their own online courses and assign related tasks according to individual pedagogical needs. Aside from supporting learner-tailored hands-on materials, the website incorporates several practical and user-friendly functions. For example, textual learning materials can be converted directly into sound files online, and learners’ oral output can be recorded and stored on the website for later use. The former greatly lessens teachers’ burden of producing electronic audio files of the texts themselves, while the latter benefits both students and teachers, who can return to the recordings repeatedly when they want to improve a particular erroneous pronunciation. Besides, the instructor can designate a limited span
of making a new speaking exercise is shown in the screenshot (see Figure 2).
Figure 2. The interface of creating a new speaking exercise on LearnMode
The most important feature of the speaking section is automatic speech assessment (ASA), which uses an automatic speech recognition (ASR) system to evaluate the incoming utterances. After students finish saying the sentences, the recordings are packed into a batch and immediately sent to Google’s ASR system in the U.S. for processing. A few minutes later, the results, including the score and comments, are returned to users in Taiwan. The current ASR system presents two kinds of feedback (color highlights and comments with a color light) and two indexes (correctness rate and fluency). With color highlighting, mispronounced or omitted words are highlighted in red. If learners insert additional words, the system shows the insertions and highlights them in orange. Learners can thus identify their weaknesses and practice the problematic parts. A comment is also provided to offer encouragement and suggestions for improvement. One example is shown in Figure 3.
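The red and orange highlighting described here can be illustrated with a small word-alignment sketch. This is not LearnMode’s actual implementation, which is not documented in this study; it assumes that the prompt and the recognized transcript are compared word by word, and the function names `tokenize` and `classify_words` are hypothetical.

```python
from difflib import SequenceMatcher


def tokenize(text):
    """Lowercase the text and strip common punctuation from each word."""
    words = (w.strip(".,!?;:\"'") for w in text.lower().split())
    return [w for w in words if w]


def classify_words(prompt, recognized):
    """Return (red, orange) word lists.

    red:    prompt words that were not matched (mispronounced or omitted)
    orange: extra words found only in the recognized transcript (insertions)
    Assumption: a simple word-level alignment stands in for the real system.
    """
    ref = tokenize(prompt)
    hyp = tokenize(recognized)
    red, orange = [], []
    matcher = SequenceMatcher(a=ref, b=hyp, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("replace", "delete"):
            red.extend(ref[i1:i2])      # prompt words with no match
        elif tag == "insert":
            orange.extend(hyp[j1:j2])   # words the learner added
    return red, orange


red, orange = classify_words("She sells sea shells.", "she sells the sea cells")
print(red, orange)
```

In this hypothetical example, "shells" would be marked red because the recognizer heard a different word, and "the" would be marked orange because it does not appear in the prompt.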
The correctness rate is calculated with a preset formula: the number of correctly pronounced words divided by the total number of words in the text. Fluency is also derived from a formula: the number of correctly pronounced words divided by the time the student spends reading the text. The fluency scores, however, fall outside the scope of the current study. These two scores are shown only on the instructor side; on the basis of the scores given by the ASR, instructors can listen to the recordings again and make their own judgment. In fact, instructors have the authority to grade students’ performance and add their own comments. Even more helpfully, teachers can build their own “comment bank” of commonly made errors and feedback and complete the grading in a single click.
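The two indexes can be expressed as a minimal sketch that follows the verbal definitions given in this section. The unit of time (seconds) and the absence of any rescaling to a percentage are assumptions, since the exact presentation is given only in Figure 4; the function names are illustrative.

```python
def correctness_rate(correct_words, total_words):
    """Correctly pronounced words divided by all words in the text."""
    if total_words <= 0:
        raise ValueError("total_words must be positive")
    return correct_words / total_words


def fluency(correct_words, reading_seconds):
    """Correctly pronounced words divided by reading time.

    Assumption: time is measured in seconds, giving words per second.
    """
    if reading_seconds <= 0:
        raise ValueError("reading_seconds must be positive")
    return correct_words / reading_seconds


# Hypothetical case: 18 of 20 words recognized as correct, read in 12 seconds.
print(correctness_rate(18, 20))  # 0.9
print(fluency(18, 12.0))         # 1.5
```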
Figure 3. Color highlight and comments as feedback shown on the learner side.
On the whole, LearnMode adopts a mixed approach to help users locate problematic words by incorporating subjective human rater opinions into the results of automatic speech recognition. In this way, students receive feedback not only from the virtual ASR tutor but also from their own teacher, which both takes advantage of the convenience of technology and strengthens students’ attachment to their teachers. The exact formulas for correctness rate and fluency are displayed in Figure 4, with one example following in Figure 5.
Figure 4. The detailed formula of the two indexes: correctness and fluency.
Figure 5. The correctness rate and fluency scores shown on the teacher side.
Beyond recording the exact time of submission to track students’ performance, another function brings considerable benefits to teachers. After students submit their assignments, the teacher can download an overall report that records all scores generated by the ASR system. The report also includes statistics such as the mean score, the highest score in the group, and the mean score of the students in the top twenty-five percent. These statistics may help teachers and researchers analyze the distribution of students’ current levels and the gaps between levels.
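The statistics named in the report can be reproduced from a list of scores with a short sketch. How LearnMode defines the top twenty-five percent is not stated in this study; the sketch assumes it is the highest quarter of students by score, rounded down with a minimum of one student, and the function name `summarize_scores` is hypothetical.

```python
from statistics import mean


def summarize_scores(scores):
    """Return the mean, the highest score, and the mean of the top quarter."""
    if not scores:
        raise ValueError("scores must not be empty")
    ranked = sorted(scores, reverse=True)
    top_n = max(1, len(ranked) // 4)  # assumed definition of the top 25%
    return {
        "mean": mean(ranked),
        "highest": ranked[0],
        "top_quarter_mean": mean(ranked[:top_n]),
    }


# Hypothetical correctness scores for eight students.
print(summarize_scores([60.0, 70.0, 80.0, 90.0, 100.0, 50.0, 40.0, 30.0]))
```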
3.3.3 Feedback form, Questionnaire and Interview
Feedback forms are distributed to all participants to serve as a checklist and learning log while they perform the oral tasks (see Appendix B). On the form, learners record the scores given by the ASR in each exercise, the date of the practice, the total time spent on each exercise, the correctness rate and fluency of their production, and the problematic words they expect to meet and those they actually encounter. In addition, they indicate their perceptions of the effectiveness of the different types of feedback, including correctness rate, fluency, and red/orange highlight, by choosing from “strongly disagree” to “strongly agree.” Under each column, there is one statement designed by the researcher to examine participants’ opinions after every single trial, such as “I feel less stressed using ASR technology as the assessment tool for speaking English.” or “I think I can make progress with the help of the ASR system.” All the contents are written in Chinese, since the form is not a test of English but simply an evaluation form probing learners’ viewpoints. However, only a few participants turned in the feedback form at the end of the study, so the result will not be reported here.
All participants are also asked to fill out a questionnaire after they have completed all exercises (see Appendix C). The questionnaire consists of two parts: background information and attitudes toward the ASR system on the LearnMode website. It aims to gather participants’ observations and personal reflections on the design of the feedback types, as well as their thoughts and feelings about using this kind of technology to assess oral abilities. Participants respond to statements such as “I know how to modify and improve my pronunciation after I see the red/orange highlight of erroneous words,” or “The ASR system motivates me to further improve my English pronunciation.” A five-point Likert scale is adopted to allow learners to express their opinions.
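As an illustration of how responses to one five-point Likert item might be summarized, the sketch below reports the mean, the distribution across the five levels, and the share of respondents choosing 4 or 5. The study does not specify its summary statistics at this point, so these three measures are only one common descriptive option, and the function name `summarize_item` is hypothetical.

```python
from collections import Counter
from statistics import mean


def summarize_item(responses):
    """Summarize responses to one item coded 1 (strongly disagree) to 5 (strongly agree)."""
    if not responses:
        raise ValueError("responses must not be empty")
    if any(r not in (1, 2, 3, 4, 5) for r in responses):
        raise ValueError("responses must be integers from 1 to 5")
    counts = Counter(responses)
    distribution = {level: counts.get(level, 0) for level in range(1, 6)}
    agree_share = (distribution[4] + distribution[5]) / len(responses)
    return {
        "mean": mean(responses),
        "distribution": distribution,
        "agree_share": agree_share,
    }


# Hypothetical responses from five students to one statement.
print(summarize_item([5, 4, 4, 3, 2]))
```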
One-on-one interviews (see Appendix D) are conducted with the participants to collect their reflections on using the ASR system. The questions include “While using the ASR system for testing oral abilities, do you find any pronunciation problems that you haven’t noticed before?” and “Do you think the feedback offered by the system is good enough for you to identify your errors? Why or why not?” With the interview results in hand, the researcher can explore whether the ASR system increases learners’ awareness of the difficult words they encounter and whether the feedback provided by the system facilitates their improvement in pronunciation.
3.3.4 Human Raters and the Scoring
Two raters evaluated participants’ speech production in the current study. Both are female senior high school English teachers; one majored in linguistics in graduate school and is about to study abroad for a doctorate in applied linguistics this summer. At the outset, the two raters, one of whom is the researcher herself, discussed the scoring mechanism and conducted several training sessions to make sure they shared similar scoring beliefs and methods. As for the scoring system of LearnMode, although the underlying algorithm is unknown to users, the score produced by the Google Web Speech API is basically the number of matching words divided by the total number of words in the prompt sentence (Ashwell & Elam, 2017).
To ascertain inter-rater reliability, twelve of the participants’ recordings served as a trial evaluation before the formal one. Inter-rater reliability was estimated with the Pearson correlation coefficient, which indicates the degree of agreement between raters. The result revealed a high correlation of 0.799 between the two raters’ accuracy ratings. The two raters were thus considered ready to score the remaining spoken output.
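For reference, the Pearson correlation coefficient between two raters’ scores can be computed as in the sketch below. The ratings in the example are hypothetical and are not the study’s data; the function name `pearson_r` is illustrative.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length with at least two scores")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    covariation = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x)
    spread_y = sum((b - mean_y) ** 2 for b in y)
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined when one rater gives constant scores")
    return covariation / sqrt(spread_x * spread_y)


# Hypothetical accuracy ratings from two raters for three recordings.
rater_a = [1, 2, 3]
rater_b = [1, 3, 2]
print(pearson_r(rater_a, rater_b))
```

Note that the coefficient reflects how consistently the two sets of ratings rise and fall together, not whether the raters assign identical scores.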
3.4 Data analysis procedure
Two types of data are collected to address the research questions: the oral test results derived from both the ASR system and the human raters for research questions one to three, and the participants’ responses to the questionnaire and interviews for research question four. These data are then analyzed quantitatively or qualitatively to validate the results of pronunciation assessment assisted by the automatic speech recognition system.
3.4.1 Similarities and differences between the results of the ASR and human raters
For research question one, the students’ oral production is analyzed by the human raters, i.e., the researcher herself and another senior high school English teacher, who compare the original audio recordings with the recognition results. Whether the automatic speech technology of LearnMode can correctly diagnose students’ problematic words is one main issue that the study seeks to probe.
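One simple way to organize such a comparison is to sort the flagged words for a recording into three groups: words flagged by both the ASR and the human raters, words flagged only by the ASR, and words flagged only by the human raters. The sketch below is an illustration under that assumption rather than the study’s actual procedure; the word lists and the function name `compare_flags` are hypothetical.

```python
def compare_flags(asr_flagged, human_flagged):
    """Group problematic words by who flagged them."""
    asr = set(asr_flagged)
    human = set(human_flagged)
    return {
        "both": sorted(asr & human),        # ASR and human raters agree
        "asr_only": sorted(asr - human),    # possible false alarms by the ASR
        "human_only": sorted(human - asr),  # possible misses by the ASR
    }


# Hypothetical flags for one recording.
print(compare_flags(["ship", "think", "rice"], ["think", "rice", "very"]))
```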
3.4.2 The feedback form and responses to the questionnaire and interviews
The participants are asked to keep a record of the scores generated by the ASR system and the expected and actual problems they encounter. The feedback form functions merely as a learning log and secondary reference to (1) remind participants to complete the tasks and (2) raise their awareness of the pronunciations they find difficult.
Additionally, a five-point Likert scale (1 = strongly disagree; 5 = strongly agree) is adopted to allow each student to express how much they agree or disagree with statements concerning the effectiveness of the different types of feedback from the ASR system.
The questionnaire consists of basic background information and questions investigating participants’ attitudes toward using the ASR system as an assessment tool for speaking ability. The one-on-one interviews concentrate on learners’ perceptions of their own pronunciation problems and their observations while using the speech technology. The responses from the participants are synthesized to present overall reflections and comments on the oral assessment process.