Though some researchers acknowledged the potential of chatbots either by proposing new system designs (Lu et al., 2006; Stewart & File, 2007; Wilcox, 2011a, 2011b) or via study findings (Chen, 2012; Huang et al., 2008b; Jia & Chen, 2008; Sha, 2009), several studies on chatbots expressed concerns about the accuracy of chatbots’ messages (Jia, 2004a; Williams & van Compernolle, 2009), leaving us with an incomplete picture of the effectiveness of chatbots for EFL learners. There are several reasons for this situation.
First, the lack of a sensitive measurement tool might be one reason why previous studies failed to capture the beneficial effects that many found promising. The current study adopts various syntactic indices to examine the development of learners’ responses over a period of six weeks. It is possible that the beneficial effects derived from target language interactions are limited to certain aspects of the language. It is hoped that a more thorough analysis of the data will yield a more complete picture.
Second, the selection of participants might also determine whether chatbots’ positive effects are realized. For Conaim’s (2008a) and Jia’s (2004a) studies of machine-human conversations, it is arguable that genuine English learners, or lower-level learners who urgently need to sharpen their conversational skills, would be more likely to demonstrate the potential of chatbots. Having advanced learners mimic the language used by beginners or intermediate learners is, after all, not as authentic as collecting responses from participants at the targeted language levels. In addition, for students who have ample opportunities to talk to target language speakers or to take tailor-made language lessons, chatbots might be redundant and serve no purpose.
Third, to observe the impact of conversations with chatbots, evaluating the development of the users’ language is also essential. In formal Turing Tests, both the human and the machine entities are evaluated in terms of their human-likeness on a scale of one to ten. Most studies, however, only investigated the language used by the chatbots without analyzing the human interlocutors’ responses. The current study thus focuses on the potential impact of selected chatbots on participants’ production, writing complexity, and perceptions of chatbots.
Fourth, based on the collected literature, the researcher found a paucity of experimental studies of chatbots on English learning. Research involving the latest Loebner Prize winners was also rare (Torma, 2011). If the claimed benefits from previous studies are significant enough, an experimental design study with sufficient treatment duration might shed more light on the effects of conversing with chatbots.
Fifth, the chatbots involved in previous studies were either continuously improved by their botmasters or abandoned altogether and no longer accessible online. It is thus necessary to keep evaluating recently developed chatbots to shed light on their effectiveness in carrying out appropriate conversations. The updated versions might also come with additional features that support better conversations.
Sixth, even though the variety of chatbots has been cited as an advantage that spurs motivation and prevents boredom (Fryer & Carpenter, 2006), no studies have been found to make good use of it. Continuous conversations with even a well-designed chatbot might still suffer from repetitions of similar topics or even the same utterances; therefore, the current study offers participants opportunities to talk to different chatbots to add variety to the learning process and to maintain learning momentum.
This study attempts to overcome the six issues observed in the review of previous chatbot studies. The methodology in the next chapter is designed to counter the aforementioned limitations.
CHAPTER THREE
METHODOLOGY
3.1 Selection Process and Design of Chatbots
3.1.1 Chatbot Selection
There are four criteria in the chatbot selection process for the current study. First, chatbots that are limited to portable devices or commercial products are not included for feasibility reasons. Second, for the sake of data collection, the current study selects only chatbots that offer complete conversation transcripts. This restriction excludes chatbots constructed using the mechanisms of the Personality Forge and Façade, because neither offers complete chat transcripts. Therefore, only AIML-related bots, Jabberwacky-style bots, and ChatScript bots are left to be considered.
Third, since the current study focuses on applying chatbots to language learning, chatbots that are designed explicitly for teaching English as a foreign language (EFL) are given priority. Meg, a chatbot on an online English tutoring website, is free to try on SpeakGlobal (http://www.speakglobal.co.jp). In the researcher’s email correspondence with Bruce Wilcox, he indicated that Meg was the product of one of his contracted projects and was programmed in ChatScript, the chatbot programming language he created (Wilcox, personal communication, June 16, 2012, see Appendix G). Meg offers complete transcripts and comes equipped with an animated avatar, a natural speech synthesis unit, and a preliminary speech recognition engine in the free trial version. Because Meg offers complete transcripts and serves the purpose of EFL learning, she is chosen to represent ChatScript bots in the current study.
Fourth, chatbots that are more likely to sustain human-like conversations with the participants are selected. There are two groups of such candidates. The first consists of previous winners of various chatbot contests. The second consists of chatbots equipped with additional functions that shelter conversations with their human counterparts.
These two categories take precedence over the others if transcripts are available. The “Awards” section of the chatbots.org website lists both the most awarded chatbots ever and the most awarded chatbots within the past three years. Based on the four selection criteria, Cleverbot and Skynet-AI are chosen in addition to Meg. Table 3 presents the details of the original candidates.
Table 3: Details of the Original Chatbot Candidates for the Current Study
Chatbot               PC/Free   Transcripts   EFL/ESL   Features to Maintain Chats
Most Awarded Currently
A.L.I.C.E             Yes       No            No        No
Ultra Hal             Yes       No            No        No
Talk-Bot              Yes       No            No        No
Elbot                 Yes       No            No        No
Jabberwacky           Yes       Yes           No        Speaking to itself
Most Awarded in Three Years
Capt. Jack Sparrow    Yes       No            No        No
Cleverbot             Yes       Yes           No        Speaking to itself
Skynet-AI             Yes       Yes           No        Drawing on online resources
Zoe                   Yes       No            No        No
Designed for Language Instruction
Meg                   Yes       Yes           Yes       Suggested topics and sentences
Mike                  Yes       No            Yes       No
Dave ESL Bot          No        Yes           Yes       Suggested topics and responses
Note. PC/Free = computer-based and free of charge; EFL/ESL = whether the bot is constructed for English as a second language (ESL) or English as a foreign language (EFL) instruction.
For the convenience of data collection, chatbots that work exclusively on tablets or laptops, do not offer complete transcripts, or require subscription fees are excluded first. From the remaining free chatbots, those that support English learning are then chosen. The last round of selection is based on chatbot contest results and additional features. Since Jabberwacky and Cleverbot belong to the same botmaster, Rollo Carpenter, and both meet the same criteria, his latest work, Cleverbot, is chosen.
3.1.2 Design of the Selected Chatbots
This section describes the functionalities of the three selected chatbots, namely Meg, Cleverbot, and Skynet-AI. Figure 3 displays the interface of Meg.
3.1.2.1 Meg
Meg is built with ChatScript, the mechanism developed by Wilcox, and is the only chosen system geared toward EFL instruction. Its creator therefore equipped it with additional supporting functions to shelter the chat. First, Meg’s range of topics is specified for learners to choose from. Suggested sentences are also provided, as shown in the right section of Figure 3. As a ChatScript bot, Meg also has an embedded dictionary and can answer definitional questions. In addition, Meg speaks its responses aloud in a natural voice.
Figure 3: User Interface of Meg in Speakglobal, Ltd.
Meg was originally made for Japanese students; therefore, its translation function covers only English-Japanese translation. Chen (2012) found dictionary consultation, for either comprehension or production, to be common among secondary school users of chatbots. So that lexical difficulty or the consultation of external devices does not disrupt the conversation, the current study invites users to install TransOver (Version 0.28), a free click-and-translate plug-in, to mitigate linguistic difficulty in comprehension. TransOver can instantly translate words, phrases, or even paragraphs and display the translations in a separate box, as illustrated in Figure 4.
Figure 4: Simultaneous Translation Using TransOver with Meg
3.1.2.2 Cleverbot
Cleverbot is the latest representative chatbot from Rollo Carpenter’s Jabberwacky.com, Icogno Ltd., and Existor Ltd. Like other Jabberwacky bots such as Jabberwacky, George, and Joan, Cleverbot learns from previous chat sessions and selects the most
contextually appropriate response from what other humans have previously said to it (Jabberwacky, n.d.). Cleverbot’s website (www.cleverbot.com) shows only a static image of a brain and the bot’s name. Cleverbot accepts input through both a text bar and a preliminary voice recognition function in the Google Chrome web browser, but it responds only in text.
There are three buttons below Cleverbot’s message bar, as illustrated in Figure 5. After typing a message to Cleverbot, one can simply press Enter or click “Think About It!” and Cleverbot will generate the most appropriate response from its database. If one does not know how to respond to Cleverbot’s questions, one can click “Think For Me,” and Cleverbot will automatically generate a response on the user’s behalf (Ask About Tech, 2009) to keep the conversation going.
Figure 5: The Interface of Cleverbot
3.1.2.3 Skynet-AI
Skynet-AI (Version .005) is a rule-based chatbot made using JavaScript Artificial Intelligence Language (JAIL) and has a static image of a robotic head that glares from its eyes when responding to user input. Skynet-AI’s website lists three features of its mechanism
(Skynet-AI, n.d.): (1) the AI neural net is prioritized, hyperlinked and cascading; (2) dynamic learning is only retained during your current on-line session; (3) in question answering, the AI may resort to a cascading search algorithm. As illustrated in Figure 6, Skynet-AI is different from Meg and Cleverbot in that it simultaneously searches for online information to support accurate responses. For Version .005, Skynet-AI (n.d.) lists the following updated functions:
(1) initial implementation of hyperlinked, cascading neural net;
(2) streamlined high speed word math problem solvers;
(3) initial implementation of direct web page access based on context;
(4) initial implementation of Temporal Reasoning System;
(5) initial integration of Uber-Parser™;
(6) extended dynamic memory/learning system;
(7) extended spell check system;
(8) world’s fastest AI system.
Figure 6: The Interface of Skynet-AI
Skynet-AI (Version .005) was undergoing continuous improvement, and an upgrade to Skynet-AI (Version .006) during the preparation stage of this research was possible. In correspondence with the researcher, the creator, Ken Hurtubise, indicated that the upgrade might take about three to four weeks to finish (Hurtubise, personal communication, September 11, 2012, see Appendix H). However, the upgrade was not released during the study, so the configuration of Skynet-AI remained static for this experiment.
3.2 Design of the Study
3.2.1 Participants
A class of 42 students in a public senior high school in Nantou was recruited to participate in the study. Each participant received a hundred-dollar gift card from the researcher and a certificate issued by the school. The sample consisted of volunteers who consented to be involved in the study. A research consent form was given to the potential participants to obtain their consent (see Appendix A).
All participants were assigned to carry out open-ended conversations with each of the three selected chatbots for at least thirty minutes weekly for two weeks, accounting for roughly four hours of chat in an eight-week period. The current study adopts a within-subject design; three sequences of the three assigned chatbots were created to counterbalance boredom and sequence effects on the data. During the first and the eighth week, participants conversed with all three chatbots for fifteen minutes each as pretests and posttests. The data obtained served as the starting proficiency and the outcome proficiency measures, respectively.
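The counterbalancing of chatbot order can be sketched as a cyclic rotation of the three chatbots. This is only an illustration: the rotation rule, the round-robin assignment of participants to sequences, and the function names are assumptions, not the procedure documented in this study.

```python
from collections import Counter

CHATBOTS = ["Meg", "Skynet-AI", "Cleverbot"]


def rotated_sequences(bots):
    """Return one cyclic rotation of `bots` per starting position."""
    return [bots[i:] + bots[:i] for i in range(len(bots))]


def assign_sequences(n_participants, sequences):
    """Assign sequences round-robin (a hypothetical rule) to balance group sizes."""
    return {pid: sequences[(pid - 1) % len(sequences)]
            for pid in range(1, n_participants + 1)}


sequences = rotated_sequences(CHATBOTS)
assignment = assign_sequences(42, sequences)
# Count how many participants follow each chatbot order.
group_sizes = Counter(tuple(seq) for seq in assignment.values())
```

With 42 participants and three rotations, each order is followed by the same number of participants, so every chatbot appears equally often in the first, second, and third position.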
3.2.2 Instruments
This study adopts quantitative corpus analyses of changes in the quality of the participants’ chat transcripts, together with a survey questionnaire. The instruments are described below.
Chatbot user manual. Prior to the study, participants received an instruction pamphlet (see Appendix B) that explained the functions of all three chatbots. In addition, an online dictionary installation manual (see Appendix C) was given to assist students who wished to use the dictionary at home.
User perception questionnaire. The questionnaire was designed by the researcher based on the literature review to examine students’ perceptions of chatbots and their features. The items use a 4-point Likert scale, along with spaces for written comments.
The first draft of the questionnaire contained twelve items (see Appendix D). After a pilot study in which a group of 15 similar participants chatted with a chatbot for ten minutes and filled out the questionnaire, the researcher designed the final version of the questionnaire (see Appendix E). Three items were deleted because of their negative correlation values. The Cronbach’s alpha for the formal questionnaire was 0.79. Examples are listed below.
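I think this chatbot’s responses are, on the whole, appropriate in content and rarely beside the point.
Completely disagree   Partly disagree   Partly agree   Completely agree
I think Cleverbot’s Think For Me! (the middle button, which asks Cleverbot to answer for you) makes the chat more natural and fluent.
Completely disagree   Partly disagree   Partly agree   Completely agree
Because: _________________________________________________________________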
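The reliability coefficient reported for the questionnaire can be illustrated with the standard Cronbach’s alpha formula, alpha = k / (k - 1) × (1 - sum of item variances / variance of total scores). The sketch below is not the procedure used in this study; the function name and the sample answers are invented for illustration.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's alpha for a list of respondents, each a list of item scores."""
    k = len(responses[0])              # number of questionnaire items
    items = list(zip(*responses))      # transpose: one tuple of scores per item
    item_var_sum = sum(variance(item) for item in items)
    total_var = variance([sum(r) for r in responses])
    return k / (k - 1) * (1 - item_var_sum / total_var)


# Made-up 4-point Likert answers (5 respondents x 3 items), for illustration only.
sample = [
    [4, 3, 4],
    [3, 3, 3],
    [2, 2, 3],
    [4, 4, 4],
    [1, 2, 2],
]
alpha = cronbach_alpha(sample)
```

An item that correlates negatively with the others inflates the summed item variances relative to the total-score variance and so lowers alpha, which is one common rationale for dropping such items after a pilot.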
3.2.3 Data Collection Procedures
Table 4 presents the research schedule. Prior to the experiment, the researcher informed the participants of the details of the chatbots using the pamphlet in Appendix B. In addition, the online dictionary manual (see Appendix C) was recommended as a sheltering resource for students who found the chatbots’ language too difficult to understand.
For Week 1 and Week 8, the participants were gathered in a computer lab to chat with Meg, Cleverbot, and Skynet-AI for a total of 45 minutes, 15 minutes for each chatbot. The students were required to store the transcripts and to send them to the researcher before they left the lab. The data collected in Week 1 served as the initial proficiency while those collected in Week 8 served as the outcome measures.
Week 2    Week 3    Week 4      Week 5      Week 6      Week 7
Meg       Meg       Skynet      Skynet      Cleverbot   Cleverbot
Skynet    Skynet    Cleverbot   Cleverbot   Meg         Meg
Note. Week 3, Week 5, and Week 7 are the three highlighted weeks. The students were required to chat with the assigned chatbot for thirty minutes every week at a time and place of their convenience and to send their transcripts to the researcher to prove their participation. In Week 3, Week 5, and Week 7, the students also filled out the perception questionnaire before they moved on to chat with a different chatbot in the following week.
3.2.4 Data Analysis Procedures
The participants’ responses were first manually separated from the chatbots’ responses in the transcripts using Microsoft Word 2010 to ensure that the analysis was based purely on the students’ own production. To ensure the quality of the analysis, the researcher also capitalized sentence-initial letters and added a period “.” at the end of the students’ responses if the students had not done so.
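The preprocessing described above was done by hand in this study. A scripted equivalent might look like the sketch below, which assumes a “Speaker: text” transcript layout like the one used in the extracts of this chapter; the function names and the sample transcript are hypothetical, and only the first letter of each turn is capitalized.

```python
def extract_user_turns(transcript, user_label="User"):
    """Keep only the learner's turns from a 'Speaker: text' transcript."""
    turns = []
    for line in transcript.splitlines():
        speaker, sep, text = line.partition(":")
        # Skip lines without a speaker label, chatbot turns, and empty turns.
        if sep and speaker.strip() == user_label and text.strip():
            turns.append(text.strip())
    return turns


def normalize_turn(text):
    """Capitalize the first letter and add a period if end punctuation is missing."""
    text = text[0].upper() + text[1:]
    if text[-1] not in ".?!":
        text += "."
    return text


# Invented transcript, for illustration only.
transcript = """Cleverbot: What song?
User: i like korean songs
Cleverbot: Yes.
User: What do you mean?"""

cleaned = [normalize_turn(t) for t in extract_user_turns(transcript)]
```

The sketch deliberately leaves spelling and word-internal capitalization untouched, since those features are judged separately in the accuracy analysis.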
To examine the fluency and complexity of the transcripts, the web-based version of L2 Syntactic Complexity Analyzer (L2SCA) (Lu, 2010) was adopted. L2SCA counts the
frequency of nine structures in the text as fluency indices: words (W), sentences (S), verb phrases (VP), clauses (C), T-units (T), dependent clauses (DC), complex T-units (CT), coordinate phrases (CP), and complex nominals (CN). Although the online version was limited in processing memory and, according to the website, might skip longer sentences, none of the participants wrote sentences longer than one line; thus, the researcher still used the online analyzer.
L2SCA also computes fourteen syntactic complexity indices, which Lu (2010) grouped into five types, as indicated in Figure 7. Mean length of sentence (MLS), mean length of T-unit (MLT), and mean length of clause (MLC) together belong to the first type, length of production unit. Clauses per sentence (C/S) alone represents sentence complexity, the second type. Subordination is the third type and involves clauses per T-unit (C/T), complex T-unit ratio (CT/T), dependent clauses per clause (DC/C), and dependent clauses per T-unit (DC/T).
Coordination, the fourth type, consists of T-units per sentence (T/S), coordinate phrases per T-unit (CP/T), and coordinate phrases per clause (CP/C). The fifth type is particular structures and includes complex nominals per T-unit (CN/T), complex nominals per clause (CN/C), and verb phrases per T-unit (VP/T). The five types were used to discuss the results of the study.
Figure 8 displays the analyzer interface.
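Each of the fourteen indices is a ratio of two of the nine frequency counts, as the abbreviations suggest (for example, MLS is W/S). The sketch below derives them from a dictionary of counts. It is not L2SCA’s own code: the function name, the zero-denominator fallback, and the sample counts are assumptions made for illustration.

```python
def complexity_indices(c):
    """Derive the fourteen ratio indices from the nine raw counts in dict `c`."""
    def ratio(num, den):
        # Returning 0.0 for an empty denominator is a choice of this sketch.
        return c[num] / c[den] if c[den] else 0.0

    return {
        # Type 1: length of production unit
        "MLS": ratio("W", "S"), "MLT": ratio("W", "T"), "MLC": ratio("W", "C"),
        # Type 2: sentence complexity
        "C/S": ratio("C", "S"),
        # Type 3: subordination
        "C/T": ratio("C", "T"), "CT/T": ratio("CT", "T"),
        "DC/C": ratio("DC", "C"), "DC/T": ratio("DC", "T"),
        # Type 4: coordination
        "T/S": ratio("T", "S"), "CP/T": ratio("CP", "T"), "CP/C": ratio("CP", "C"),
        # Type 5: particular structures
        "CN/T": ratio("CN", "T"), "CN/C": ratio("CN", "C"), "VP/T": ratio("VP", "T"),
    }


# Hypothetical counts, not taken from the study's data.
counts = {"W": 120, "S": 20, "VP": 30, "C": 25, "T": 22,
          "DC": 5, "CT": 4, "CP": 3, "CN": 10}
indices = complexity_indices(counts)
```

Because every index is normalized by a unit count, learners who produce different amounts of text in the same fifteen minutes can still be compared on complexity.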
Figure 7: Five Types of Syntactic Complexity Indices Automated (Reproduced from Lu, 2010)
Figure 8: The Interface of L2SCA
One can simply paste the texts to be analyzed, either a single text into Step 1 or two texts into Step 1 and Step 2, respectively. After the indices to be analyzed are selected, a diagram is generated. Figure 9 shows the resulting diagram for the sample sentences “I think I am smart. You think? Well, one can never be certain about anything.” The diagram is animated and can display the exact values.
Figure 9: A Sample Analysis Result
This study also used PASW 18 for statistical analyses. For complexity indices, paired t-tests were used to compare complexity indices between pre- and post-tests.
For fluency indices, Wilcoxon signed-rank tests were adopted to examine whether the frequency counts of various structures were significantly different between the pre- and post-tests.
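The two test statistics named above can be computed by hand as in the following sketch. The study itself ran these tests in PASW 18; the functions below are illustrative assumptions, they return only the statistics (no p-values), and the scores are invented.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(pre, post):
    """Return (t, df) for a paired-samples t-test on the post - pre differences."""
    diffs = [b - a for a, b in zip(pre, post)]
    t = mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))
    return t, len(diffs) - 1


def wilcoxon_w(pre, post):
    """Return the Wilcoxon signed-rank statistic W = min(W+, W-).

    Zero differences are dropped, and tied absolute differences share
    their average rank.
    """
    diffs = [b - a for a, b in zip(pre, post) if b != a]
    ordered = sorted(diffs, key=abs)
    ranks = [0.0] * len(ordered)
    i = 0
    while i < len(ordered):
        j = i
        while j < len(ordered) and abs(ordered[j]) == abs(ordered[i]):
            j += 1
        avg_rank = (i + 1 + j) / 2  # mean of the ranks i+1 .. j
        for k in range(i, j):
            ranks[k] = avg_rank
        i = j
    w_plus = sum(r for d, r in zip(ordered, ranks) if d > 0)
    w_minus = sum(r for d, r in zip(ordered, ranks) if d < 0)
    return min(w_plus, w_minus)


# Invented pre/post scores for five learners, for illustration only.
pre = [10, 12, 9, 14, 11]
post = [12, 15, 9, 13, 15]
t_stat, df = paired_t(pre, post)
w_stat = wilcoxon_w(pre, post)
```

The signed-rank test uses only the ordering of the differences, which makes it a common choice for frequency counts that may not be normally distributed.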
For grammar accuracy and message appropriateness, paired sample t-tests were performed and an inter-rater reliability test was also computed to ensure reliable grading.
Error-free sentence ratio (EFS/S) was chosen to measure grammatical accuracy because nearly all learners used only one sentence per response to a bot, so this index simplified the analysis while serving a similar analytical goal. The researcher printed out and read all
transcripts in the pretests and posttests to calculate both indices. The grammar check was done in a nearly comprehensive and pass-fail manner for each sentence.
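The pass-fail grading lends itself to a simple calculation of EFS/S as the share of sentences marked error-free. The sketch below also shows plain percent agreement between two raters; the source does not state which inter-rater statistic was used, so that choice, the function names, and the sample marks are all assumptions.

```python
def efs_ratio(judgments):
    """EFS/S: proportion of sentences marked error-free (True)."""
    return sum(judgments) / len(judgments)


def percent_agreement(rater_a, rater_b):
    """Share of sentences on which two raters gave the same pass/fail mark."""
    same = sum(a == b for a, b in zip(rater_a, rater_b))
    return same / len(rater_a)


# Hypothetical pass/fail marks for one learner's eight sentences.
rater_a = [True, True, False, True, False, True, True, True]
rater_b = [True, True, False, True, True, True, True, True]
ratio_a = efs_ratio(rater_a)
agreement = percent_agreement(rater_a, rater_b)
```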
To set aside problems that were more likely the result of careless typing, neither rater regarded problems of mechanics, missing spaces, or capitalization as errors. The examples below illustrate problems that were not regarded as errors in this study. Students’ responses are italicized and boldfaced, the erroneous segments are underlined, and the corrections made by the raters are given in parentheses. At the end of each extract, three pieces of information are presented in parentheses in sequence: the ID number of the participant in this study, the data source, and the bot he or she conversed with.
1. Lack of space and wrong mechanics.
User: Are you a girl or aman (a man).
Cleverbot: I am a girl. Are you a guy?
(#7-pretest-Cleverbot)
Example 1 shows that the student forgot to put a space between the article “a” and the noun “man.” In addition, the user did not put a question mark “?” at the end of the question.
2. Capitalization
Cleverbot: What song?
User: What? Do you mean “What is my favorite song?”
Cleverbot: Yes.
User: I like JJ Lin and korean (Korean) songs, especially Super Junior.
(#40-pretest-Cleverbot)
Example 2 shows a capitalization issue with the proper adjective “Korean.” None of the capitalization, spacing, or mechanics problems were considered errors in this study.
Only syntactic, morphological, lexical and spelling errors were counted as errors. Here are the