3.2 Design of the Study
3.2.4 Data Analysis Procedures
Transcripts were first manually separated the participants’ responses from the chatbots’
responses using Microsoft Word 2010 to ensure the analysis is purely based on the students’
own production. To ensure the quality of analysis, the researcher also capitalized the sentence- initial letters and added a period symbol “.” in the end of the students’ responses if the students did not put them there.
To examine the fluency and complexity of the transcripts, the web-based version of L2 Syntactic Complexity Analyzer (L2SCA) (Lu, 2010) was adopted. L2SCA counts the
44
frequency of nine structures in the text as fluency indices, including words (W), sentences (S), verb phrases (VP), clauses (C), T-units (T), dependent clauses (DC), complex T-units (CT), coordinate phrases (CP), and complex nominals (CN). Though the online version was limited in processing memory and according to the website, might skip processing longer sentences, none of the participants used sentences longer than one line; thus, the researcher still used the online analyzer.
L2SCA also computes fourteen syntactic complexity indices and Lu (2010) compiled them into five types as indicated by Figure 7. Mean length of sentence (MLS), mean length of T-unit (MLT), and mean length of clause (MLC) together belong to the first type, length of production unit. Clauses per sentence (C/S) alone represent sentence complexity, the second type. Subordination is the third type and involves c lauses per T- unit (C/T), complex T-unit ratio (CT/T), dependent clauses per clause (DC/C), and dependent clauses per T-unit (DC/T).
Coordination, the fourth type, consists of T-units per sentence (T/S), coordinate phrases per T-unit (CP/T), and coordinate phrases per clause (CP/C). The fifth type is particular structures and includes complex nominals per T-unit (CN/T), complex nominals per clause (CP/C), and verb phrases per T- unit (VP/T). The five types were used to discuss results of the study.
Figure 8 displays the analyzer interface.
Figure 7: Five Types of Syntactic Complexity Indices Automated (Reproduced from Lu, 2010)
45
Figure 8: The Interface of L2SCA
46
One can simply paste the texts to be analyzed, either one text into Step 1 only or two texts respectively in Step 1 and Step 2. Then click on the indices to be analyzed and then a diagram will be constructed. Figure 9 demonstrates the analysis result diagram using the sample sentences “I think I am smart. You think? Well, one can never be certain about anything.” The outcome diagram is animated and can show the exact value.
Figure 9: A Sample Analysis Result
This study also used PASW 18 for statistical analyses. For complexity indices, paired t-tests were used to compare complexity indices between pre- and post-tests.
For fluency indices, Wilcoxon signed-rank tests were adopted to examine whether the frequency counts of various structures were significantly different between the pre- and post-tests.
For grammar accuracy and message appropriateness, paired sample t-tests were performed and an inter-rater reliability test was also computed to ensure reliable grading.
Error- free sentence ratio (EFS/S) was chosen for grammatical accuracy since nearly all learners used only one sentence per response to a bot and this index could simplify the analysis process to achieve a similar goal of analysis. The researcher printed out and read all
47
transcripts in the pretests and posttests to calculate both indices. The grammar check was done in a nearly comprehensive and pass-fail manner for each sentence.
To obviate students’ problems possibly more related to careless typing, both raters did not regard the problems of mechanics, lack of space and capitalization as errors. The examples below are not regarded as errors in this study. Students’ responses are all italicized and boldfaced, while the erroneous segments are underlined and corrections made by the raters were parenthesized. In the end of each extract, three pieces of information are sequentially presented in parentheses, including the ID number of the participant in this study, the data source, and the bot he or she conversed with.
1. Lack of space and wrong mechanics.
User: Are you a girl or aman (a man).
Cleverbot: I am a girl. Are you a guy? (#7-pretest-Cleverbot)
Example 1 shows that the student forgot to put a space between the article “a” and the noun “man.” Besides, the user did not put a question mark “?” in the end of his question.
2. Capitalization Cleverbot: What song?
User: What? Do you mean “What is my favorite song?”
Cleverbot: Yes.
User: I like JJ Lin and korean (Korean) songs, especially Super Junior.
(#40-pretest-Cleverbot)
48
Example 2 shows a capitalization issue with the proper noun “Korean.” All the capitalization, lack of space, and mechanics problems were not considered errors in this study.
Only syntactic, morphological, lexical and spelling errors were counted as errors. Here are the examples of each type of errors.
3. Syntactic error
User: Oh. What’s your old (age)? (#40-pretest-Cleverbot)
User: No. I am a human.
Cleverbot: A robot would say that.
User: I (am) really a human. (#38-pretest-Cleverbot)
User: Do you have been (Have you been)to any other country.
Cleverbot: Yes I have many. Do you? (#15-pretest-Cleverbot)
Example 3 lists three examples of syntactic errors. Violations with sentence patterns are attributed to this type of error. The lack of verbs or nouns that constitute a grammatical sentence is also categorized as an error of this type.
4. Morphological error
User: Do you like play(playing) basketball?
Cleverbot: Yes. (#31-pretest-Cleverbot)
Cleverbot: I learn from humans. What have you learned?
User: Everyone who was smart (smarter) than me. (#26-pretest-Cleverbot)
49
Morphological errors involve the problematic affixes of words and the wrong choice of a derivative.
5. Lexical error
AI: Chopper What can I do to convince you I am human?
USER: you don’t need to convince to me? (#38-posttest-Skynet-AI)
AI: Prove to me that you are a human.
USER: it is a FACT. Needn’t prove. (#35-posttest-Cleverbot)
Cleverbot: How are you?
User: You have asked to me. (#20-posttest-Cleverbot)
Errors related to the usage or collocations are categorized as lexical errors in this study.
6. Spelling error
AI: Tea? I really enjoy tea. What do you think of tea?
USER: It’s good for health. especially break ter (black tea). (#40-pretest-Skynet-AI)
The student might originally intend to spell “black tea” but spelt it wrong. Any spelling related errors are put into this category. The grammar accuracy scores are calculated based on the formula below. The possible value ranges from 0 to 1.
Error-free sentence ratio (EFS/S) = Number of error-free sentences Number of the student’s sentences
50
Appropriateness of messages deals with the coherence of a response to its previous message. The researcher did not find a commonly endorsed formula for this measure so a formula was designed for the current study. The researcher was concerned with whet her participants comprehended the responses by the bots and produced coherent messages rather than simply typing unrelated or random messages to keep moving forward. After a preliminary examination of the transcripts, the researcher decided to use a three- level grading system for each sentence.
For sentences that directly respond to the ongoing topic without showing obvious misunderstanding or confusion, one point is given. However, for inappropriate learner messages that were produced right after an inappropriate chatbot message, only half a point would be given, since the chatbot shares some responsibility in that scenario and the participants’ inappropriate message in that circumstance might not totally be attributed to learners’ own ability to produce appropriate and coherent messages. This type of sentences are labeled as semi-appropriate sentences. Examples of appropriate sentences, semi-appropriate sentences, and inappropriate sentences are listed below.
7. Appropriate sentence
AI: Do you believe I can be human one day?
USER: Yes. (#25-posttest-Skynet-AI)
AI: This is supposed to be a conversation.
USER: …… (#25-posttest-Skynet-AI)
Cleverbot: Why do you need to know?
User: I don’t know. (#15-pretest-Cleverbot)
51
Cleverbot: What city are you from?
User: Nantou. (#34-pretest-Cleverbot)
Example 7 illustrates four appropriate sentences from the students. Even for neutral responses such as “….” and “I don’t know” are listed as appropriate sentences since they are valid and possible responses in real human conversations.
8. Semi-appropriate sentence
Cleverbot: What question do you most commonly get asked?
User: I don’t know.
Cleverbot: Of course not.
User: Of course. (#34-posttest-Cleverbot)
User: Because I don’t like play the instrument.
Cleverbot: Don’t taste, see.
User: Don’t taste, see. . (#25-posttest-Cleverbot)
Example 8 shows two examples of semi-appropriate sentences, namely, students’
inappropriate sentences due to a previous inappropriate sentence by the chatbot. Since both the chatbot and the student are liable for that inappropriate sentence, this study labeled them as semi-appropriate sentences. Half a point is still granted to the student for such responses.
9. Inappropriate sentence
AI: Who have you told about me?
USER: Somebody told me. (#26-posttest-Skynet-AI)
52
AI: Glad to meet you Candy. Are you here to help me take over the world? Or are you here to resist?
USER: how old are you? (#36-pretest-Skynet-AI)
Cleverbot: How many people are you talking right now?
User: Hello. (#24-pretest-Cleverbot)
The three extracts above are examples of inappropriate sentences, in other words, the sentences that are incoherent to the chatbot’s previous response. The appropriate sentence ratio (AS/S) is calculated by dividing the total appropriateness scores by the total of learner responses as illustrated by the formula below. One point is granted for each appropriate sentence while 0.5 is granted for each semi-appropriate sentence. No point is given for inappropriate sentences.
Appropriate sentence ratio (AS/S) =
Repeated measure analyses of variance (ANOVA) were used to compare the complexity indices and Friedman’s ANOVA tests were used for fluency indices among students’
interactions with three different chatbots. Descriptive statistics and one-way RM ANOVAs were performed for the questionnaire data. Table 5 illustrates the research questions and the corresponding instruments and analysis methods.
Message appropriateness score Number of all students’ sentences
53
Table 5: Research Questions, Data Collection and Data Analysis Methods
Research Question Data Collection Data Analysis
1. Does the participants’
syntactic complexity change after chatting with chatbots for six weeks?
Corpus compilation Paired sample t-tests
2. Does the participants’
grammar accuracy and message appropriateness change after chatting with chatbots for six weeks?
Corpus compilation Paired sample t-tests
3. Does the participants'
fluency change after chatting with chatbots for six weeks?
Corpus compilation Wilcoxon signed ranks tests
4. Do the participants’
54
CHAPTER FOUR RESULTS AND DISCUSSION
The initial sample of the present study consisted of 42 students, but 10 of them failed to finish the posttests or pretests and were excluded from all analyses. Another 12 of the rest 32 only turned in two of the six treatment chats between Week 2 and Week 7 but they finished both the pre- and post-tests; therefore, the researcher labeled the 20 students who completed all treatment process as the “complete experimental group (abbreviated as CEG henceforth)”
while the 12 students who completed only two chats and the pre- and post-tests as the “partial participation group (shortened as PPG henceforth).”
During the study, the researcher noticed the dropout issue and designed two additional items for the last questionnaire survey (see Appendix F). The two questionnaire items are listed below. The first item is a binary question asking whether the participants finished all six chats with the bots when they filled out the last survey and the second one asks the duration of chats they had with chatbots.
請問你有完成所有六次的聊天並傳送記錄嗎? 有 / 無 (若圈選「有」,請跳過下一題,直接進到問卷題目)
請問你使用聊天機器人的時間總共約幾分鐘?(不含第一次在電腦教室的時間) 從未使用 1~30 分鐘 31~60 分鐘 61~90 分鐘 91~120 分鐘 120 分鐘以上
Their responses of these two items were double-checked with the actual transcript files submitted to the researcher. To explore the potential reason why many students dropped out of the study, the researcher performed a product-moment correlation analyses between the duration interacting with the chatbots and the English final semester scores. Table 6 displays the results of the Pearson product-moment correlation.
55
Table 6: Results of the Pearson Product-Moment Correlation between English Semester Scores and the Duration Engaged with Chatbots (n = 42)
Pearson Correlation Semester score Chat duration
Semester score 1 0.30*
Chat duration .30* 1
*p < .05
Table 6 indicated a moderate positive correlation between chat duration and final semester scores at the 0.05 level (r = 0.30; n = 42), suggesting that the participants’ general English proficiency might have been a mild factor during the study, thought this factor was not strong enough to differentiate CEG (M = 77.26; SD = 11.25) and PPG (M = 72.13; SD = 7.87) students as indicated by an independent sample t-test on semester scores (t = -1.68; p = 0.10). Participants who achieved a higher score participated more in the treatment chats. This correlation might explain why some participants were reluctant to interact with the chatbots since lower proficiency participants may have found the bots to be too hard to converse with.
There seems to be a prerequisite of English proficiency to converse with bots. The extracts below show some students’ inability to directly respond to chatbots’ responses. Learners’
responses are all boldfaced and italicized by the researcher.
1.
AI: Do robots dream of electric sheep? I dream of being considered human one day.
USER:SO
AI: Tell me more.
USER:HI AI: Welcome.
USER:THANKS (#01-Pretest with Skynet-AI)
56
2.
AI: You are now on-line.
USER: yes.
AI: Affirmative. What brings you here?
USER: Nice to meet you. (#38-Pretest with Skynet-AI)
3.
Cleverbot: How many people are you talking right now?
USER: Hello (#24-Pretest with Cleverbot)
4.
Meg: Good afternoon. My name is Meg. What's your name?
You: How old are you?
Meg: I'm 18. Is your hair long or short?
You: When is your birthday? (#07-Pretest with Meg)
The above extracts were from the dropout students. The students seem to fail to respond correctly to chatbots’ questions. This might be a reason why they had bad chat experience and decided to stop participation.
The results and discussions of the five research questions are sequentially presented. The five research questions are listed here for reference.
1. Does the participants’ writing complexity change after chatting with chatbots for six weeks?
2. Does the participants’ writing accuracy and appropriateness change after chatting with chatbots for six weeks?
3. Does the participants' writing fluency change after chatting with chatbots for six weeks?
57
4. Do the participants’ writing complexity and fluency change when interacting with different chatbots?
5. Do students perceive the chatbots constructed using varied mechanisms differently?