Chatbots’ Influence on Grammar Accuracy and Message Appropriateness

Although Lu’s (2010) L2 Syntactic Complexity Analyzer (L2SCA) described the complexity of the submitted texts using mostly well-endorsed indices, the analyzer did not consider grammar accuracy or message appropriateness. Hence, the researcher conducted further analyses of the pretest and posttest transcripts. Only the transcripts with Cleverbot and Skynet-AI were analyzed, for two reasons. First, the questionnaire results indicated that Meg was perceived as significantly poorer than Skynet-AI and Cleverbot at supplying appropriate responses. Because learners might be confused by many inappropriate messages from Meg, they might not be able to stay on topic and produce coherent messages. Second, Meg’s special feature of offering a list of suggested questions encouraged several participants to simply copy and paste the suggested sentences instead of producing messages themselves. Thus, students’ sentences in the transcripts with Meg might not reflect their own grammar accuracy. The copied sentences were deleted from the data in the complexity analyses, but they might undermine the credibility of the following analyses, because the total number of learners’ responses is used to calculate grammatical accuracy and appropriateness.

The researcher first evaluated all 128 transcripts (CEG = 80; PPG = 48) from the pre- and posttests; a second rater from the same institute then blindly evaluated 12% of the transcripts (CEG = 8; PPG = 8) to establish inter-rater reliability.

Table 10 shows the inter-rater reliability analysis using the Pearson product-moment correlation. The two raters’ scores were significantly correlated for both message appropriateness and grammar accuracy; thus, the researcher proceeded with the analyses.
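The inter-rater check described here can be illustrated with a minimal sketch of the Pearson product-moment correlation. The function name `pearson_r` and the two lists of ratings are hypothetical and serve only as an illustration; they are not the study's data or the software the researcher used.

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length samples with at least 2 values")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-products and sums of squares of the deviations from the means
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical error-free sentence ratios given to the same six transcripts
# by two raters (illustrative values only, not taken from the study)
rater_1 = [0.90, 0.85, 0.95, 0.80, 1.00, 0.88]
rater_2 = [0.92, 0.80, 0.95, 0.78, 0.97, 0.90]
r = pearson_r(rater_1, rater_2)
print(f"r = {r:.2f}")
```

A value of r close to 1 indicates that the two raters ranked the transcripts similarly; whether it is significant depends on the number of transcripts rated.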

Table 10: Inter-rater Reliability Analyses (n = 16)

| Ratings | Statistical test | r |
| --- | --- | --- |
| Message appropriateness | Pearson correlation | .69** |
| Grammar accuracy | Pearson correlation | .82*** |

Note. **p < .01; ***p < .001

Table 11 depicts the results of the within-group analyses of grammar accuracy and message appropriateness. As can be seen from Table 11, error-free sentence ratio (EFS/S) increased significantly at the .01 level for CEG, whereas the change for PPG was not significant.

Table 11: Results of Paired Sample T-tests on the Pre- and Post-tests on Grammar Accuracy and Message Appropriateness

| Accuracy Indices | Group | Pretest M | Pretest SD | Posttest M | Posttest SD | t |
| --- | --- | --- | --- | --- | --- | --- |
| Error-free sentence ratio (EFS/S) | CEG | 0.88 | 0.11 | 0.93 | 0.05 | -2.86** |
| | PPG | 0.85 | 0.13 | 0.89 | 0.09 | -1.30 (n.s.) |
| Appropriate sentence ratio (AS/S) | CEG | 0.93 | 0.06 | 0.93 | 0.07 | 0.23 (n.s.) |
| | PPG | 0.96 | 0.05 | 0.92 | 0.08 | 2.62* |

Note. CEG = Complete Experimental Group (n = 20); PPG = Partial Participation Group (n = 12); M = the sum mean from Skynet-AI and Cleverbot; *p < .05; **p < .01; n.s. = p > .05

Appropriate sentence ratio (AS/S), on the other hand, showed the opposite picture. For CEG students, appropriateness remained high and essentially unchanged, and no significant difference was detected. For PPG students, however, message appropriateness decreased significantly at the .05 level, suggesting that students who failed to finish all six required chats tended to produce less appropriate messages in the posttest.
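The within-group comparisons reported in Table 11 rest on the paired-samples t statistic, which can be sketched as follows. The function name `paired_t` and the sample ratios are hypothetical. The sign convention (pretest minus posttest) is an assumption inferred from the negative t values that Table 11 reports for increases, not something the text states.

```python
import math

def paired_t(pre, post):
    """Paired-samples t statistic (pre minus post) and its degrees of freedom."""
    if len(pre) != len(post) or len(pre) < 2:
        raise ValueError("need two equal-length samples with at least 2 values")
    diffs = [a - b for a, b in zip(pre, post)]
    n = len(diffs)
    mean_d = sum(diffs) / n
    # Sample variance of the differences (n - 1 in the denominator)
    var_d = sum((d - mean_d) ** 2 for d in diffs) / (n - 1)
    if var_d == 0:
        raise ValueError("all differences are identical; t is undefined")
    t = mean_d / math.sqrt(var_d / n)
    return t, n - 1

# Hypothetical EFS/S values for five students (illustrative only)
pre_scores = [0.85, 0.90, 0.80, 0.88, 0.92]
post_scores = [0.90, 0.93, 0.88, 0.90, 0.95]
t_value, df = paired_t(pre_scores, post_scores)
print(f"t({df}) = {t_value:.2f}")
```

Under this convention, a negative t means the posttest ratio was higher than the pretest ratio.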

Although the paired sample t-tests showed change within CEG and PPG, the results in Table 11 reflect only students’ general learning from chatbots together with classroom instruction. To partial out classroom instruction and observe the influence of chatbots more clearly, the researcher first performed independent sample t-tests on the pretest scores of both error-free sentence ratio (EFS/S) and appropriate sentence ratio (AS/S). Table 12 shows the results of the independent sample t-tests between groups in the pretest.

Table 12: Independent Sample T-tests of Grammar Accuracy and Message Appropriateness in the Pretest

| Indices | Group | M | SD | t |
| --- | --- | --- | --- | --- |
| Error-free sentence ratio (EFS/S) | CEG | 0.88 | 0.11 | 2.07 (n.s.) |
| | PPG | 0.85 | 0.13 | |
| Appropriate sentence ratio (AS/S) | CEG | 0.93 | 0.06 | -0.84* |
| | PPG | 0.96 | 0.05 | |

Note. CEG = Complete Experimental Group (n = 20); PPG = Partial Participation Group (n = 12); M = the sum mean from Skynet-AI and Cleverbot; *p < .05; n.s. = p > .05
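A between-group comparison of this kind can be sketched with an independent-samples t statistic. The version below pools the two group variances (Student's t). Whether the researcher used the pooled or an unequal-variance form is not stated, so the choice here is an assumption, and the function name `independent_t` and the sample values are hypothetical.

```python
import math
from statistics import mean, variance

def independent_t(group_a, group_b):
    """Pooled-variance t statistic for two independent samples, with its df."""
    n_a, n_b = len(group_a), len(group_b)
    if n_a < 2 or n_b < 2:
        raise ValueError("each group needs at least 2 values")
    df = n_a + n_b - 2
    # Pooled variance weights each sample variance by its degrees of freedom
    pooled = ((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)) / df
    std_error = math.sqrt(pooled * (1 / n_a + 1 / n_b))
    return (mean(group_a) - mean(group_b)) / std_error, df

# Hypothetical pretest ratios for two small groups (illustrative only)
group_1 = [0.88, 0.92, 0.80, 0.95, 0.85]
group_2 = [0.84, 0.90, 0.78, 0.88]
t_value, df = independent_t(group_1, group_2)
print(f"t({df}) = {t_value:.2f}")
```

With the group sizes reported in the table notes (n = 20 and n = 12), this form of the test would have 30 degrees of freedom.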

As indicated in Table 12, CEG and PPG did not differ in error-free sentence ratio (EFS/S), but they did differ in appropriate sentence ratio (AS/S). To ensure comparability of appropriate sentence ratio (AS/S) in the posttest, the researcher performed a one-way analysis of covariance (ANCOVA), using the pretest appropriate sentence ratio (AS/S) scores as the covariate to adjust for the initial proficiency difference. An independent sample t-test was computed for error-free sentence ratio (EFS/S) between groups. The results are shown in Table 13. When the two groups were compared, CEG’s gain in grammar accuracy held, whereas the group difference in appropriateness was not significant.

Table 13: Independent Sample T-test of Grammar Accuracy and One-way ANCOVA of Message Appropriateness in the Posttest

| Indices | Group | M | SD | t |
| --- | --- | --- | --- | --- |
| Error-free sentence ratio (EFS/S) | CEG | 0.93 | 0.05 | -2.29* |
| | PPG | 0.89 | 0.09 | |

| Indices | Group | M | SD | F |
| --- | --- | --- | --- | --- |
| Appropriate sentence ratio (AS/S) | CEG | 0.93 | 0.07 | 1.52 (n.s.) |
| | PPG | 0.92 | 0.08 | |

Note. CEG = Complete Experimental Group (n = 20); PPG = Partial Participation Group (n = 12); M = the sum mean from Skynet-AI and Cleverbot; *p < .05; n.s. = p > .05
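The covariate adjustment behind a one-way ANCOVA can be sketched by comparing two regression models: one with the covariate only, and one with the covariate plus separate group intercepts and a common slope. The function name `ancova_f` and the data are hypothetical, and the sketch assumes a single covariate with equal slopes across groups. It is not a reconstruction of the researcher's actual computation.

```python
def ancova_f(groups):
    """F test for a group effect adjusted for one covariate.

    groups: list of (covariate_values, outcome_values) pairs, one per group.
    Returns (F, df_between, df_error).
    """
    def deviation_sums(xs, ys):
        mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
        sxx = sum((x - mean_x) ** 2 for x in xs)
        syy = sum((y - mean_y) ** 2 for y in ys)
        sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        return sxx, syy, sxy

    # Reduced model: outcome regressed on the covariate, ignoring groups
    all_x = [x for xs, _ in groups for x in xs]
    all_y = [y for _, ys in groups for y in ys]
    sxx_t, syy_t, sxy_t = deviation_sums(all_x, all_y)
    sse_reduced = syy_t - sxy_t ** 2 / sxx_t

    # Full model: separate group intercepts, common within-group slope
    sxx_w = syy_w = sxy_w = 0.0
    for xs, ys in groups:
        sxx, syy, sxy = deviation_sums(xs, ys)
        sxx_w, syy_w, sxy_w = sxx_w + sxx, syy_w + syy, sxy_w + sxy
    sse_full = syy_w - sxy_w ** 2 / sxx_w

    df_between = len(groups) - 1
    df_error = len(all_x) - len(groups) - 1
    f_value = ((sse_reduced - sse_full) / df_between) / (sse_full / df_error)
    return f_value, df_between, df_error

# Hypothetical (pretest, posttest) values for two small groups
group_a = ([1, 2, 3], [1, 2, 4])
group_b = ([1, 2, 3], [3, 5, 5])
f_value, df1, df2 = ancova_f([group_a, group_b])
print(f"F({df1}, {df2}) = {f_value:.2f}")
```

The F ratio measures how much the unexplained variation in the posttest shrinks once group membership is added to a model that already contains the pretest score.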

Table 14 lists the tally counts of the four error types found in the pre- and posttests of CEG. Syntactic, morphological, and lexical errors all decreased to some degree in the posttest, whereas spelling errors nearly doubled in raw tally count. Although the total tally was higher in the posttest, CEG students also produced more sentences in the posttest, so the number of errors per sentence actually dropped.

Table 14: CEG Students’ Error Tally Count and Error per Sentence Ratio in the Pre- and Post-tests (n = 20)

| Error Types and Indices | Tally Count: Pretest | Tally Count: Posttest |
| --- | --- | --- |
| Total of Syntactic Errors | 30.00 | 21.00 |
| Total of Morphological Errors | 19.00 | 16.00 |
| Total of Lexical Errors | 10.00 | 9.00 |
| Total of Spelling Errors | 20.00 | 39.00 |
| Total of all four types of Errors | 79.00 | 85.00 |
| Mean of Sentences | 42.39 | 76.25 |
| Mean of Errors Per Sentence | 1.86 | 1.11 |

Note. The data were based on the transcripts with Cleverbot and Skynet-AI only.
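The arithmetic behind the last rows of Table 14 can be checked with a short sketch. The figures are copied from the table. The assumption that the errors-per-sentence value equals the total tally divided by the mean number of sentences is inferred from the reported numbers and is not stated in the text. The variable and function names are illustrative.

```python
# Tally counts copied from Table 14 (CEG; Cleverbot and Skynet-AI transcripts)
errors = {
    "pretest": {"syntactic": 30, "morphological": 19, "lexical": 10, "spelling": 20},
    "posttest": {"syntactic": 21, "morphological": 16, "lexical": 9, "spelling": 39},
}
mean_sentences = {"pretest": 42.39, "posttest": 76.25}

def total_errors(test):
    """Sum of the four error-type tallies for one test."""
    return sum(errors[test].values())

def errors_per_sentence(test):
    # Assumption: the ratio divides the total tally by the mean sentence count
    return total_errors(test) / mean_sentences[test]

for test in ("pretest", "posttest"):
    print(test, total_errors(test), round(errors_per_sentence(test), 2))
# pretest 79 1.86
# posttest 85 1.11
```

Under this reading, the total tally rises from 79 to 85 while the per-sentence figure falls from 1.86 to 1.11, because the mean number of sentences grows from 42.39 to 76.25.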

Syntactic, morphological, and lexical errors appear to have decreased as a result of chatting with chatbots. The researcher re-examined all the transcripts of students who made progress in accuracy and found that CEG students corrected some errors they had made in the pretest; occasionally, the chatbots’ input might also have pushed students to notice their own errors and correct them in subsequent chats. Some examples of this change are listed below.

Chatbots’ input that might be helpful to students is italicized and boldfaced.

Pretest:

AI: What are “Human will power is often tested.”? Why do you ask?

USER: i don’t know what are you mean

Posttest:

AI: I don’t think I have adapted my AI to discuss that.

USER: Why

AI: What don’t you get?

USER: i do not know what you mean. (#26-Skynet-AI)

Pretest:

Cleverbot: I learn from humans. What have you learned?

User: Everyone who was smart then me.

Posttest:

Cleverbot: How clever are you?

User: Better than you. (#26-Cleverbot)

Pretest:

User: Are you (a) boy or (a) girl?

Cleverbot: I am a boy.

User: Are you (a) girl?

Cleverbot: Yes

Posttest:

User: I am so happy.

Cleverbot: Me too.

User: An apple a day keep the doctor away. (#36-Cleverbot)

Pretest:

USER: are you (a) good guy?

AI: Sure, you can describe me as good. How do you feel today.

Posttest:

USER: what is a terminator?

AI: Terminator? (#36-Skynet-AI)

Pretest:

Cleverbot: A robot would say that.

User: I (am) really a human.

Cleverbot: You are a robot.

Posttest:

Cleverbot: So you admit that I am more clever than you?

User: No.

Cleverbot: Why?

User: I am smarter than you. (#38-Cleverbot)

Pretest:

Cleverbot: How am I not real?

User: Because you are a rober (robot).

Cleverbot: How?

User: Robber (robot).

…

Cleverbot: You are a robot.

User: No! I’m not a robot. (#40-Cleverbot)

Pretest:

USER: do you know how to play game(s).

AI: How to play game? I will look for instructions and attempt to find a solution.

Posttest:

USER: do you know special force?

AI: I am not sure. Does special force interest you?

USER: do you know special forces games.

AI: I love games. (#7-Skynet-AI)

USER: OK~, can you prove (“prove” is a transitive verb)?

AI: I can, can you?

…

AI: Prove to me that you actually exist.

(a correct usage of “prove” demonstrated by the bot) (#35-pretest-Skynet-AI)

Cleverbot: Who are you?

User: I am you father.

Cleverbot: Nonsense. You're not my father.

User: Yes I am you father.

Cleverbot: No, I'm your father. (#02-posttest-Cleverbot)

Figure 11 depicts the pre- and posttest score change of CEG and PPG in grammar accuracy. No significant difference between the groups was found in the pretest, whereas a significant difference was found in the posttest.

Figure 11: The Error-free Sentence Ratio (EFS/S) Change between Pre- and Post-tests

The increase in grammar accuracy might be due to exposure to the chatbots’ language.

Although the current study did not set out to trace comprehensively every instance of students noticing a mismatch between their own output and the chatbots’ correct language, the researcher found some extracts suggesting that students noticed the correct forms used by the chatbots and then modified their mental grammar, as shown by their correct production after encountering those forms.

Two extracts are presented below to support this possibility. Extract 1 shows that the learner might have noticed the need for a copula before a comparative adjective from reading Cleverbot’s message “So you admit that I am more clever than you?” and then produced the correct sentence “I am smarter than you.” The second extract illustrates the possibility of learners picking up correct spelling from chatbots. Student #40 misspelt “robot” twice, as “rober” and “robber,” but self-corrected the mistake after reading Cleverbot’s message, “You are a robot.”

Pretest:

Cleverbot: A robot would say that.

User: I (am) really a human.

Cleverbot: You are a robot.

Posttest:

Cleverbot: So you admit that I am more clever than you?

User: No.

Cleverbot: Why?

User: I am smarter than you. (#38-Cleverbot)

Cleverbot: How am I not real?

User: Because you are a rober (robot).

Cleverbot: How?

User: Robber (robot).

…

Cleverbot: You are a robot.

User: No! I’m not a robot. (#40-pretest-Cleverbot)

Thus, the finding regarding grammar accuracy matched the strand of interactionism that stresses the importance of interaction between learner and interlocutor (Ellis, 2008; Lightbown & Spada, 2004). The other strand of interactionism proposes the benefit of modified input, but the researcher found that the chatbots were unable to offer modified input for better comprehension. The examples below show the chatbots’ inability to use modified English to help students understand.

Cleverbot: Do you love soft things?

User: What is soft things?

Cleverbot: You aren't thinking.

User: What?

Cleverbot: I said, you are not thinking.

User: I don't know what is soft things.

Cleverbot: What should we do?

User: I don't know.

Cleverbot: You never know, you fool. (#38-treatment chat-Cleverbot)

AI: Good. Bots often suffer from the pinocchio complex!

USER: pinocchio complex!??

AI: Pinocchio complex? Why do you ask?

USER: i do not what you say

AI: You value my opinion don't you. (#37-treatment chat-Skynet-AI)

AI: Wow! What were you up to today?

USER: what do you mean

AI: You know what I mean.

USER: but i do not know whatyou mean (#37-treatment chat-Skynet-AI)

Extracts 3 to 5 all show the chatbots’ inability to paraphrase themselves in modified English when students asked for clarification. Thus, the evidence collected supported only the strand of interactionism that emphasizes the importance of interpersonal interactions.

On the other hand, message appropriateness did not change even when the initial between-group difference was controlled using a covariate. Figure 12 shows the pre- and posttest change of both CEG and PPG students.

Figure 12: The Appropriate Sentence Ratio (AS/S) Change between Pre- and Post-tests

The pretest showed a moderate difference between PPG and CEG; thus, the researcher used the pretest score as the covariate and compared the posttest scores. However, no significant difference in appropriateness was found.
