In this chapter, the proposed ability estimation is examined through an empirical
study. To investigate the performance of the proposed method, we examine the
correlation between the estimated abilities and real data; moreover, we explore the
students’ performance on the post-test and their responses across the different ability
groups. Next, we analyze whether appropriate instructional scaffolding helps students
advance their learning, and whether unclear concepts are strengthened by the proposed
personalized computer-aided question generation. Finally, user satisfaction is
investigated by a questionnaire.
7.1 System and materials
The proposed system, named AutoQuiz, provides English language learners with
computer-aided question generation. AutoQuiz is integrated into the IWiLL learning
platform (Kuo et al., 2002). For each student, an article from an online news website is
selected (see Figure 7a). After reading the article, the examinee is given a test
consisting of ten vocabulary items (see Figure 7b), five grammar items (see Figure 7c),
and three reading comprehension items (see Figure 7d). These items are generated
automatically based on his/her vocabulary, grammar, and reading comprehension levels,
respectively. When the examinee finishes the test, the score and the incorrect responses
are shown (see Figure 7e). In addition, the system shows an explicit warning near
incorrectly answered questions (see the frame in Figure 7e). To encourage examinees to
find the answer by themselves, the warning shows the number of mistakes made rather
than the answer for any question answered incorrectly fewer than three times; after the
third mistake, the warning reveals the correct answer. Finally, an error report button
allows students to report any questionable items (see the circle in Figures 7b, 7c, and
7d), which experts then check and remove if necessary.
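The feedback rule described above (show the running mistake count for the first two wrong attempts, then reveal the answer) can be sketched as follows. The function name and message strings are hypothetical illustrations, not the system's actual code.

```python
def feedback_message(mistake_count: int, correct_answer: str) -> str:
    """Warning text shown next to an incorrectly answered item.

    Sketch of the rule described in the text: for fewer than three
    mistakes, show only the mistake count so the examinee keeps trying;
    from the third mistake onward, reveal the correct answer.
    """
    if mistake_count < 3:
        return f"Incorrect ({mistake_count} mistake(s) so far) - try again"
    return f"The correct answer is: {correct_answer}"
```

A design note on this rule: withholding the answer for the first two attempts pushes examinees to re-read the article, while the cap at three attempts prevents unproductive guessing.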
A total of 2,481 items, covering vocabulary, grammar, and reading comprehension,
were automatically generated from 72 news stories used as reading materials.
These news articles were collected from several global and local online news websites:
Time For Kids (estimated grades 1-4), Voice of America (estimated grades 1-6),
China Post Online (estimated grades 1-6), Yahoo! News (estimated grades 5-6),
Student Times (estimated grade 3), and CNN (estimated grades 5-6).
Figure 7 Snapshots of the system: (a) an example of a reading material from an
online news website; (b) an example of vocabulary items; (c) an example of grammar
items; (d) an example of reading comprehension items; (e) an example of a score
result with an explicit warning.
7.2 Participants and procedure
The participants in this study were second-grade senior high school students in
Taiwan who study English as a foreign language (EFL). During the experiment, the
subjects were asked to participate in twelve activities, each consisting of reading
an article and then taking a test. Each test was composed of ten vocabulary questions,
five grammar questions, and three reading comprehension questions. After each activity,
the proficiency levels of the subjects in the experimental group were estimated. The
grade level in this study is defined from one to six, corresponding to the six semesters
of Taiwanese senior high school. In addition, there were a pre-test and a post-test for
evaluating the learners’ proficiency; both were drawn from the College Entrance
Examination and had similar degrees of difficulty.
There are two investigations in the empirical study: one validates the accuracy
of the proposed ability estimation with real data, and the other evaluates the
performance of the proposed personalized computer-aided question generation. For the
first investigation, the participants were divided into two groups: a control group
(C1: 30 students) whose ability was estimated only from current responses, and an
experimental group (E1: 47 students) whose ability estimation incorporated the history
record. In the investigation of the personalized computer-aided question generation,
the subjects were divided into two groups: a control group with general automatic quiz
generation (questions generated according to their grade in school, as in the
traditional classroom scenario; C2: 21 students), and an experimental group with
personalized automatic quiz generation (questions generated according to their language
proficiency; E2: 72 students). Notably, the subjects in each group were different
people.
7.3 The performance of the proposed ability estimation with the empirical data
To validate the accuracy of the proposed ability estimation, the subjects’ abilities
in the two groups were estimated: one based only on current responses (C1), and the
other incorporating the history record into the current ability estimation (E1). Table
9 reports the Pearson correlation coefficients between the estimated abilities (the
estimated grade is obtained by rounding the estimated score) and the post-test scores
for the three quiz types. All of the measures are significantly positively correlated.
The coefficients in the experimental group ranged from 0.44 to 0.69, while those in the
control group ranged from 0.47 to 0.54. Most of the correlation values in the
experimental group are higher than those in the control group, suggesting that
estimating ability with the history record yields a clearer relationship between the
estimated ability and the ground truth.
Table 9 The correlations between the estimated ability and the post-test in the
control group and the experimental group

                    vocabulary       grammar          reading comprehension
                    score    grade   score    grade   score    grade
Control group       0.47*    0.49**  0.54**   0.51**  0.54**   0.47*
Experimental group  0.51***  0.44**  0.55***  0.55*** 0.69***  0.65***
*p<0.05, **p<0.01, ***p<0.001
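The correlation analysis above can be sketched in a few lines of Python. The ability and score lists below are hypothetical illustrations, not the study's data.

```python
import math

def pearson_r(xs, ys):
    """Pearson's correlation coefficient between two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical illustration: estimated grades vs. post-test scores.
abilities = [2, 3, 3, 4, 5, 6]
post_test = [40, 50, 48, 55, 62, 70]
r = pearson_r(abilities, post_test)  # positive, as the pattern in Table 9
```

In practice a library routine (e.g. `scipy.stats.pearsonr`) would also supply the p-values reported in the table; the hand-rolled version only shows what the coefficient measures.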
Comparing the post-test scores across the estimated abilities (grades) is another way
to assess the accuracy of the proposed ability estimation. If the estimated abilities
are accurate, the performance of subjects at each ability level will differ from that
at other levels. Table 10 presents the mean post-test score of the subjects at
different estimated abilities in the control group and the experimental group.
Intuitively, a subject estimated at a higher ability should have a higher post-test
score than one estimated at a lower ability. A one-way analysis of variance revealed
differences in the estimated vocabulary ability (F=5.75, p=0.001), the estimated
grammar ability (F=4.71, p=0.003), and the estimated reading comprehension ability
(F=5.98, p<0.001) in the experimental group, while there were no statistical
differences for the estimated vocabulary and grammar abilities in the control group.
Notably, although the estimated reading comprehension ability in the control group
showed a significant difference, the mean scores fluctuated across abilities. The
bolded values in Table 10 are unreasonable, because the average scores of the higher
estimated abilities (grades 2, 4, and 5) in the control group were lower than those of
the lower estimated abilities (grades 1 and 3). Although there was an unreasonable
value for grade 6 of the estimated vocabulary ability in the experimental group, this
is likely because only two students were assigned to grade 6, a sample size too small
to be representative. Moreover, in the experimental group, a Bonferroni post hoc test
indicated that the performance of estimated abilities 1 and 2 differed significantly
from that of estimated abilities 5 and 6. This indicates that the proposed ability
estimation can effectively distinguish higher-ability examinees from lower-ability
ones.
Table 10 The mean post-test scores of the subjects in different estimated ability
groups for both groups, and the results of the ANOVA

Estimated  Control group                    Experimental group
ability    vocabulary  grammar  reading     vocabulary  grammar  reading
1          -           37.50    46.80       -           -        37.67
2          48.33       47.00    40.00       23.00       34.33    46.63
3          38.00       51.40    52.57       52.86       52.80    53.50
4          54.40       41.40    41.00       62.33       54.94    64.50
5          61.22       62.83    32.67       69.71       66.81    66.90
6          65.83       65.56    70.18       57.67       72.00    78.00
F score    2.67        2.54     6.12***     5.75***     4.71**   5.98***
**p<0.01, ***p<0.001
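The one-way ANOVA reported in Table 10 reduces to a ratio of between-group to within-group variance. The sketch below computes the F statistic from scratch; the grouped scores in the test are hypothetical, not the study's data.

```python
def anova_f(groups):
    """One-way ANOVA F statistic: ratio of the between-group to the
    within-group mean square, over a list of score lists (one list
    per estimated-ability group)."""
    n = sum(len(g) for g in groups)   # total number of subjects
    k = len(groups)                   # number of ability groups
    grand_mean = sum(sum(g) for g in groups) / n
    # Between-group sum of squares: spread of the group means.
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2
                     for g in groups)
    # Within-group sum of squares: spread inside each group.
    ss_within = sum(sum((x - sum(g) / len(g)) ** 2 for x in g)
                    for g in groups)
    return (ss_between / (k - 1)) / (ss_within / (n - k))
```

A large F, as in the experimental group's columns, means the ability groups differ more between themselves than subjects do within a group; a library call such as `scipy.stats.f_oneway` would also return the p-value.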
To evaluate the validity of the proposed ability estimation, a logistic regression
was performed. Table 11 shows, for each question type, the equation relating the
ability of student i and the difficulty of question j to the log odds of the
observation, which is in class 1 if student i correctly answers question j and in
class 0 otherwise. Generally, the more advanced a student’s ability, the higher the
probability that a question is answered correctly; conversely, the more difficult a
question, the lower the probability that a student answers it correctly. If the
abilities observed in the empirical study are precisely estimated, the relationship
between the estimated abilities and the dichotomous outcome should be explainable.
The results show that the regression coefficients for student ability are positive
for all three question types, while the coefficients for question difficulty are
negative. Even though the values differ slightly among the three question types, all
of them influence the dependent variable in the same direction. This supports the
assumption that the estimated abilities were accurate enough that students with
advanced proficiencies could correctly answer more difficult questions.
Table 11 The equations for each question type, giving the log odds that student i
correctly answers item j (class 1) versus incorrectly (class 0).

Question types          Equations
vocabulary              ln(p_ij / (1 - p_ij)) = -1.554 + 1.129*student_i - 0.321*question_j
grammar                 ln(p_ij / (1 - p_ij)) = -1.518 + 0.859*student_i - 1.321*question_j
reading comprehension   ln(p_ij / (1 - p_ij)) = -0.178 + 0.898*student_i - 0.783*question_j
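The fitted log-odds equations in Table 11 can be inverted through the logistic (sigmoid) function to read off predicted probabilities. The sketch below uses the vocabulary coefficients from the table; the ability and difficulty values plugged in are hypothetical.

```python
import math

def p_correct(intercept, w_ability, w_difficulty, ability, difficulty):
    """Probability of a correct answer from a fitted log-odds equation:
    ln(p / (1 - p)) = intercept + w_ability*ability + w_difficulty*difficulty.
    Inverting the log odds gives the logistic (sigmoid) function."""
    log_odds = intercept + w_ability * ability + w_difficulty * difficulty
    return 1.0 / (1.0 + math.exp(-log_odds))

# Vocabulary equation from Table 11, at a hypothetical item difficulty of 4:
p_low = p_correct(-1.554, 1.129, -0.321, ability=2, difficulty=4)
p_high = p_correct(-1.554, 1.129, -0.321, ability=5, difficulty=4)
# The more able student gets a higher predicted probability of success,
# matching the positive ability coefficient discussed in the text.
```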
7.4 Student performance
To understand the influence of personalized automatic quiz generation, we
evaluated the effects of the tests on student performance. The post-test scores of
the experimental group (E2) and the control group (C2) were calculated and compared.
Consistent with the previous results, the estimated abilities of the subjects in the
experimental group were more accurate than those in the control group. We assume that
appropriate instructional scaffolding can help students advance their learning when
their abilities are effectively identified.

Table 12 presents the descriptive statistics and the results of t-tests between the
pretest and post-test. The independent t-tests (p=0.92 in the pretest and p=0.51 in
the post-test) showed a similar effect on the post-test between the experimental group
and the control group. One explanation may be the short duration (only five weeks) of
the treatment in the experiment, whereas Klinkenberg et al. (2011) conducted a
one-year experiment and Barla et al. (2010) employed their method for a winter-term
course. However, it is noticeable that the average score of the experimental group in
the pretest was lower than that of the control group, but in the post-test the
experimental group made great progress and surpassed the control group. Additionally,
the paired-sample t-test showed a significant difference between the pretest and the
post-test in the experimental group (p<0.001), while the control group showed no
statistically significant difference (p>0.05). This indicates that subjects in the
experimental group, given appropriate support, can surpass their own past performance
when their learning status is successfully recognized.
Table 12 The results of the pretest and post-test in the control group and the
experimental group

                    Pretest          Post-test        Paired sample
                    mean     std.    mean     std.    t-test
Control group       53.23    19.35   56.70    17.99   1.57
Experimental group  52.83    16.67   59.28    16.01   3.71***
independent t-test  0.20             0.66
***p<0.001
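The paired-sample t-test above compares each student's pretest and post-test scores through the distribution of individual gains. A minimal sketch with hypothetical scores:

```python
import math
import statistics

def paired_t(pre, post):
    """Paired-sample t statistic: the mean per-student score gain
    divided by the standard error of the gains."""
    diffs = [b - a for a, b in zip(pre, post)]
    mean_d = statistics.mean(diffs)
    sd_d = statistics.stdev(diffs)  # sample standard deviation of the gains
    return mean_d / (sd_d / math.sqrt(len(diffs)))
```

A large positive t, as for the experimental group in Table 12, means the gains are consistently positive relative to their spread; the p-value would come from a t distribution with n-1 degrees of freedom (e.g. via `scipy.stats.ttest_rel`).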
To further investigate the learning effectiveness, we studied the difference in
student performance at each difficulty level between the pretest and post-test. The
numbers of correctly answered questions at the six difficulty levels in the pretest
and the post-test were computed. The tests comprise 28 items across six difficulty
levels (six, three, six, three, seven, and three questions, corresponding to levels
one through six). A chi-square test for homogeneity of proportions was conducted to
compare the proportions between the pretest and post-test. Table 13 presents the two
contingency tables, for the control group and for the second graders of the
experimental group. The results for the experimental group (χ²(5)=16.24, p<0.01)
show significantly different proportions between the pretest and post-test, while the
control group (χ²(5)=7.46, p>0.05) has similar percentages across the six difficulty
levels. This change reveals that the adaptive test affected the ability of the
students in the experimental group. A posteriori comparison further reveals that the
numbers of correctly answered questions at levels two and six in the post-test were
statistically higher than those in the pretest, whereas the numbers at levels one and
four in the post-test were significantly lower than those in the pretest. This
suggests that the number of correctly answered questions at higher difficulty levels
increased under the personalized quiz strategy.
Table 13 Contingency tables for the numbers of correctly answered questions per
difficulty level in the pretest and post-test.
Difficulty Level 1 2 3 4 5 6
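The chi-square test for homogeneity compares the pretest and post-test counts per difficulty level against the counts expected if the proportions were identical. A sketch of the statistic, with hypothetical counts:

```python
def chi_square_stat(row1, row2):
    """Chi-square statistic for a 2 x k contingency table, e.g. correctly
    answered questions per difficulty level in the pretest vs. post-test."""
    total = sum(row1) + sum(row2)
    chi2 = 0.0
    for j in range(len(row1)):
        col_total = row1[j] + row2[j]
        for row in (row1, row2):
            # Expected count if both rows shared the same proportions.
            expected = sum(row) * col_total / total
            chi2 += (row[j] - expected) ** 2 / expected
    return chi2
```

With k=6 difficulty levels, the statistic is compared against a chi-square distribution with (2-1)(6-1) = 5 degrees of freedom, which is where the reported critical values come from.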
7.5 Unclear concept enhancement
The aim of the quiz strategy is to enhance students’ understanding of the unclear
concepts behind incorrect responses. We measured the rate at which students
successfully corrected their mistakes on repeated concepts (denoted the rectification
rate) in the experimental group (E2) and the control group (C2), in order to determine
the effect of generating items with repeated concepts and an appropriate difficulty.
For the comparison, both an independent-samples t-test and a Mann–Whitney U test were
performed: ideally the rates in the two groups are normally distributed, justifying a
t-test, but because of the unequal sample sizes the nonparametric test was used as a
complement. The results of the rectification rate in the two groups are shown in
Table 14. The rectification rate in the experimental group was on average
significantly higher than in the control group (t=6.597, p<0.001 in the
independent-samples t-test and Z=-5.974, p<0.001 in the Mann–Whitney U test).
Moreover, the subjects in the experimental group on average corrected more than half
of their unclear concepts and answered similar questions correctly. This indicates
that a personalized approach helps learners correct previous mistakes.
Table 14 The mean and standard deviation of the rectification rate.
Group Mean Std.
Experimental group 0.542 0.290
Control group 0.115 0.111
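The rectification rate and the Mann–Whitney U statistic used above can be sketched as follows. Function names and the per-student rates are illustrative, not the study's data.

```python
def rectification_rate(mistakes, corrected):
    """Share of previously missed concepts later answered correctly."""
    return corrected / mistakes if mistakes else 0.0

def mann_whitney_u(xs, ys):
    """Mann-Whitney U statistic for xs against ys: the number of pairs
    (x, y) with x > y, counting ties as one half."""
    u = 0.0
    for x in xs:
        for y in ys:
            if x > y:
                u += 1.0
            elif x == y:
                u += 0.5
    return u

# Hypothetical per-student rectification rates for two groups:
exp_rates = [0.6, 0.5, 0.7]
ctrl_rates = [0.1, 0.2, 0.0]
u = mann_whitney_u(exp_rates, ctrl_rates)  # large U favours the first group
```

The nonparametric U test needs no normality assumption, which is why it complements the t-test when the group sizes are as unbalanced as here (72 vs. 21 students).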
7.6 User satisfaction
To evaluate the performance of the automatic question generation, six questions in a
questionnaire concerning the subjects’ perceptions were investigated. Subjects in the
experimental group (E2) filled out a questionnaire that elicited information about the
examinees’ experience and the quality of the generated questions. The questionnaire
items were taken from Wilson, Boyd, Chen, and Jamal (2010), and a five-point Likert
scale was employed. Overall, most of the questions received favorable scores. Table 15
displays the detailed questions with their mean scores and standard deviations. From
the results, the quality of the interface and the functionality of the generated
questions received high agreement. Most subjects agreed that the adaptive question
selection strategy could help them identify strengths and weaknesses, so that they
could improve their skills and prepare well for exams. Items six, seven, and eight
assessed the quality of the generated questions in the three categories, and item nine
asked the subjects to self-assess their English ability after using the adaptive test
environment.
Table 15 Questionnaire results.

    Items                                                           Mean  SD
 1  The news interface is easy to use (Wilson et al., 2011).        3.89  0.99
 2  The test interface is easy to use (Wilson et al., 2011).        3.86  0.95
 3  Taking the quiz has helped me to evaluate my strengths and
    weaknesses (Wilson et al., 2011).                               4.00  0.67
 4  Taking the quiz has helped me to identify areas of knowledge
    that need improvement (Wilson et al., 2011).                    4.03  0.64
 5  Taking the quiz is useful preparation for exams (Wilson et
    al., 2011).                                                     3.89  0.70
 6a I clearly understood the vocabulary questions on the quiz
    (Wilson et al., 2011).                                          3.27  0.99
 6b I clearly understood the grammar questions on the quiz
    (Wilson et al., 2011).                                          3.46  0.99
 6c I clearly understood the reading comprehension questions on
    the quiz (Wilson et al., 2011).                                 3.38  0.95
 7a Compared to traditional manual questions, I can accept the
    quality of the vocabulary questions on the quiz.                3.57  0.99
 7b Compared to traditional manual questions, I can accept the
    quality of the grammar questions on the quiz.                   3.38  1.11
 7c Compared to traditional manual questions, I can accept the
    quality of the reading comprehension questions on the quiz.     3.59  1.04
 8a Compared to traditional manual questions, I agree that the
    quality of the vocabulary questions is comparable.              3.59  0.96
 8b Compared to traditional manual questions, I agree that the
    quality of the grammar questions is comparable.                 3.43  1.04
 8c Compared to traditional manual questions, I agree that the
    quality of the reading comprehension questions is comparable.   3.46  1.07
 9a I feel that I have made progress in my vocabulary skills.       3.62  0.79
 9b I feel that I have made progress in my grammar skills.          3.41  0.76
 9c I feel that I have made progress in my reading comprehension
    skills.                                                         3.81  0.81
Figure 8 displays charts of these items, with responses ranging from strongly
agree (5) to strongly disagree (1) for the vocabulary, grammar, and reading
comprehension items. More than 80% of the participants understood the generated
questions and agreed that they were acceptable. Compared to traditional manual
questions, the automatically generated items were viewed as acceptable, especially
the vocabulary items, which 92% of the subjects considered close to traditional
items. This supports the performance of the proposed automatic question generation
and the usefulness of the generated questions. Finally, the results show that more
than 90% of examinees felt that their English had progressed.
Figure 8 Percentages of responses, from strongly agree to strongly disagree, for
item six (upper left), item seven (upper right), item eight (lower left), and item
nine (lower right).