In this chapter, the proposed ability estimation is examined through an empirical
study. To investigate the performance of the proposed method, we examine the
correlation between the estimated abilities and real data; moreover, we explore the
students’ performance on the post-test and their responses across the different ability
groups. Next, we analyze whether appropriate instructional scaffolding helps students
advance their learning, and whether unclear concepts are strengthened by the proposed
personalized computer-aided question generation. Finally, user satisfaction is
investigated by a questionnaire.
7.1 System and materials
The proposed system, named AutoQuiz, provides English language learners with
computer-aided question generation. AutoQuiz is integrated into the IWiLL learning
platform (Kuo et al., 2002). For each student, an article from an online news website is
selected (see Figure 7a). After reading the article, the examinee is given a test
consisting of ten vocabulary items (see Figure 7b), five grammar items (see Figure 7c),
and three reading comprehension items (see Figure 7d). These items are generated
automatically based on his/her vocabulary, grammar, and reading comprehension levels,
respectively. When the examinee finishes the test, the score and the incorrect responses
are shown (see Figure 7e). In addition, the system shows an explicit warning near
incorrectly answered questions (see the frame in Figure 7e). To encourage examinees to
find the answer by themselves, the warning shows the number of mistakes made rather
than the answer for any question answered incorrectly fewer than three times; after the
third mistake, the warning reveals the correct answer. Finally, an error report button
allows students to report any questionable items (see the circle in Figures 7b, 7c, and
7d), which experts then check and remove if necessary.
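The feedback rule described above (show the running mistake count for the first two wrong attempts, then reveal the answer) can be sketched as follows. The function name and message strings are hypothetical illustrations, not the system's actual code.

```python
def feedback_message(mistake_count: int, correct_answer: str) -> str:
    """Warning text shown next to an incorrectly answered item.

    Sketch of the rule described in the text: for fewer than three
    mistakes, show only the mistake count so the examinee keeps trying;
    from the third mistake onward, reveal the correct answer.
    """
    if mistake_count < 3:
        return f"Incorrect ({mistake_count} mistake(s) so far) - try again"
    return f"The correct answer is: {correct_answer}"
```

A design note on this rule: withholding the answer for the first two attempts pushes examinees to re-read the article, while the cap at three attempts prevents unproductive guessing.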
A total of 2,481 items, covering vocabulary, grammar, and reading comprehension,
were automatically generated from 72 news stories used as reading materials.
These news articles were collected from several global and local online news websites:
Time For Kids (estimated grades 1-4), Voice of America (estimated grades 1-6),
China Post Online (estimated grades 1-6), Yahoo! News (estimated grades 5-6),
Student Times (estimated grade 3), and CNN (estimated grades 5-6).
Figure 7 Snapshots of the system: (a) an example of a reading material from an
online news website; (b) an example of vocabulary items; (c) an example of grammar
items; (d) an example of reading comprehension items; (e) an example of a score
result with an explicit warning.
7.2 Participants and procedure
The participants in this study were second-grade senior high school students in
Taiwan who study English as a foreign language (EFL). During the experiment, the
subjects were asked to participate in twelve activities, each consisting of reading
an article and then taking a test. Each test was composed of ten vocabulary questions,
five grammar questions, and three reading comprehension questions. After each activity,
the proficiency levels of the subjects in the experimental group were estimated. The
grade level in this study is defined from one to six, corresponding to the six semesters
of Taiwanese senior high school. In addition, there were a pre-test and a post-test for
evaluating the learners’ proficiency; both were drawn from the College Entrance
Examination and had similar degrees of difficulty.
There are two investigations in the empirical study: one validates the accuracy
of the proposed ability estimation with real data, and the other evaluates the
performance of the proposed personalized computer-aided question generation. For the
first investigation, the participants were divided into two groups: a control group
(C1: 30 students) whose ability was estimated only from current responses, and an
experimental group (E1: 47 students) whose ability estimation incorporated the history
record. In the investigation of the personalized computer-aided question generation,
the subjects were divided into two groups: a control group with general automatic quiz
generation (questions generated according to their grade in school, as in the
traditional classroom scenario; C2: 21 students), and an experimental group with
personalized automatic quiz generation (questions generated according to their language
proficiency; E2: 72 students). Notably, the subjects in each group were different
people.
7.3 The performance of the proposed ability estimation with the empirical data
To validate the accuracy of the proposed ability estimation, the subjects’ abilities
in the two groups were estimated: one based only on current responses (C1), and the
other incorporating the history record into the current ability estimation (E1). Table
9 reports the Pearson correlation coefficients between the estimated abilities (the
estimated grade is obtained by rounding the estimated score) and the post-test scores
for the three quiz types. All of the measures are significantly positively correlated.
The coefficients in the experimental group ranged from 0.44 to 0.69, while those in the
control group ranged from 0.47 to 0.54. Most of the correlation values in the
experimental group are higher than those in the control group, suggesting that
estimating ability with the history record yields a clearer relationship between the
estimated ability and the ground truth.
Table 9 The correlations between the estimated ability and the post-test in the
control group and the experimental group

                    vocabulary       grammar          reading comprehension
                    score    grade   score    grade   score    grade
Control group       0.47*    0.49**  0.54**   0.51**  0.54**   0.47*
Experimental group  0.51***  0.44**  0.55***  0.55*** 0.69***  0.65***
*p<0.05, **p<0.01, ***p<0.001
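The correlation analysis above can be sketched in a few lines of Python. The ability and score lists below are hypothetical illustrations, not the study's data.

```python
import math

def pearson_r(xs, ys):
    """Pearson's correlation coefficient between two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical illustration: estimated grades vs. post-test scores.
abilities = [2, 3, 3, 4, 5, 6]
post_test = [40, 50, 48, 55, 62, 70]
r = pearson_r(abilities, post_test)  # positive, as the pattern in Table 9
```

In practice a library routine (e.g. `scipy.stats.pearsonr`) would also supply the p-values reported in the table; the hand-rolled version only shows what the coefficient measures.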
Comparing the post-test scores across the estimated abilities (grades) is another way
to assess the accuracy of the proposed ability estimation. If the estimated abilities
are accurate, the performance of subjects at each ability level will differ from that
at other levels. Table 10 presents the mean post-test score of the subjects at
different estimated abilities in the control group and the experimental group.
Intuitively, a subject estimated at a higher ability should have a higher post-test
score than one estimated at a lower ability. A one-way analysis of variance revealed
differences in the estimated vocabulary ability (F=5.75, p=0.001), the estimated
grammar ability (F=4.71, p=0.003), and the estimated reading comprehension ability
(F=5.98, p<0.001) in the experimental group, while there were no statistical
differences for the estimated vocabulary and grammar abilities in the control group.
Notably, although the estimated reading comprehension ability in the control group
showed a significant difference, the mean scores fluctuated across abilities. The
bolded values in Table 10 are unreasonable, because the average scores of the higher
estimated abilities (grades 2, 4, and 5) in the control group were lower than those of
the lower estimated abilities (grades 1 and 3). Although there was an unreasonable
value for grade 6 of the estimated vocabulary ability in the experimental group, this
is likely because only two students were assigned to grade 6, a sample size too small
to be representative. Moreover, in the experimental group, a Bonferroni post hoc test
indicated that the performance of estimated abilities 1 and 2 differed significantly
from that of estimated abilities 5 and 6. This indicates that the proposed ability
estimation can effectively distinguish higher-ability examinees from lower-ability
ones.
Table 10 The mean post-test scores of the subjects in different estimated ability
groups for both groups, and the results of the ANOVA

Estimated  Control group                    Experimental group
ability    vocabulary  grammar  reading     vocabulary  grammar  reading
1          -           37.50    46.80       -           -        37.67
2          48.33       47.00    40.00       23.00       34.33    46.63
3          38.00       51.40    52.57       52.86       52.80    53.50
4          54.40       41.40    41.00       62.33       54.94    64.50
5          61.22       62.83    32.67       69.71       66.81    66.90
6          65.83       65.56    70.18       57.67       72.00    78.00
F score    2.67        2.54     6.12***     5.75***     4.71**   5.98***
**p<0.01, ***p<0.001
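The one-way ANOVA reported in Table 10 reduces to a ratio of between-group to within-group variance. The sketch below computes the F statistic from scratch; the grouped scores in the test are hypothetical, not the study's data.

```python
def anova_f(groups):
    """One-way ANOVA F statistic: ratio of the between-group to the
    within-group mean square, over a list of score lists (one list
    per estimated-ability group)."""
    n = sum(len(g) for g in groups)   # total number of subjects
    k = len(groups)                   # number of ability groups
    grand_mean = sum(sum(g) for g in groups) / n
    # Between-group sum of squares: spread of the group means.
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2
                     for g in groups)
    # Within-group sum of squares: spread inside each group.
    ss_within = sum(sum((x - sum(g) / len(g)) ** 2 for x in g)
                    for g in groups)
    return (ss_between / (k - 1)) / (ss_within / (n - k))
```

A large F, as in the experimental group's columns, means the ability groups differ more between themselves than subjects do within a group; a library call such as `scipy.stats.f_oneway` would also return the p-value.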
To evaluate the validity of the proposed ability estimation, a logistic regression
was performed. Table 11 shows, for each question type, the equation relating the
ability of student i and the difficulty of question j to the log odds of the
observation, which is in class 1 if student i correctly answers question j and in
class 0 otherwise. Generally, the more advanced a student’s ability, the higher the
probability that a question is answered correctly; conversely, the more difficult a
question, the lower the probability that a student answers it correctly. If the
abilities observed in the empirical study are precisely estimated, the relationship
between the estimated abilities and the dichotomous outcome should be explainable.
The results show that the regression coefficients for student ability are positive
for all three question types, while the coefficients for question difficulty are
negative. Even though the values differ slightly among the three question types, all
of them influence the dependent variable in the same direction. This supports the
assumption that the estimated abilities were accurate enough that students with
advanced proficiencies could correctly answer more difficult questions.
Table 11 The equations for each question type, giving the log odds that student i
correctly answers item j (class 1) versus incorrectly (class 0).

Question types          Equations
vocabulary              ln(p_ij / (1 - p_ij)) = -1.554 + 1.129*student_i - 0.321*question_j
grammar                 ln(p_ij / (1 - p_ij)) = -1.518 + 0.859*student_i - 1.321*question_j
reading comprehension   ln(p_ij / (1 - p_ij)) = -0.178 + 0.898*student_i - 0.783*question_j
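The fitted log-odds equations in Table 11 can be inverted through the logistic (sigmoid) function to read off predicted probabilities. The sketch below uses the vocabulary coefficients from the table; the ability and difficulty values plugged in are hypothetical.

```python
import math

def p_correct(intercept, w_ability, w_difficulty, ability, difficulty):
    """Probability of a correct answer from a fitted log-odds equation:
    ln(p / (1 - p)) = intercept + w_ability*ability + w_difficulty*difficulty.
    Inverting the log odds gives the logistic (sigmoid) function."""
    log_odds = intercept + w_ability * ability + w_difficulty * difficulty
    return 1.0 / (1.0 + math.exp(-log_odds))

# Vocabulary equation from Table 11, at a hypothetical item difficulty of 4:
p_low = p_correct(-1.554, 1.129, -0.321, ability=2, difficulty=4)
p_high = p_correct(-1.554, 1.129, -0.321, ability=5, difficulty=4)
# The more able student gets a higher predicted probability of success,
# matching the positive ability coefficient discussed in the text.
```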
7.4 Student performance
To understand the influence of personalized automatic quiz generation, we
evaluated the effects of the tests on student performance. The post-test scores of
the experimental group (E2) and the control group (C2) were calculated and compared.
Consistent with the previous results, the estimated abilities of the subjects in the
experimental group were more accurate than those in the control group. We assume that
appropriate instructional scaffolding can help students advance their learning when
their abilities are effectively identified.

Table 12 presents the descriptive statistics and the results of t-tests between the
pretest and post-test. The independent t-tests (p=0.92 in the pretest and p=0.51 in
the post-test) showed a similar effect on the post-test between the experimental group
and the control group. One explanation may be the short duration (only five weeks) of
the treatment in the experiment, whereas Klinkenberg et al. (2011) conducted a
one-year experiment and Barla et al. (2010) employed their method for a winter-term
course. However, it is noticeable that the average score of the experimental group in
the pretest was lower than that of the control group, but in the post-test the
experimental group made great progress and surpassed the control group. Additionally,
the paired-sample t-test showed a significant difference between the pretest and the
post-test in the experimental group (p<0.001), while the control group showed no
statistically significant difference (p>0.05). This indicates that subjects in the
experimental group, given appropriate support, can surpass their own past performance
when their learning status is successfully recognized.
Table 12 The results of the pretest and post-test in the control group and the
experimental group

                    Pretest          Post-test        Paired sample
                    mean     std.    mean     std.    t-test
Control group       53.23    19.35   56.70    17.99   1.57
Experimental group  52.83    16.67   59.28    16.01   3.71***
independent t-test  0.20             0.66
***p<0.001
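The paired-sample t-test above compares each student's pretest and post-test scores through the distribution of individual gains. A minimal sketch with hypothetical scores:

```python
import math
import statistics

def paired_t(pre, post):
    """Paired-sample t statistic: the mean per-student score gain
    divided by the standard error of the gains."""
    diffs = [b - a for a, b in zip(pre, post)]
    mean_d = statistics.mean(diffs)
    sd_d = statistics.stdev(diffs)  # sample standard deviation of the gains
    return mean_d / (sd_d / math.sqrt(len(diffs)))
```

A large positive t, as for the experimental group in Table 12, means the gains are consistently positive relative to their spread; the p-value would come from a t distribution with n-1 degrees of freedom (e.g. via `scipy.stats.ttest_rel`).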
To further investigate the learning effectiveness, we studied the difference in
student performance at each difficulty level between the pretest and post-test. The
numbers of correctly answered questions at the six difficulty levels in the pretest
and the post-test were computed. The tests comprise 28 items across six difficulty
levels (six, three, six, three, seven, and three questions, corresponding to levels
one through six). A chi-square test for homogeneity of proportions was conducted to
compare the proportions between the pretest and post-test. Table 13 presents the two
contingency tables, for the control group and for the second graders of the
experimental group. The results for the experimental group (χ²(5)=16.24, p<0.01)
show significantly different proportions between the pretest and post-test, while the
control group (χ²(5)=7.46, p>0.05) has similar percentages across the six difficulty
levels. This change reveals that the adaptive test affected the ability of the
students in the experimental group. A posteriori comparison further reveals that the
numbers of correctly answered questions at levels two and six in the post-test were
statistically higher than those in the pretest, whereas the numbers at levels one and
four in the post-test were significantly lower than those in the pretest. This
suggests that the number of correctly answered questions at higher difficulty levels
increased under the personalized quiz strategy.
Table 13 Contingency tables for the numbers of correctly answered questions per
difficulty level in the pretest and post-test.
Difficulty Level 1 2 3 4 5 6
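The chi-square test for homogeneity compares the pretest and post-test counts per difficulty level against the counts expected if the proportions were identical. A sketch of the statistic, with hypothetical counts:

```python
def chi_square_stat(row1, row2):
    """Chi-square statistic for a 2 x k contingency table, e.g. correctly
    answered questions per difficulty level in the pretest vs. post-test."""
    total = sum(row1) + sum(row2)
    chi2 = 0.0
    for j in range(len(row1)):
        col_total = row1[j] + row2[j]
        for row in (row1, row2):
            # Expected count if both rows shared the same proportions.
            expected = sum(row) * col_total / total
            chi2 += (row[j] - expected) ** 2 / expected
    return chi2
```

With k=6 difficulty levels, the statistic is compared against a chi-square distribution with (2-1)(6-1) = 5 degrees of freedom, which is where the reported critical values come from.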
7.5 Unclear concept enhancement
The aim of the quiz strategy is to enhance students’ understanding of the unclear
concepts behind incorrect responses. We measured the rate at which students
successfully corrected their mistakes on repeated concepts (denoted the rectification
rate) in the experimental group (E2) and the control group (C2), in order to determine
the effect of generating items with repeated concepts and an appropriate difficulty.
For the comparison, both an independent-samples t-test and a Mann–Whitney U test were
performed: ideally the rates in the two groups are normally distributed, justifying a
t-test, but because of the unequal sample sizes the nonparametric test was used as a
complement. The results of the rectification rate in the two groups are shown in
Table 14. The rectification rate in the experimental group was on average
significantly higher than in the control group (t=6.597, p<0.001 in the
independent-samples t-test and Z=-5.974, p<0.001 in the Mann–Whitney U test).
Moreover, the subjects in the experimental group on average corrected more than half
of their unclear concepts and answered similar questions correctly. This indicates
that a personalized approach helps learners correct previous mistakes.
Table 14 The mean and standard deviation of the rectification rate.
Group Mean Std.
Experimental group 0.542 0.290
Control group 0.115 0.111
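The rectification rate and the Mann–Whitney U statistic used above can be sketched as follows. Function names and the per-student rates are illustrative, not the study's data.

```python
def rectification_rate(mistakes, corrected):
    """Share of previously missed concepts later answered correctly."""
    return corrected / mistakes if mistakes else 0.0

def mann_whitney_u(xs, ys):
    """Mann-Whitney U statistic for xs against ys: the number of pairs
    (x, y) with x > y, counting ties as one half."""
    u = 0.0
    for x in xs:
        for y in ys:
            if x > y:
                u += 1.0
            elif x == y:
                u += 0.5
    return u

# Hypothetical per-student rectification rates for two groups:
exp_rates = [0.6, 0.5, 0.7]
ctrl_rates = [0.1, 0.2, 0.0]
u = mann_whitney_u(exp_rates, ctrl_rates)  # large U favours the first group
```

The nonparametric U test needs no normality assumption, which is why it complements the t-test when the group sizes are as unbalanced as here (72 vs. 21 students).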
7.6 User satisfaction
To evaluate the performance of the automatic question generation, six questions in a
questionnaire concerning the subjects’ perceptions were investigated. Subjects in the
experimental group (E2) filled out a questionnaire that elicited information about the
examinees’ experience and the quality of the generated questions. The questionnaire
items were taken from Wilson, Boyd, Chen, and Jamal (2010), and a five-point Likert
scale was employed. Overall, most of the questions received favorable scores. Table 15
displays the detailed questions with their mean scores and standard deviations. From
the results, the quality of the interface and the functionality of the generated
questions received high agreement. Most subjects agreed that the adaptive question
selection strategy could help them identify strengths and weaknesses, so that they
could improve their skills and prepare well for exams. Items six, seven, and eight
assessed the quality of the generated questions in the three categories, and item nine
asked the subjects to self-assess their English ability after using the adaptive test
environment.
Table 15 Questionnaire results.

    Items                                                           Mean  SD
 1  The news interface is easy to use (Wilson et al., 2011).        3.89  0.99
 2  The test interface is easy to use (Wilson et al., 2011).        3.86  0.95
 3  Taking the quiz has helped me to evaluate my strengths and
    weaknesses (Wilson et al., 2011).                               4.00  0.67
 4  Taking the quiz has helped me to identify areas of knowledge
    that need improvement (Wilson et al., 2011).                    4.03  0.64
 5  Taking the quiz is useful preparation for exams (Wilson et
    al., 2011).                                                     3.89  0.70
 6a I clearly understood the vocabulary questions on the quiz
    (Wilson et al., 2011).                                          3.27  0.99
 6b I clearly understood the grammar questions on the quiz
    (Wilson et al., 2011).                                          3.46  0.99
 6c I clearly understood the reading comprehension questions on
    the quiz (Wilson et al., 2011).                                 3.38  0.95
 7a Compared to traditional manual questions, I can accept the
    quality of the vocabulary questions on the quiz.                3.57  0.99
 7b Compared to traditional manual questions, I can accept the
    quality of the grammar questions on the quiz.                   3.38  1.11
 7c Compared to traditional manual questions, I can accept the
    quality of the reading comprehension questions on the quiz.     3.59  1.04
 8a Compared to traditional manual questions, I agree that the
    quality of the vocabulary questions is comparable.              3.59  0.96
 8b Compared to traditional manual questions, I agree that the
    quality of the grammar questions is comparable.                 3.43  1.04
 8c Compared to traditional manual questions, I agree that the
    quality of the reading comprehension questions is comparable.   3.46  1.07
 9a I feel that I have made progress in my vocabulary skills.       3.62  0.79
 9b I feel that I have made progress in my grammar skills.          3.41  0.76
 9c I feel that I have made progress in my reading comprehension
    skills.                                                         3.81  0.81
Figure 8 displays charts of these items, with responses ranging from strongly
agree (5) to strongly disagree (1) for the vocabulary, grammar, and reading
comprehension items. More than 80% of the participants understood the generated
questions and agreed that they were acceptable. Compared to traditional manual
questions, the automatically generated items were viewed as acceptable, especially
the vocabulary items, which 92% of the subjects considered close to traditional
items. This supports the performance of the proposed automatic question generation
and the usefulness of the generated questions. Finally, the results show that more
than 90% of examinees felt that their English had progressed.
Figure 8 Percentages of responses, from strongly agree to strongly disagree, for
item six (upper left), item seven (upper right), item eight (lower left), and item
nine (lower right).