On the Development of a Computer-Assisted Testing System With Genetic Test Sheet-Generating Approach

Gwo-Jen Hwang, Bertrand M. T. Lin, Hsien-Hao Tseng, and Tsung-Liang Lin

Abstract—Since the last decade, computer-assisted testing has proven to be an efficient and effective way of evaluating students' learning status such that proper tutoring strategies can be adopted to improve their learning performance. A good test will not only help the instructor evaluate the learning status of the students, but also facilitate the diagnosis of the problems embedded in the students' learning process. One of the most important and challenging issues in conducting a good test is the construction of test sheets that can meet various assessment requirements. A previous study has indicated that selecting test items to best fit multiple assessment requirements can be formulated as a mixed integer programming model. The problem is known to be NP-hard in the literature and, hence, computational challenges hinder the development of efficient solution methods. Consequently, we instead seek high-quality approximate solutions in a reasonable time. Two approximation methods based upon a genetic approach are developed. Statistics from a series of computational experiments indicate that our approach is able to efficiently generate near-optimal combinations of test items that satisfy the specified requirements or constraints.

Index Terms—Computer-assisted testing, genetic algorithm (GA), mixed integer programming, test sheet generating.

I. BACKGROUND AND MOTIVATIONS

In recent years, educators have reported the importance of conducting an interactive and personalized tutoring process, which is helpful toward the training of creativity and the improvement of learning performance in children. The need for interactive and personalized tutoring environments has encouraged the development of computer-assisted-instruction (CAI) systems which are able to record the learning status of each student and provide adaptive subject materials and practice drills. Therefore, it is very important to precisely determine the learning status of each student so that proper tutoring strategies can be applied accordingly [10], [17]. A high-quality test is the major criterion for determining the learning status of students.

Computer-based tests have been proven to be more effective and efficient than traditional paper-and-pencil tests for several reasons: First, the test sheets can be composed dynamically based on the practical requirements; second, more plentiful test items can be presented in multimedia styles; third, the student testing portfolio can be recorded and analyzed to improve their learning performance [5], [15], [19].

The key to a high-quality test lies not only in the quality of the test items, but also in the way the test sheet is constructed [10], [14]. As the number of test items in an item bank is usually large and the number of feasible combinations to form test sheets thus grows exponentially, it is very difficult to find an optimal test sheet in a timely manner [3], [6], [11], [12]. Such an issue is likely to grow in importance owing to the rapid advent of Internet technologies and the fast growth of network-based educational systems and the online learning population. Along with the growth of distance learning through the Internet, computer-based assessment systems are also increasingly in demand.

Manuscript received June 15, 2003; revised October 28, 2004. This work was supported by the National Science Council of the Republic of China under Contracts NSC-91-2520-S-260-003 and NSC92-2524-S-260-002. This paper was recommended by Guest Editor S. H. Rubin.

G.-J. Hwang is with the Department of Information and Learning Technology, National University of Tainan, Tainan, Taiwan 700, R.O.C. (e-mail: gjhwang@mail.nutn.edu.tw).

B. M. T. Lin is with the Department of Information and Finance Management, Institute of Information Management, National Chiao Tung University, Hsinchu, Taiwan 300, R.O.C.

H.-H. Tseng and T.-L. Lin are with the Department of Information Management, National Chi Nan University, Pu-Li, Taiwan 545, R.O.C.

Although many computer-assisted testing systems have been proposed, few of them have addressed the problem of systematically composing test sheets for multiple assessment requirements [2], [17]. Most of the existing systems construct a test sheet by manually or randomly selecting test items from their item banks. Such manual or random test item selecting strategies are inefficient and usually are not able to simultaneously meet multiple assessment requirements. Some previous investigations attempted to employ a dynamic programming algorithm to find an optimal composition of the test items [11]. As the time complexity of the dynamic programming algorithm is exponential in terms of the size of input data, the required execution time will become unacceptably long if the number of candidate test items is large.

To cope with the increasing difficulty of developing optimal test sheets, we present two mixed integer programming models to formulate the problems of finding a set of test items that fit multiple assessment requirements. As the problems are strongly NP-hard, we propose two genetic algorithms [4], [7], [13], [16]–[18] to find high-quality approximate solutions in acceptable time. Computational experiments are also presented to study the performance of the proposed algorithms.

II. MIXED INTEGER PROGRAMMING MODELS

In an item bank, a subset of $n$ candidate test items $Q_1, Q_2, \ldots, Q_n$ will be selected for composing a test sheet. In the following subsections, we shall present two models that formulate the test sheet-generating problem under different assessment considerations. The first model, proposed in [12], aims at optimizing the discrimination degree of the generated test sheets with a specified range of assessment time and multiple other constraints. The second model, proposed in this paper, formulates the optimization of the discrimination degree of the generated test sheets with a fixed number of test items as the major constraint.

A. Speciﬁed Length of Assessment Time (SLAT) Problem

In the SLAT problem, the major consideration is to conﬁne the length required by the students to answer the selected items. Assume there are n items in the item bank and m concepts are involved in the test. The variables used in the formulated models are deﬁned as follows:

• Decision variables $x_i$, $1 \le i \le n$: $x_i$ is 1 if test item $Q_i$ is selected; 0, otherwise.

• Coefficient $t_i$, $1 \le i \le n$: Expected time needed for answering item $Q_i$.

• Coefficient $d_i$, $1 \le i \le n$: Degree of discrimination of $Q_i$.

• Coefficient $r_{ij}$, $1 \le i \le n$, $1 \le j \le m$: Degree of association between $Q_i$ and concept $C_j$.

• Right-hand side $h_j$, $1 \le j \le m$: Lower bound on the expected relevance of $C_j$.

• Right-hand side $l$: Lower bound on the expected time needed for answering the selected items.

• Right-hand side $u$: Upper bound on the expected time needed for answering the selected items.

Objective function:

$$\text{Maximize } Z = \frac{\sum_{i=1}^{n} d_i x_i}{\sum_{i=1}^{n} x_i}$$

Subject to

$$\sum_{i=1}^{n} r_{ij} x_i \ge h_j, \quad j = 1, 2, \ldots, m \tag{1}$$

$$\sum_{i=1}^{n} t_i x_i \ge l \tag{2}$$

$$\sum_{i=1}^{n} t_i x_i \le u, \quad x_i = 0 \text{ or } 1, \quad i = 1, 2, \ldots, n. \tag{3}$$

In the above formula, binary variable $x_i$ reflects the decision about whether test item $i$ is included or not. Constraint set (1) indicates that the selected items must have a total relevance no less than the expected relevance to each concept to be addressed. Constraint sets (2) and (3), respectively, specify the lower and upper limits on the time needed to answer the selected items. In the objective function, $\sum_{i=1}^{n} d_i x_i$ is the total discrimination summing over the selected test items and $\sum_{i=1}^{n} x_i$ is the number of test items selected. Therefore, the objective of this model is to select a subset of test items such that the average discrimination is maximized.
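As an illustration of the SLAT model, the following minimal Python sketch computes the objective value and checks constraints (1)–(3) for a given 0/1 selection vector. The function name `slat_evaluate`, the toy item bank, and the choice to treat an empty selection as infeasible are assumptions made for this example and are not part of the original formulation.

```python
def slat_evaluate(x, d, t, r, h, l, u):
    """Return (Z, feasible) for a 0/1 selection vector x under the SLAT model."""
    chosen = [i for i, xi in enumerate(x) if xi == 1]
    if not chosen:  # assumption: an empty test sheet is treated as infeasible
        return 0.0, False
    z = sum(d[i] for i in chosen) / len(chosen)  # average discrimination
    total_time = sum(t[i] for i in chosen)
    relevance_ok = all(  # constraint set (1): one lower bound per concept
        sum(r[i][j] for i in chosen) >= h[j] for j in range(len(h))
    )
    time_ok = l <= total_time <= u  # constraints (2) and (3)
    return z, relevance_ok and time_ok


# Toy item bank with made-up numbers: 3 items, 2 concepts.
d = [0.8, 0.4, 0.6]                        # discrimination degrees
t = [10, 20, 15]                           # expected answering times
r = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]   # item-concept relevance
h = [1.0, 0.5]                             # relevance lower bounds
z, feasible = slat_evaluate([1, 0, 1], d, t, r, h, l=20, u=30)
print(round(z, 3), feasible)
```

Selecting only the first item in this toy bank would violate both the time lower bound and the relevance bound of the second concept, so the function would report it as infeasible.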

B. Fixed Number of Test Items (FNTI) Problem

In the FNTI problem, the number of test items is specified and fixed as $\mathit{q\_num} \le n$. The variables used in this model are given as follows.

• Decision variables: $x_i$ is an integer variable that reflects the decision about which test item would be selected and designated as question $i$; $1 \le x_i \le n$, $i = 1, 2, \ldots, \mathit{q\_num}$.

• Right-hand side $h_j$, $1 \le j \le m$: lower bound on the expected relevance of concept $C_j$.

Objective function:

$$\text{Maximize } Z = \sum_{i=1}^{\mathit{q\_num}} d_{x_i}$$

Subject to

$$\sum_{i=1}^{\mathit{q\_num}} r_{x_i j} \ge h_j, \quad j = 1, 2, \ldots, m \tag{4}$$

$$x_1 \ge 1 \tag{5}$$

$$x_{i+1} > x_i, \quad 1 \le i \le \mathit{q\_num} - 1. \tag{6}$$

In the above formula, constraint set (4) indicates that the selected test items must have a total relevance that is no less than the expected relevance to each concept to be covered. Constraint sets (5) and (6) indicate that no test item can be selected twice or more. In the objective function, $\sum_{i=1}^{\mathit{q\_num}} d_{x_i}$ is the total discrimination summing over the selected test items. Therefore, the objective of this model is to select a fixed number of test items such that the total discrimination is maximized.
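The FNTI model can be checked in the same spirit. The sketch below, under the assumption that a solution is given as a list of 1-based item numbers, computes the total discrimination and verifies the relevance bounds together with the strictly increasing ordering that rules out duplicates. The name `fnti_evaluate` and the toy data are hypothetical.

```python
def fnti_evaluate(items, d, r, h, n, q_num):
    """Return (Z, feasible) for a list of 1-based item numbers under FNTI."""
    if len(items) != q_num or any(q < 1 or q > n for q in items):
        return 0.0, False  # wrong sheet length or item number out of range
    total_d = sum(d[q - 1] for q in items)  # total discrimination
    relevance_ok = all(  # constraint set (4)
        sum(r[q - 1][j] for q in items) >= h[j] for j in range(len(h))
    )
    # Constraints (5) and (6): strictly increasing numbers, hence no duplicates.
    increasing = all(a < b for a, b in zip(items, items[1:]))
    return total_d, relevance_ok and increasing


# Toy item bank with made-up numbers: 3 items, 2 concepts.
d = [0.8, 0.4, 0.6]
r = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
h = [1.0, 0.5]
print(fnti_evaluate([1, 3], d, r, h, n=3, q_num=2))
```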

III. GENETIC ALGORITHMS FOR TEST SHEET GENERATION

In this section, we shall propose two genetic algorithms (GAs), the concept lower-bound first genetic approach (CLFG) and the feasible item first genetic approach (FIFG), to find high-quality approximate solutions for the SLAT and FNTI problems. In CLFG, we shall select a set of test items to meet the lower bound on the expected relevance of each concept first, and then substitute some of the selected test items with the candidate test items to meet the upper bound and lower bound of the expected answering time. In FIFG, we confine the number of test items of the test sheet first, and then substitute some of the selected items with the candidate items to meet the lower bound on the expected relevance of each concept.

A. Concept Lower-Bound First Genetic (CLFG) Approach

To cope with the SLAT problem, we propose the CLFG approach as follows:

1) Step 1. Create Population (Encode): Let variable $S$ denote the set of initially generated chromosomes and variable $K$ be the size of the population in $S$. Chromosome $S_k$ is represented as an $n$-bit binary string $[x_{k1}, x_{k2}, \ldots, x_{kn}]$ consisting of $n$ genes, where $x_{ki}$ is either 1 or 0, indicating that test item $i$ is currently selected or not. An initial set of binary strings, such as $X = [0, 0, 1, \ldots, 0]$, is randomly generated to represent the status of each test item.
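The encoding step can be sketched in a few lines of Python. The function name `create_population` and the use of an independent fair coin flip per gene are assumptions for illustration; the text only states that the binary strings are randomly generated.

```python
import random


def create_population(n, K, rng):
    """Build K random n-bit chromosomes; bit i says whether item i is selected."""
    # Assumption: each gene is drawn independently and uniformly from {0, 1}.
    return [[rng.randint(0, 1) for _ in range(n)] for _ in range(K)]


population = create_population(n=8, K=4, rng=random.Random(0))
for chromosome in population:
    print(chromosome)
```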

2) Step 2. Fitness Ranking: To satisfy the constraints for the lower bound on the expected relevance of each concept, we define a penalty function to approximate the constraints as $R = dc \times ipt$, where $R$ is a penalty score, $dc = \sum_{j=1}^{m} \max\{h_j - \sum_{i=1}^{n} r_{ij} x_i, 0\}$ is the sum of deviations between the relevance of each concept in the currently selected test items and the corresponding lower bound, and $ipt$ is the penalty weight defined by the instructor.

Moreover, two penalty terms are needed to specify the penalty values when the total testing time of the selected test items is less than the lower bound or greater than the upper bound. For the selected test items that have a total testing time of less than the lower bound, the penalty function is $\alpha = w \times dtl \times \mathit{ipt\_l}$, where $w = \sum_{i=1}^{n} x_i d_i / \mathrm{average}(u, l)$ represents the average discrimination weight of a chromosome, $dtl = \max\{l - \sum_{i=1}^{n} t_i x_i, 0\}$, and $\mathit{ipt\_l}$ is the user-defined penalty weight penalizing the violation of the lower bound constraint.

For the selected test items that have a total testing time greater than the upper bound, the penalty function is $\beta = w \times dtu \times \mathit{ipt\_u}$, where $w = \sum_{i=1}^{n} x_i d_i / \mathrm{average}(u, l)$ represents the average discrimination weight of a chromosome, $dtu = \max\{\sum_{i=1}^{n} t_i x_i - u, 0\}$, and $\mathit{ipt\_u}$ is a user-defined penalty weight for the case where the upper bound constraint is violated. The evaluation function is aggregated from the aforementioned penalties and defined as

$$v(S_k) = \frac{\sum_{i=1}^{n} d_i x_i - \alpha - \beta - R}{\sum_{i=1}^{n} x_i}.$$
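The penalized evaluation function for CLFG can be written out as a short Python sketch. It assumes the reading in which the relevance penalty and the two time penalties are subtracted from the total discrimination before dividing by the number of selected items; the variable names `penalty_low` and `penalty_high`, the zero score for an empty chromosome, and the toy data are choices made for this example.

```python
def clfg_fitness(x, d, t, r, h, l, u, ipt, ipt_l, ipt_u):
    """Penalized average discrimination of a 0/1 chromosome x."""
    n, m = len(x), len(h)
    n_selected = sum(x)
    if n_selected == 0:  # assumption: an empty sheet scores zero
        return 0.0
    total_d = sum(d[i] * x[i] for i in range(n))
    total_t = sum(t[i] * x[i] for i in range(n))
    # dc: total shortfall below the relevance lower bounds, over all concepts.
    dc = sum(max(h[j] - sum(r[i][j] * x[i] for i in range(n)), 0.0)
             for j in range(m))
    R = dc * ipt
    w = total_d / ((u + l) / 2.0)  # discrimination weight, average(u, l)
    penalty_low = w * max(l - total_t, 0.0) * ipt_l    # sheet too short
    penalty_high = w * max(total_t - u, 0.0) * ipt_u   # sheet too long
    return (total_d - penalty_low - penalty_high - R) / n_selected


# Toy item bank with made-up numbers: 3 items, 2 concepts.
d = [0.8, 0.4, 0.6]
t = [10, 20, 15]
r = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
h = [1.0, 0.5]
params = dict(l=20, u=30, ipt=2.0, ipt_l=1.0, ipt_u=1.0)
good = clfg_fitness([1, 0, 1], d, t, r, h, **params)  # meets all constraints
bad = clfg_fitness([1, 0, 0], d, t, r, h, **params)   # too short, low relevance
print(good, bad)
```

A feasible chromosome incurs no penalty, so its fitness equals its average discrimination, while the infeasible one is pushed below it.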

3) Step 3. Selection: The roulette wheel approach is adopted in the fitness-proportional selection procedure, which selects a new population with respect to the probability distribution based on fitness values. The probability that chromosome $S_k$ is selected is defined as $p_k = v(S_k)/V$, where

$$V = \sum_{k=1}^{\mathit{pop\_size} + \mathit{offspring\_size}} v(S_k).$$
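Roulette wheel selection can be sketched as follows. Because penalized fitness values may be negative, this example shifts them so that all weights are non-negative before spinning the wheel; that shift, the uniform fallback when every weight is zero, and the function name `roulette_select` are assumptions of the sketch and are not described in the text.

```python
import random


def roulette_select(population, fitness, k, rng):
    """Draw k chromosomes with probability proportional to (shifted) fitness."""
    lowest = min(fitness)
    shift = -lowest if lowest < 0 else 0.0  # assumption: remove negative weights
    weights = [f + shift for f in fitness]
    V = sum(weights)
    if V == 0:  # assumption: fall back to a uniform draw
        return [rng.choice(population) for _ in range(k)]
    chosen = []
    for _ in range(k):
        spin = rng.random() * V
        acc = 0.0
        for chromosome, weight in zip(population, weights):
            acc += weight
            if spin < acc:
                chosen.append(chromosome)
                break
        else:  # guard against floating-point rounding at the wheel's end
            chosen.append(population[-1])
    return chosen


picks = roulette_select(["a", "b", "c"], [0.0, 0.0, 10.0], 100, random.Random(0))
print(picks.count("c"))
```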

4) Step 4. Crossover: The one-cut-point method is used to perform the “crossover” operation by randomly selecting a cut point and exchanging the right parts of two parents to generate offspring. In this application, the value of the crossover rate is 0.2, which was derived from the results of a series of preliminary experiments.
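A minimal sketch of one-cut-point crossover follows. The helper `maybe_crossover`, which applies the operator to a pair of parents with probability equal to the crossover rate, is one common way to use such a rate and is an assumption here; the text does not say how the rate is applied.

```python
import random


def one_cut_crossover(p1, p2, rng):
    """Swap the parts to the right of a random cut point (length >= 2)."""
    cut = rng.randint(1, len(p1) - 1)  # cut falls strictly inside the string
    return p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]


def maybe_crossover(p1, p2, rng, rate=0.2):
    """Assumption: each parent pair is crossed with probability `rate`."""
    if rng.random() < rate:
        return one_cut_crossover(p1, p2, rng)
    return p1[:], p2[:]


child1, child2 = one_cut_crossover([0] * 6, [1] * 6, random.Random(3))
print(child1, child2)
```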

5) Step 5. Mutation: Mutation alters one or more genes with the mutation rate $P = n^{-1}$. A sequence of real random numbers $y_1, y_2, \ldots, y_{nK}$ is then generated, with each $y_i$ being a real number in [0, 1]. If $y_i$, $1 \le i \le nK$, is less than $P$, then the $r$th bit, $r = i - (\lceil i/n \rceil - 1)n$, of the $\lceil i/n \rceil$th chromosome will be complemented.
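The index arithmetic in the mutation step maps the $i$th random number to a chromosome and a bit position. The sketch below makes that mapping explicit and assumes the usual bit-flip convention in which a gene is complemented when its random draw falls below $P$ (the same convention used by the FIFG mutation step). The names `locate` and `mutate_population` are hypothetical.

```python
import random


def locate(i, n):
    """Map the 1-based index i to (chromosome number, bit number), both 1-based."""
    c = (i + n - 1) // n   # integer form of ceil(i / n)
    r = i - (c - 1) * n    # position of the bit inside chromosome c
    return c, r


def mutate_population(population, rng):
    """Complement each bit independently with probability P = 1/n."""
    n, K = len(population[0]), len(population)
    P = 1.0 / n
    for i in range(1, n * K + 1):
        if rng.random() < P:  # assumption: mutate when the draw is below P
            c, r = locate(i, n)
            population[c - 1][r - 1] ^= 1
    return population


print(locate(5, 4))
print(mutate_population([[0, 1, 0, 1], [1, 1, 0, 0]], random.Random(7)))
```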

Steps 2 to 5 constitute a generation. In our procedure, the whole process iterates generation by generation until either no better solution is attained within the most recent ten generations or 1500 generations have been examined. When the procedure stops, the best solution encountered is reported.
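The stopping rule can be illustrated with a small driver loop. In this sketch, `next_generation` is a hypothetical callable that runs Steps 2 to 5 once and returns the best fitness found in that generation; the names and the way the best value is tracked are assumptions for illustration.

```python
import itertools


def run_until_stalled(next_generation, initial_best,
                      patience=10, max_generations=1500):
    """Iterate until `patience` generations pass without improvement
    or `max_generations` generations have been examined."""
    best, stalled, generation = initial_best, 0, 0
    while generation < max_generations and stalled < patience:
        candidate = next_generation()  # hypothetical: one pass of Steps 2 to 5
        generation += 1
        if candidate > best:
            best, stalled = candidate, 0
        else:
            stalled += 1
    return best, generation


# Stand-in for a GA: fitness improves three times, then stays flat.
values = iter([1, 2, 3] + [3] * 100)
print(run_until_stalled(lambda: next(values), initial_best=0))
```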

B. Feasible Item First Genetic (FIFG) Approach

To cope with the FNTI problem, we propose the FIFG approach. This GA differs from the previous one in representation, fitness function, and mutation scheme. Therefore, we introduce these parts only.

TABLE I

BRIEF DESCRIPTION OF EACH ITEM BANK

1) Step 1. Create Population (Encode): Let variable $K$ be the number of the chromosomes in $S$, the initial population, and variable $S_k$ be the $k$th chromosome of $S$. Chromosome $S_k$ is represented as $[x_{k1}, x_{k2}, \ldots, x_{k,\mathit{q\_num}}]$, consisting of $\mathit{q\_num}$ genes, each of which denotes a selected item. A set of integers is randomly generated to represent the test item numbers, for example, $X = [25, 908, \ldots, 113]$. Note that $x_i \ne x_j$ for $1 \le i \ne j \le \mathit{q\_num}$.
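For the FIFG encoding, each chromosome is a list of distinct item numbers. A minimal sketch, assuming sampling without replacement from the item numbers 1 to $n$ (the function name `create_fifg_population` is hypothetical):

```python
import random


def create_fifg_population(n, q_num, K, rng):
    """Build K chromosomes, each holding q_num distinct item numbers in 1..n."""
    # Sampling without replacement guarantees that no item appears twice.
    return [rng.sample(range(1, n + 1), q_num) for _ in range(K)]


fifg_population = create_fifg_population(n=1000, q_num=5, K=3, rng=random.Random(0))
for chromosome in fifg_population:
    print(chromosome)
```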

2) Step 2. Fitness Ranking: To satisfy the constraints of the lower bound on the expected relevance of each concept, we define a penalty function $R = dc \times ipt$, where $R$ is a penalty score, $dc = \sum_{j=1}^{m} \max\{h_j - \sum_{i=1}^{\mathit{q\_num}} r_{x_i j}, 0\}$ is the sum of distances between the relevance of each concept for the currently selected test items and the corresponding lower bound, and $ipt$ is the user-defined penalty weight. The evaluation function is defined as

$$v(S_k) = \sum_{i=1}^{\mathit{q\_num}} d_{x_i} - R.$$
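The FIFG evaluation function is the total discrimination of the listed items minus the relevance penalty. The sketch below assumes that the sums run over the genes of the chromosome, in line with the FNTI model; the function name `fifg_fitness` and the toy data are made up for illustration.

```python
def fifg_fitness(chromosome, d, r, h, ipt):
    """Total discrimination of the listed (1-based) items minus the penalty R."""
    total_d = sum(d[q - 1] for q in chromosome)
    # dc: total shortfall below the relevance lower bounds, over all concepts.
    dc = sum(max(h[j] - sum(r[q - 1][j] for q in chromosome), 0.0)
             for j in range(len(h)))
    return total_d - dc * ipt


# Toy item bank with made-up numbers: 3 items, 2 concepts.
d = [0.8, 0.4, 0.6]
r = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
h = [1.0, 0.5]
print(fifg_fitness([1, 3], d, r, h, ipt=2.0))  # no shortfall, no penalty
print(fifg_fitness([1], d, r, h, ipt=2.0))     # second concept is not covered
```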

3) Step 5. Mutation: The “mutation” operation alters one or more genes with the mutation rate $P = n^{-1}$. A sequence of real random numbers $y_1, y_2, \ldots, y_{\mathit{q\_num} \times K}$ is then generated, with $y_i$ being a real number in [0, 1], for $i = 1$ to $\mathit{q\_num} \times K$. A random number selected from 1 to $n$ is used to replace the value of the $i$th gene if $y_i < P$.
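The FIFG mutation step can be sketched as below. The text does not say what happens when the replacement item is already on the test sheet; this example assumes the old gene is simply kept in that case so that the items stay distinct. The function name `fifg_mutate` is hypothetical.

```python
import random


def fifg_mutate(population, n, rng):
    """Replace each gene by a random item number in 1..n with probability 1/n."""
    P = 1.0 / n
    for chromosome in population:
        for i in range(len(chromosome)):
            if rng.random() < P:
                candidate = rng.randint(1, n)
                # Assumption: skip the replacement if it would duplicate an item.
                if candidate not in chromosome:
                    chromosome[i] = candidate
    return population


sheets = fifg_mutate([[1, 2, 3], [4, 5, 6]], n=6, rng=random.Random(1))
print(sheets)
```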

IV. EXPERIMENTS AND EVALUATION

To evaluate the performance of the proposed algorithms, two experiments have been conducted to compare the execution time and the solution quality of four solution-seeking strategies: CLFG, FIFG, random selection, and exhaustive search. The random selection program generates the test sheet by selecting test items randomly to meet the constraints of time interval or number of test items, while the exhaustive search program examines every feasible combination of the test items to find the optimal solution. Eight item banks of K7 to K9 mathematics courses have been employed in the experiments. Table I shows a brief description of each item bank, where $N$ indicates the total number of test items. The platform of the experiments is a personal computer with a Pentium III 1.0-GHz CPU and 256-MB random-access memory (RAM). The programs are coded in Java.

The experiment is conducted by applying CLFG and FIFG twenty times on each item bank with the average execution time and discrimination degree recorded. Tables II–IV show the experimental results for the lower bounds of testing time being 30, 60, and 120 min, respectively. It can be seen that for most cases, it is time-consuming to derive optimal solutions. For $N = 30$ and $l = 60$, it takes nearly 3 h (i.e., 187 min) to find an optimal solution. Such a lengthy process is obviously unacceptable. When the values of $N$ and $l$ increase, it becomes very unlikely to find optimal solutions in reasonable time. This indicates the need for heuristic algorithms to derive approximate solutions at a certain quality level.

It can be seen that test sheets with near-optimal discrimination degrees can be obtained in a much shorter time by employing CLFG than by the random selection approach. The line charts also show that the results of CLFG are very close to the known optimal solutions. For each case with more than 250 candidate test items, the execution time for finding optimal solutions is more than 1 000 000 min, which is not acceptable, while CLFG can still generate test sheets with a degree of discrimination greater than 0.9.

TABLE II

EXPERIMENTAL RESULTS FOR l = 30

TABLE III

TABLE IV

Figs. 1 and 2 depict the chart concerning the execution time of CLFG and that of finding optimum solutions. When the number of candidate test items exceeds 40, it is almost impossible to find an optimal solution, while CLFG can find near-optimal solutions in a very short time (less than 1 min).

Moreover, the statistics show that CLFG can efficiently select proper test items from an item bank containing 2500 candidate test items. Even when the number of test items in the item bank increases to 4000, the execution time of CLFG is still acceptable (about 25 to 35 s).

It is also interesting to compare the performances of CLFG and FIFG although they are used to solve different problems with different GA representations. In Tables V–VII, the experiment results of FIFG and CLFG are given to compare the execution time and discrimination degree of each generated test sheet. In each table, $dc = \sum_{j=1}^{m} \max\{h_j - \sum_{i=1}^{n} r_{ij} x_i, 0\}$ is the sum of distances between the relevance of each concept for the currently selected test items and the corresponding lower bound, and $\mathit{q\_num}$ is the number of test items selected in the generated test sheet.

Fig. 1. Runtimes of CLFG and Optimum for l = 30.

Fig. 2. Runtimes of CLFG and Optimum for l = 60.

TABLE V

EXPERIMENTAL RESULTS FOR q_num = 18

From Tables V–VII, it can be seen that the discrimination degrees reported by FIFG and CLFG are quite close to each other. Sometimes the discrimination degree of FIFG even exceeds that of CLFG with less time elapsed. The number of candidate test items has no significant impact on the runtime of FIFG. That is, as the number of candidate test items increases, FIFG can still demonstrate an impressive performance.

V. CONCLUSION AND FUTURE WORK

TABLE VI

TABLE VII

In this paper, we have proposed two genetic algorithms, CLFG and FIFG, to cope with the test sheet-generating problems. Experimental results show that test sheets with near-optimal discrimination degrees can be obtained in a much shorter time by employing our approaches. The two algorithms have been embedded in a CAI system, Intelligent Tutoring, Evaluation, and Diagnosis (ITED-II), to provide a more informative, flexible, and capable tool for the instructors and learners [9]. The testing subsystem of ITED-II accepts assessment requirements and reads the test items from the item bank to generate test sheets. After conducting a test, the test results are transmitted to the tutoring subsystem for arranging adaptive subject materials. The commercial version of ITED-II is funded by an e-learning company and is scheduled for release in November 2005. This version will incorporate the following development strategies.

1) The subject materials and item banks are designed to completely match the contents of textbooks for primary schools and junior high schools.

2) Several functions suggested by the primary school and junior high school teachers, including adaptive learning, adaptive testing, personalized learning diagnosis, and guiding, are provided.

3) A client program is delivered to the teachers and students as a low-price compact-disc read-only memory (CD-ROM) bundled with the textbooks. The trial CD-ROM contains limited functions to demonstrate part of the subject materials, test items, and learning diagnosis functions.

4) The users need to register with the server to access advanced functions and complete subject materials, which are charged by month, semester, or year.

Several other AI- or optimization-based technologies, such as Tabu search, Ant systems, and heuristic algorithms, could be employed to develop more efficient test sheet-generating approaches for very large item banks. To facilitate possible comparisons between different problem-solving approaches, the test sheet-generating programs and the database schema of the item bank are available from the corresponding author upon request.

REFERENCES

[1] C. Chou, “Constructing a computer-assisted testing and evaluation system on the world wide web-the CATES experience,” IEEE Trans. Educ., vol. 43, no. 3, pp. 266–272, Aug. 2000.

[2] J. M. Feldman and J. Jones Jr., “Semiautomatic testing of student software under Unix(R),” IEEE Trans. Educ., vol. 40, no. 2, pp. 158–161, May 1997.

[3] M. R. Garey and D. S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness. San Francisco, CA: Freeman, 1979.

[4] D. E. Goldberg, Genetic Algorithms in Search, Optimization, and Machine Learning. Reading, MA: Addison-Wesley, 1989.

[5] A. V. Gonzalez and L. R. Ingraham, “Automated exercise progression in simulation-based training,” IEEE Trans. Syst., Man, Cybern., vol. 24, no. 6, pp. 863–874, Jun. 1994.

[6] F. S. Hillier and G. J. Lieberman, Introduction to Operations Research, 7th ed. New York: McGraw-Hill, 2001.

[7] K. Hitomi, Manufacturing Systems Engineering, 2nd ed. London, U.K.: Taylor & Francis.

[8] S. Hopper, “Cooperative learning and computer-based instruction,” Educ. Technol. Res. Develop., vol. 40, no. 3, pp. 21–38, 1992.

[9] G.-J. Hwang, “On the development of a cooperative tutoring environment on computer networks,” IEEE Trans. Syst., Man, Cybern. Part C, vol. 32, no. 3, pp. 272–278, Aug. 2002.

[10] G.-J. Hwang, “A concept map model for developing intelligent tutoring systems,” Comput. Educ., vol. 40, no. 3, pp. 217–235, 2003.

[11] G.-J. Hwang, “A test sheet generating algorithm for multiple assessment requirements,” IEEE Trans. Educ., vol. 46, no. 3, pp. 329–337, Aug. 2003.

[12] G. J. Hwang, T. L. Lin, and B. M. T. Lin, “An effective approach to the composition of test sheets from large item banks,” in Proc. 5th Int. Congr. Industrial Applied Mathematics, Sydney, Australia, July 7–11, 2003.

[13] J. T. Linderoth and M. W. P. Savelsbergh, “A computational study of search strategies for mixed integer programming,” INFORMS J. Comput., vol. 11, no. 2, pp. 173–187, 1999.

[14] P. Lira, M. Bronfman, and J. Eyzaguirre, “MULTITEST II: A program for the generation, correction, and analysis of multiple choice tests,” IEEE Trans. Educ., vol. 33, no. 4, pp. 320–325, Nov. 1990.

[15] J. B. Olsen, D. D. Maynes, D. Slawson, and K. Ho, “Comparison and equating of paper-administered, computer-administered, and computerized adaptive tests of achievement,” in Proc. Annu. Meet. American Educational Research Association, California, Apr. 16–20, 1986.

[16] H. R. Parsaei, Genetic Algorithms and Engineering Design. New York: Wiley, 1997.

[17] K. Rasmussen, P. Northrup, and R. Lee, “Implementing web-based instruction,” in Web-Based Instruction, B. H. Khan, Ed. Englewood Cliffs, NJ: Educational Technology, 1997, pp. 341–346.

[18] N. Singh, Systems Approach to Computer-Integrated Design and Manufacturing. New York: Wiley, 1996.

[19] H. Wainer, Computerized Adaptive Testing: A Primer. Hillsdale, NJ: Lawrence Erlbaum Associates, 1990.