
A random assignment of an essay to one of the four categories would achieve an average accuracy of only 25%. We considered features at the word, sentence, and essay levels in this classification task, and we found that doing so improved the F measure from 0.381 (Table 6) to 0.536 (Table 11). The best F measures were obtained with LMT in Weka in 10-fold cross-validation tests. Not all classifiers achieved the same quality of classification; among the four types of classifiers used in this study, LMT performed the best on average.
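For readers who wish to reproduce this kind of evaluation, the following is a minimal sketch of a 10-fold cross-validation of LMT via the Weka Java API. It assumes the extracted features have been exported to an ARFF file; the file name essays.arff and the attribute layout are hypothetical, not the authors' actual pipeline.

import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.trees.LMT;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class LmtCrossValidation {
    public static void main(String[] args) throws Exception {
        // Load one feature vector per essay; "essays.arff" is a hypothetical file name.
        Instances data = DataSource.read("essays.arff");
        data.setClassIndex(data.numAttributes() - 1); // last attribute: the four-way level label

        // Stratified 10-fold cross-validation with a logistic model tree (LMT).
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new LMT(), data, 10, new Random(1));

        // Weighted F measure over the four classes, plus the confusion matrix.
        System.out.printf("Weighted F measure: %.3f%n", eval.weightedFMeasure());
        System.out.println(eval.toMatrixString("Confusion matrix"));
    }
}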

The improvement is not small, but it is not sufficient either. Determining the readability levels of essays may not be as easy a problem as the well-known readability scores would suggest.

We analyzed our corpus with the SMOG scores in Section 5.1, and found that essays at supposedly more challenging levels do not necessarily have higher SMOG scores than the supposedly easier essays.

Figure 8. Readability scores of three popular formulae

We explored two additional scores for readability. In Figure 8, we show the SMOG, FKGL19, and ARI20 scores for 100 arbitrarily chosen essays from our corpus. The three curves are quite similar, which is not very surprising, because these score functions rely mainly on the counts of words of different difficulty levels and on the number of sentences in an essay. Hence, if using SMOG does not achieve good results for the classification task in our study (cf. Figure 4), using the other two alternatives would not achieve much better results either.

19 Flesch-Kincaid Grade Level. http://en.wikipedia.org/wiki/Flesch-Kincaid_Readability_Test

20 Automated readability index. http://en.wikipedia.org/wiki/Automated_Readability_Index
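For reference, the standard published forms of the three formulas make this dependence explicit. Writing $n_w$, $n_s$, $n_{syl}$, $n_c$, and $n_p$ for the numbers of words, sentences, syllables, characters, and words of three or more syllables in an essay:

$$\mathrm{SMOG} = 1.0430\sqrt{n_p \cdot \frac{30}{n_s}} + 3.1291$$
$$\mathrm{FKGL} = 0.39\,\frac{n_w}{n_s} + 11.8\,\frac{n_{syl}}{n_w} - 15.59$$
$$\mathrm{ARI} = 4.71\,\frac{n_c}{n_w} + 0.5\,\frac{n_w}{n_s} - 21.43$$

All three are increasing functions of essentially the same surface statistics of essay length and word difficulty, so strongly correlated scores are to be expected.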

One challenge to our work is whether we should consider only the short essays when classifying the levels of the comprehension tests. A comprehension test contains an essay part and a question part. Obviously, we should take the questions into consideration in the classification task, which we have not yet attempted. In addition, due to the “examination-centered” style of education in Taiwan, the same short essay may be reused in tests for students of higher grades. Such reuse of short essays made our classification task more difficult, because it rendered the “correct class” of an essay rather ambiguous.

Whether linguistic features are sufficient for determining the readability of essays is also an issue. Understanding an essay may require domain-dependent knowledge that we have not attempted to encode in our features (Carrell, 1983). Culture-dependent issues may also play a role (Carrell, 1981). Hence, more features are needed to further improve the prediction of readability, e.g., (Crossley et al., 2008; Zhang, 2008).

A review comment suggested that there might not be sufficient differences between the short essays used in the first and second semesters of a school year, so classifying the short essays into three levels (one for each school year) might be more practical. Although we did not pursue this direction, we find the suggestion interesting.

A reviewer noticed an interesting crossing point in Figure 4: a SMOG score of about 11.5 seems to be the major point at which the curves in Figure 4 intersect. A similar phenomenon appears in Figure 8, where approximately half of the scores of the 100 essays are above 11.5. Whether 11.5 is the watershed between easy and difficult essays is an interesting hypothesis to verify with a larger collection of essays.
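One straightforward way to test this hypothesis on a larger collection would be a thresholded baseline. The sketch below (a hypothetical data layout, not a classifier used in this study) labels an essay as difficult when its SMOG score exceeds the threshold and measures agreement with binary gold labels:

class WatershedCheck {
    // Fraction of essays for which the rule "SMOG > threshold => difficult"
    // agrees with the binary gold label.
    static double thresholdAccuracy(double[] smog, boolean[] difficult, double threshold) {
        int correct = 0;
        for (int i = 0; i < smog.length; i++) {
            if ((smog[i] > threshold) == difficult[i]) {
                correct++;
            }
        }
        return (double) correct / smog.length;
    }
}

An accuracy well above chance with the threshold set to 11.5 would lend support to the watershed hypothesis.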

Acknowledgments

The work was supported in part by the funding from the National Science Council in Taiwan under the contracts NSC-97-2221-004-007, NSC-98-2815-C-004-003-E, and NSC-99-2221-004-007. The authors would like to thank Miss Min-Hua Lai for her technical support in this study and Professor Zhao-Ming Gao for his comments on an earlier report (Kuo et al., 2009) of this paper.


References

Attali, Y. & Burstein, J. (2006). Automated essay scoring with e-rater V.2, Journal of Technology, Learning, and Assessment, 4(3), 3-30.

Bailin, A. & Grafstein, A. (2001). The linguistic assumptions underlying readability formulae: A critique, Language and Communication, 21(2), 285-301.

Burstein, J., Marcu, D., & Knight, K. (2003). Finding the WRITE stuff: Automatic identification of discourse structure in student essays, IEEE Intelligent Systems, 18(1), 32-39.

Carrell, P. L. (1981). Culture-specific schemata in L2 comprehension, Selected Papers from the Ninth Illinois TESOL/BE Annual Convention, the First Midwest TESOL Conference, 123-132.

Carrell, P. L. (1983). Some issues in studying the role of schemata or background knowledge in second language comprehension, Reading in a Foreign Language, 1(1), 81-92.

Chall, J. & Dale, E. (1995). Readability Revisited: The New Dale-Chall Readability Formula. Brookline Books.

Chang, T.-H., Lee, C.-H., & Chang, Y.-M. (2006). Enhancing automatic Chinese essay scoring system from figures-of-speech, Proceedings of the Twentieth Pacific Asia Conference on Language, Information and Computation, 28-34.

Crossley, S. A., Greenfield, J., & McNamara, D. S. (2008). Assessing text readability using cognitively based indices, TESOL Quarterly, 42(3), 475-493.

Flesch, R. (1948). A New Readability Yardstick, Journal of Applied Psychology, 32(3), 221-233.

Huang, C.-S., Kuo, W.-T., Lee, C.-L., Tsai, C.-C., & Liu, C.-L. (2010). Using linguistic features to classify texts for reading comprehension tests at the high school levels, Proceedings of the Twenty Second Conference on Computational Linguistics and Speech Processing (ROCLING XXII), 98-112. (in Chinese)

Kincaid, J. P., Fishburne, R. P., Rogers, R. L., & Chissom, B. S. (1975). Derivation of New Readability Formulas (Automated Readability Index, Fog Count and Flesch Reading Ease Formula) for Navy Enlisted Personnel, Research Branch Report 8-75.

Kuo, W.-T., Huang, C.-S., Lai, M.-H., Liu, C.-L., & Gao, Z.-M. (2009). A comparison of features for the classification of short essays in high school English reading comprehension tests, Proceedings of the Fourteenth Conference on Artificial Intelligence and Applications. (in Chinese)

Lin, S.-Y., Su, C.-C., Lai, Y.-D., Yang, L.-C., & Hsieh, S.-K. (2009). Assessing text readability using hierarchical lexical relations retrieved from WordNet, International Journal of Computational Linguistics and Chinese Language Processing, 14(1), 45-84.

MOE. (2008). http://www.edu.tw/eje/content.aspx?site_content_sn=15326


Shih, R. H., Chiang, J. Y., & Tien, F. (2000). Part-of-speech sequences and distribution in a learner corpus of English, Proceedings of Research on Computational Linguistics Conference XIII (ROCLING XIII), 171-177.

Witten, I. H. & Frank, E. (2005). Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann.

Zhang, X. (2008). The effects of formal schema on reading comprehension – An experiment with Chinese EFL readers, International Journal of Computational Linguistics and Chinese Language Processing, 13(2), 197-214.

 
