In this section, the proposed features and reading difficulty estimation model are evaluated. To determine how each feature contributes to an accurate readability judgment, we first conducted an experiment to test the performance of each feature and feature category using linear regression. Next, an optimal feature set was determined by model selection, in order to investigate how best to combine features to improve reading difficulty estimation. Finally, the proposed estimation was also modeled as a multiclass classification task and compared to other related work. These experiments are described in the following subsections.
5.1 Data set
Our experiment used data from senior high school English textbooks designed for Chinese students in Taiwan to learn English as a foreign language. We gathered 342 documents from five different publishers (including the National Institute for Compilation and Translation and Nan-I Publishing Company). Poems and scripts were excluded from the data set because their formats differ from those of ordinary articles. Each document was graded on six levels, ranging from one to six. These levels indicate the semester grade levels of senior high school students, and were used as the gold standard in this work.
During data preprocessing, the majority of words were adopted in their surface forms, and stop words were retained. Even though some words derive from the same root, they are acquired and used at different ages. For example, in our scenario, the word promise is taught in the second semester, while the word promising is taught in the fourth semester. This suggests that different forms of a word represent different meanings. Thus, words were initially used directly in our study, and were lemmatized only when necessary. For the same reasons, stop words are retained in the proposed estimation, except for the frequency features.
5.2 Metrics
In the evaluation of the features and the optimal model selection, the root mean squared error (RMSE) and Pearson's correlation coefficient (r) were used to evaluate the effectiveness of the estimated reading difficulty. RMSE measures the average error between the ground truths and the estimated responses: it is an average distance indicating how closely the estimated responses approach the ground truths; the lower the RMSE, the better the estimation. Pearson's correlation coefficient (r) measures the agreement in trend between the ground truth and the generated results, and represents the strength of the linear relationship between two random variables. A high correlation shows that simple documents are estimated as having a low difficulty value, while difficult documents are predicted as having a high difficulty value.
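As a concrete illustration of the two metrics, the following sketch computes RMSE and Pearson's r for a handful of hypothetical documents graded on the one-to-six scale; the numbers are illustrative only.

```python
import numpy as np
from scipy.stats import pearsonr

def rmse(gold, predicted):
    """Root mean squared error between gold levels and estimated levels."""
    gold, predicted = np.asarray(gold, float), np.asarray(predicted, float)
    return np.sqrt(np.mean((gold - predicted) ** 2))

# Illustrative values: six documents graded on the one-to-six scale.
gold = [1, 2, 3, 4, 5, 6]
pred = [1.2, 1.8, 3.4, 3.9, 5.3, 5.8]

print(f"RMSE = {rmse(gold, pred):.2f}")  # lower is better
r, _ = pearsonr(gold, pred)              # strength of the linear trend
print(f"r = {r:.2f}")                    # closer to 1 is better
```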
In the evaluation of reading difficulty estimation as classification, not only the RMSE and correlation coefficient described above, but also accuracy and trend accuracy in direction (TAD) were adopted as measurements. Accuracy is defined as the proportion of generated results that are correct when compared with the ground truth. The TAD has been used to measure the performance of trend forecasting (Zhang et al. 2005); its result can be interpreted as the proportion of cases in which the estimated results and the gold standard move in the same direction.
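Since the exact TAD formulation is not reproduced here, the sketch below follows one plausible reading of the definition above, comparing the direction of change between consecutive documents; the pairing scheme is an assumption.

```python
import numpy as np

def accuracy(gold_levels, pred_levels):
    """Proportion of documents assigned exactly the gold level."""
    return np.mean(np.asarray(gold_levels) == np.asarray(pred_levels))

def tad(gold, pred):
    """Trend accuracy in direction: the fraction of consecutive pairs whose
    estimated change has the same sign as the gold change (an assumed
    pairing; Zhang et al. 2005 define TAD for trend forecasting)."""
    gold_dir = np.sign(np.diff(np.asarray(gold, float)))
    pred_dir = np.sign(np.diff(np.asarray(pred, float)))
    return np.mean(gold_dir == pred_dir)
```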
5.3 Evaluation of the features
Five-fold cross-validation was employed. The data was first split into five sets. One set was used as held-out data on which to predict the reading difficulties of documents, while the rest of the data was used as training data to build a regression model. This procedure was repeated five times, rotating the pairing of training and testing sets.
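The procedure can be sketched as follows with scikit-learn; the feature matrix and labels are random placeholders standing in for the 342 graded documents and the full 47-feature set.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(342, 47))    # placeholder: 342 documents x 47 features
y = rng.uniform(1, 6, size=342)   # placeholder gold grade levels

kf = KFold(n_splits=5, shuffle=True, random_state=0)
fold_rmse = []
for train_idx, test_idx in kf.split(X):  # five train/test pairings
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    pred = model.predict(X[test_idx])
    fold_rmse.append(np.sqrt(np.mean((y[test_idx] - pred) ** 2)))

print(f"mean RMSE over five folds = {np.mean(fold_rmse):.2f}")
```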
To understand the impact of individual features and feature categories, we report the RMSE scores and the correlations for the different feature categories. Table 3 summarizes these results; each row of the table corresponds to one of the feature categories described earlier. The baseline features achieved the best results, with an RMSE of 0.98 and a correlation of 0.82. This category is composed of word number, sentence length, and syllables, which indicates that these features reflect how the majority of experts design learning materials.
Table 3 Results of RMSE and correlation among different feature categories.

Categories      Features              RMSE   r
Baseline        baseline-only         0.98   0.82
Word (AOA)      gept-only             1.24   0.68
Word (AOA)      vq-only               1.25   0.67
Relation        coreference-only      1.48   0.48
Parse           parse-only            1.50   0.46
Grammar (AOA)   grammar-only          1.61   0.31
Semantic        wordnet-only          1.60   0.33
Frequency       bnc_frequency         2.22   0.12
Frequency       google_search_count   4.50   -0.04
The second and third feature categories were GEPT and VQ, both of which are word acquisition grade distribution features. They represent when non-native readers learn words at different ages. It is highly likely that the word acquisition grade distribution is also an important factor in analyzing reading difficulty for non-native readers. Even though the GEPT and VQ features are derived from different resources and classified at coarse and fine grade granularity, respectively, both were highly correlated with the ground truth, and their RMSE values were less than or equal to 1.25. Surprisingly, the GEPT word list, which is divided into only three levels, performs better than VQ, which categorizes words acquired by non-native readers from elementary school to university.
The fourth feature category consisted of the coreference features, which capture the inter-relationships between noun phrases. While the baseline, GEPT, and VQ features only take explicit words into consideration, the coreference features consider not only noun phrase types but also implicit interactions between noun phrases. Feng, Jansche, Huenerfauth, and Elhadad (2010) first proposed coreference and discourse features; in their investigation, noun phrase features and coreference inference features slightly improved performance. Our result was fairly consistent with theirs on second-language materials.
The fifth and sixth feature categories were the parse and grammar features. Both analyze the sentence structures in a document. While the parse features were automatically extracted by a parser, the grammar features were identified by a tree structure search tool based on manually defined grammatical patterns. The results of our grammar features were consistent with Heilman et al. (2007), who found that vocabulary-based features produced more accurate results than grammar-based features alone, but that complex grammar features performed better than simple ones. Even though we collected more than forty grammatical patterns across the six textbook grades (more than the Heilman method), the results indicated that the parse features were slightly better than the grammar features. It is possible that parse features are more robust than grammar features.
The next category consisted of the semantic features. Unfortunately, calculating the number of senses of each word seemed to have little impact on reading difficulty estimation. No matter how many senses a word has, the most important factor is whether the reader understands the specific sense used in a document. A better solution might first determine the sense of each word in a document, and then assess when readers learned those meanings. This may represent a new research problem for future studies.
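For reference, a sense-count feature of this kind can be extracted with NLTK's WordNet interface; whether this matches the exact extraction used in this study is an assumption.

```python
from nltk.corpus import wordnet as wn  # requires nltk.download('wordnet')

def mean_sense_count(tokens):
    """Average number of WordNet senses per word in a document."""
    counts = [len(wn.synsets(t)) for t in tokens]
    counts = [c for c in counts if c > 0]   # skip words WordNet lacks
    return sum(counts) / len(counts) if counts else 0.0

# 'promise' has many senses, yet what matters to a reader is which
# single sense the document actually uses.
print(len(wn.synsets("promise")))
```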
The last remaining features were the frequency features derived from the BNC corpus and the Google search engine. Surprisingly, frequency features were not good indicators for estimating reading difficulty. This contradicts Tanaka-Ishii, Tezuka, and Terada (2010), who used the log frequency obtained from corpora as a feature to predict document reading difficulty. One explanation is that the format of the features and the method of Tanaka-Ishii et al. (2010) may be very different from those in this study. Another explanation is the possibility that lower and higher word frequencies counteract each other when word frequencies are aggregated over a document.
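A toy example illustrates this counteraction effect: a document mixing very common and very rare words can end up with nearly the same average log frequency as a document of uniformly mid-frequency words (the frequencies below are invented).

```python
import numpy as np

# Invented per-million frequencies for the words of two documents.
easy_and_rare = np.log([5000, 4500, 2, 3])    # very common + very rare
mid_frequency = np.log([100, 120, 90, 110])   # uniformly mid-range

print(round(easy_and_rare.mean(), 2))   # ~4.68
print(round(mid_frequency.mean(), 2))   # ~4.65: nearly indistinguishable
```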
The results indicated that the lexical feature categories (the baseline and the word acquisition grade distribution features) produced more accurate results than the structure-based feature categories (the coreference, parse, and grammar features) alone.
5.4 Optimal model selection
To investigate how combining features improves reading difficulty estimation, forward selection was used to select the best subset of features for linear regression, and the Bayesian information criterion (BIC; Schwarz, 1978) was applied to decide the best regression model. If a regression model employs every available feature (47 in total), it becomes overly sensitive to the training data; such a poorly designed model will perform poorly on the testing data. This section examines how this study identified an appropriate model with features that play important roles in determining reading difficulty.
Forward selection was employed to find the optimal model; it starts with the intercept and, at each step, adds the feature that most improves the fit. The detailed rules for this process are as follows:
Step 1. The feature with the highest Pearson correlation coefficient was selected as the first feature in the model.
Step 2. The candidate feature with the highest semi-partial correlation was selected and added to the model.
Step 3. After adding the new feature in Step 2, the squared multiple correlation coefficient (R²) of the new model was calculated.
Step 4. To test whether the new feature contributes significantly to the model, the difference between the new and old R² values was evaluated using analysis of variance (ANOVA).
Step 5. If the increment in Step 4 was statistically significant, the new feature stayed in the model; otherwise, it was removed.
Steps 2 through 5 are repeated until the addition of further features produces no significant improvement; a sketch of this procedure is shown below.
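The following sketch implements the five steps with scikit-learn and an F-test on the R² increment; it is an illustration of the procedure, not the authors' code, and the significance level is an assumed default.

```python
import numpy as np
from scipy.stats import f as f_dist
from sklearn.linear_model import LinearRegression

def forward_select(X, y, alpha=0.05):
    """Greedy forward selection guarded by an ANOVA F-test on the R^2
    increment (Steps 1-5 above)."""
    def r2(cols):
        return LinearRegression().fit(X[:, cols], y).score(X[:, cols], y)

    n = X.shape[0]
    selected, remaining, r2_old = [], list(range(X.shape[1])), 0.0
    while remaining:
        # Maximizing the new R^2 is equivalent to choosing the candidate
        # with the highest (semi-partial) correlation (Steps 1 and 2).
        best = max(remaining, key=lambda j: r2(selected + [j]))
        r2_new = r2(selected + [best])                          # Step 3
        k = len(selected) + 1
        F = (r2_new - r2_old) / ((1.0 - r2_new) / (n - k - 1))  # Step 4
        if f_dist.sf(F, 1, n - k - 1) >= alpha:                 # Step 5
            break
        selected.append(best)
        remaining.remove(best)
        r2_old = r2_new
    return selected
```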
The Bayesian information criterion selects the best model by estimating the Kullback-Leibler divergence between a true model and a proposed model while incorporating the sample size. It also introduces a penalty term for the number of parameters in a model. BIC is denoted as:
BIC = n × ln(RSS / n) + k × ln(n)    (18)
where RSS is the residual sum of squares from the regression model, k denotes the number of features in the model, and n is the sample size. The second term serves to penalize complex models, giving preference to simpler models in selection.
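As a concreteness check, Equation (18) can be evaluated directly against Table 4; it reproduces the reported BIC values when n = 342 and k counts the selected features (excluding the intercept), an interpretation verified numerically below.

```python
import numpy as np

def bic(rss, n, k):
    """Equation (18): BIC = n * ln(RSS / n) + k * ln(n); lower is better."""
    return n * np.log(rss / n) + k * np.log(n)

# Cross-checked against Table 4 with n = 342 documents:
print(round(bic(532.87, n=342, k=1), 2))   # ->  157.50 (first model)
print(round(bic(229.99, n=342, k=5), 2))   # -> -106.52 (fifth model)
```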
Table 4 summarizes the top-performing selective models and the full model. As shown in the second row of the table, the number of words in a document was the feature with the highest correlation, and thus it entered the model first. The remaining features were added to the model in turn, according to their significant individual contributions, as described in the second column of Table 4. As shown, when gept1 was added to the model, the results greatly improved: the RMSE of the second model dropped to 0.96 and the correlation rose to 0.82. This represents a positive contribution from the word acquisition grade distributions. These results imply that, given a document, its level can be accurately estimated by combining the number of words and the proportion of gept1 word-list words in the regression model.
Table 4 Results of the optimal model selection.

Model   Added Feature   RMSE   r      BIC       RSS
1       word_number     1.25   0.67   157.50    532.87
2       gept1           0.96   0.82   -20.19    311.58
3       tree_height     0.87   0.86   -87.77    251.39
4       vq12            0.85   0.86   -99.52    238.79
5       vq13            0.84   0.87   -106.52   229.99
6       proper_noun     0.84   0.87   -104.46   227.47
7       vq15            0.84   0.87   -101.14   225.80
8       vq5             0.85   0.86   -97.86    224.12
9       antecedent      0.85   0.86   -94.88    222.26
10      vq11            0.85   0.86   -91.40    220.74
all                     1.51   0.64   121.91    208.15
Based on the BIC values, the best model combined the following features: the number of words, gept1, tree height, vq12, and vq13. Our results show that this fifth model in Table 4 had the smallest difference between the gold standard and the estimated levels: the RMSE was as small as 0.84 and the correlation, at 0.87, was closest to 1. As other features were added, the performance remained steady until the seventh model, after which it began to decrease. This finding suggests that the word acquisition grade distributions and the average complexity of sentence structures are important factors in reading difficulty and should be taken into consideration. Apart from the average tree height, the gept1 word list refers to words that readers have already learned, while the vq12 and vq13 word lists contain vocabulary that is currently being acquired, corresponding to the specific readers' ability. From these results, we conclude that for non-native readers, previously learned vocabulary, currently acquired vocabulary, and the complexity of sentence structures lead to successful reading difficulty estimation.
To better compare these potential models, the performance of the selected models is presented in Figure 5. The upper half of the figure illustrates the RMSE of the models as the number of features increases, while the lower half shows the correlation between the models and the ground truth. Initially, the RMSE decreases and the correlation rises sharply as features are added. After the most accurate model is identified, the performance on both measures levels off. This can be seen as a great advantage of model selection, since a small number of identified features achieves a satisfying outcome. When the Google search count variable was added to the model (the 29th model), the RMSE rapidly increased and the correlation significantly declined. This implies that frequency in a large corpus, such as Google, might not be as useful as the word acquisition grade distributions in reading difficulty estimation. These results reinforce the findings of previous studies (Huang, Chang, Sun and Chen, 2011), in which the word acquisition grade distributions had greater relative importance than frequency within corpora. After the 29th model, the performance fluctuated and worsened on both measurements compared to the previous models. This indicates that performance becomes unstable when the model over-fits; the error may be due to the fact that these models capture idiosyncrasies of the training data rather than generalities.
Figure 5 The performance of the selected models.
To understand the impact of the feature sets, we also investigated the performance of several regression algorithms with the two different feature sets. These algorithms are available in the WEKA package (Bouckaert, Frank, Hall, Holmes, Pfahringer, Reutemann, … and Sonnenburg, 2010), and include Support Vector Regression (SVR; EL-Manzalawy and Honavar, 2005), Sequential Minimal Optimization for Regression (SMOreg; Shevade, Keerthi, Bhattacharyya and Murthy, 2000), Pace Regression (Wang and Witten, 2002), and linear regression. All parameters were left at their default settings. Adopting regression as a reading difficulty model assumes that the output scores are continuous and related to each other. Table 5 shows the results of these regressions. All methods with only the optimal features outperformed those with all features, which indicates that the optimal feature set helps regression estimates. In addition, the results of the linear regression outperformed those of SVR. This finding contrasts with Kate et al. (2010), who noted similar performance among regression algorithms. SMOreg, an SVR trained by Sequential Minimal Optimization, had the best performance among the models with all features; however, linear regression with the optimal features matched the results of SMOreg and Pace Regression. This supports our contribution: once the optimal feature set is identified, the performance of various regression techniques is similar.
Table 5 Results of the regression algorithms with all features and with the optimal features.

Method                                    All Features    Optimal Features
                                          RMSE    r       RMSE    r
Support Vector Regression (nu-SVR)        1.66    0.22    1.24    0.60
Support Vector Regression (epsilon-SVR)   1.65    0.25    1.42    0.59
SMO for Regression (SMOreg)               1.25    0.73    0.84    0.87
Pace Regression                           1.95    0.52    0.84    0.87
Linear Regression (Proposed)              1.51    0.64    0.84    0.87
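For readers without a WEKA setup, a rough Python analogue of this comparison is sketched below using scikit-learn; Pace Regression and SMOreg have no direct scikit-learn equivalents, and the data here are random placeholders.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.svm import NuSVR, SVR
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
X = rng.normal(size=(342, 5))    # placeholder optimal-feature matrix
y = rng.uniform(1, 6, size=342)  # placeholder grade levels

for name, est in [("nu-SVR", NuSVR()), ("epsilon-SVR", SVR()),
                  ("Linear Regression", LinearRegression())]:
    pred = cross_val_predict(est, X, y, cv=5)  # five-fold estimates
    rmse = np.sqrt(np.mean((y - pred) ** 2))
    r, _ = pearsonr(y, pred)
    print(f"{name}: RMSE={rmse:.2f}, r={r:.2f}")
```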
5.5 Reading difficulty estimation as classification
The proposed model can also be cast as a multiclass classification. The labels were determined by eight levels ranging from zero to seven: zero represents documents below the specific readers' ability, such as elementary school textbooks; seven represents documents above the specific readers' ability, such as college textbooks; the remaining levels correspond to the semester grades of senior high school. First, thresholds were found, and each estimated score was assigned to the closest level according to these thresholds. The minimum threshold was set to the minimum value in the training dataset; likewise, the maximum threshold was set to the maximum value in the training dataset.
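A sketch of this thresholding step follows. The outermost thresholds come from the training minimum and maximum as described above; deriving the inner boundaries as midpoints between the mean scores of adjacent levels is an assumption, since that detail is not spelled out here.

```python
import numpy as np

def fit_thresholds(train_scores, train_levels):
    """Outer thresholds from the training min/max; inner boundaries as
    midpoints between mean scores of adjacent levels (assumed scheme)."""
    levels = np.unique(train_levels)
    means = [train_scores[train_levels == lv].mean() for lv in levels]
    inner = [(a + b) / 2 for a, b in zip(means[:-1], means[1:])]
    return np.array([train_scores.min(), *inner, train_scores.max()])

def assign_level(score, thresholds):
    """Map a continuous estimate to the eight-level scheme: below the
    minimum -> 0, above the maximum -> 7, otherwise one of the six
    semester grades."""
    if score < thresholds[0]:
        return 0
    if score > thresholds[-1]:
        return 7
    return int(np.searchsorted(thresholds[1:-1], score)) + 1
```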
To compare the performance of the proposed estimation with other studies, our experiment also compared the estimated levels with the Flesch Reading Ease (Flesch 1948), Flesch–Kincaid Grade Level (Kincaid et al. 1975), Coleman–Liau (Coleman and Liau 1975), Lexile (Stenner 1996), and the Heilman method (Heilman et al. 2007). The Flesch–Kincaid Grade Level, Flesch Reading Ease, and Coleman–Liau formulas were re-implemented, while Lexile and the Heilman method are available online. All of these methods are designed for native readers. In the training phase, the output score that each estimation generates for each document was computed and, following the same procedure as the proposed method, the thresholds between the levels of each estimation were derived. During the testing phase, the estimated levels of the testing documents were determined using these thresholds.
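The two Flesch formulas that were re-implemented are standard and can be written directly; the document counts in the example are invented, and a syllable counter is needed in practice.

```python
def flesch_reading_ease(words, sentences, syllables):
    """Flesch (1948): higher scores indicate easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)

def flesch_kincaid_grade(words, sentences, syllables):
    """Kincaid et al. (1975): the output is a U.S. school grade level."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

# An invented document with 120 words, 8 sentences, and 170 syllables.
print(round(flesch_reading_ease(120, 8, 170), 1))   # -> 71.8
print(round(flesch_kincaid_grade(120, 8, 170), 1))  # -> 7.0
```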
With respect to accuracy and RMSE, we expected the proposed estimation to produce a clearly more accurate reading difficulty prediction than the other estimations. With respect to TAD, we expected the proposed estimation to be consistent with the ground truth, although it might tend to predict easy documents with a lower grade and difficult documents with a higher grade. With respect to the correlation coefficient, we expected the proposed estimation to report a markedly higher correlation than the other estimations, suggesting that its relationship with the ground truth is stronger. In summary, we expected existing difficulty estimation methods to perform poorly for second-language learners, possibly due to the different and insufficient features they use.
Table 6 shows the results of the proposed estimation and the other estimations. In terms of accuracy and RMSE, the proposed estimation clearly produced a more accurate reading difficulty prediction than the other estimations. When the proposed estimation fails to predict the correct reading difficulty, its errors almost always fall within one grade; by comparison, the errors of the Flesch–Kincaid Grade Level, Lexile, and the Heilman method ranged between one and two grades, and the Flesch Reading Ease and Coleman–Liau had even wider error ranges. In terms of TAD, the proposed estimation was consistent with the ground truth, although it tended to predict easy documents with a lower grade and difficult documents with a higher grade; in contrast, the results of the other methods fluctuated. In terms of the correlation coefficient, all estimations were positively correlated, but the proposed estimation reported a particularly high correlation of 0.87, whereas the others were below 0.5. This suggests that the relationship between the proposed estimation and the ground truth is stronger than for the others. In summary, existing difficulty estimation methods perform poorly for non-native readers, which may be due to the different and insufficient features they use.
Table 6 Comparison between the estimations.

Estimations                                          RMSE   r      Accuracy   TAD
Flesch Reading Ease                                  2.17   0.27   0.28       0.40
Flesch–Kincaid Grade Level                           1.85   0.48   0.26       0.49
Coleman–Liau                                         2.16   0.31   0.24       0.41
Heilman                                              1.84   0.41   0.26       0.43
Lexile                                               1.76   0.46   0.33       0.49
Model 5 (word_number+gept1+tree_height+vq12+vq13)    1.01   0.87   0.42       0.68