In this section, the proposed features and reading difficulty estimation model are evaluated. To determine how each feature contributes to an accurate readability judgment, we first conducted an experiment to test the performance of each feature and feature category using linear regression. Next, an optimal feature set was determined by model selection, in order to investigate how best to combine features to improve reading difficulty estimation. Finally, the proposed estimation was also modeled as a multiclass classification task and compared to other related work. These experiments are described in the following subsections.
5.1 Data set
Our experiment used data from senior high school English textbooks designed for Chinese students in Taiwan to learn English as a foreign language. We gathered 342 documents from five different publishers (including the National Institute for Compilation and Translation and Nan-I Publishing Company). Poems and scripts were excluded from the data set because their formats differ from those of ordinary articles. Each document was graded on six levels, ranging from one to six. These levels indicate the semester grade levels of senior high school students, and were used as the gold standard in this work.
During data preprocessing, the majority of words were adopted in their surface forms, and stop words were retained. Even though some words derive from the same root, they are acquired and used at different ages. For example, in our scenario, the word promise is taught in the second semester, while the word promising is taught in the fourth semester. This suggests that different forms of a word represent different meanings. Thus, words were initially used directly in our study, and were lemmatized only when necessary. For the same reasons, stop words are retained in the proposed estimation, except for the frequency features.
5.2 Metrics
In the evaluation of the features and the optimal model selection, the root mean squared error (RMSE) and Pearson's correlation coefficient (r) were used to evaluate the effectiveness of the estimated reading difficulty. RMSE measures the average error between the ground truths and the estimated responses: it is an average distance indicating how closely the estimated responses approach the ground truths; the lower the RMSE, the better the estimation. Pearson's correlation coefficient (r) measures the agreement in trend between the ground truth and the generated results, and represents the strength of the linear relationship between two random variables. A high correlation shows that simple documents are estimated as having a low difficulty value, while difficult documents are predicted as having a high difficulty value.
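As a concrete illustration of the two metrics, the following sketch computes RMSE and Pearson's r for a handful of hypothetical documents graded on the one-to-six scale; the numbers are illustrative only.

```python
import numpy as np
from scipy.stats import pearsonr

def rmse(gold, predicted):
    """Root mean squared error between gold levels and estimated levels."""
    gold, predicted = np.asarray(gold, float), np.asarray(predicted, float)
    return np.sqrt(np.mean((gold - predicted) ** 2))

# Illustrative values: six documents graded on the one-to-six scale.
gold = [1, 2, 3, 4, 5, 6]
pred = [1.2, 1.8, 3.4, 3.9, 5.3, 5.8]

print(f"RMSE = {rmse(gold, pred):.2f}")  # lower is better
r, _ = pearsonr(gold, pred)              # strength of the linear trend
print(f"r = {r:.2f}")                    # closer to 1 is better
```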
In the evaluation of reading difficulty estimation as classification, not only the RMSE and correlation coefficient described above, but also accuracy and trend accuracy in direction (TAD) were adopted as measurements. Accuracy is defined as the proportion of generated results that are correct when compared with the ground truth. The TAD has been used to measure the performance of trend forecasting (Zhang et al. 2005); its result can be interpreted as the proportion of cases in which the estimated results and the gold standard move in the same direction.
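Since the exact TAD formulation is not reproduced here, the sketch below follows one plausible reading of the definition above, comparing the direction of change between consecutive documents; the pairing scheme is an assumption.

```python
import numpy as np

def accuracy(gold_levels, pred_levels):
    """Proportion of documents assigned exactly the gold level."""
    return np.mean(np.asarray(gold_levels) == np.asarray(pred_levels))

def tad(gold, pred):
    """Trend accuracy in direction: the fraction of consecutive pairs whose
    estimated change has the same sign as the gold change (an assumed
    pairing; Zhang et al. 2005 define TAD for trend forecasting)."""
    gold_dir = np.sign(np.diff(np.asarray(gold, float)))
    pred_dir = np.sign(np.diff(np.asarray(pred, float)))
    return np.mean(gold_dir == pred_dir)
```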
5.3 Evaluation of the features
Five-fold cross-validation was employed. The data was first split into five sets. One set was used as held-out data on which to predict the reading difficulties of documents, while the rest of the data was used as training data to build a regression model. This procedure was repeated five times, rotating the pairing of training and testing sets.
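The procedure can be sketched as follows with scikit-learn; the feature matrix and labels are random placeholders standing in for the 342 graded documents and the full 47-feature set.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(342, 47))    # placeholder: 342 documents x 47 features
y = rng.uniform(1, 6, size=342)   # placeholder gold grade levels

kf = KFold(n_splits=5, shuffle=True, random_state=0)
fold_rmse = []
for train_idx, test_idx in kf.split(X):  # five train/test pairings
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    pred = model.predict(X[test_idx])
    fold_rmse.append(np.sqrt(np.mean((y[test_idx] - pred) ** 2)))

print(f"mean RMSE over five folds = {np.mean(fold_rmse):.2f}")
```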
To understand the impact of individual features and feature categories, we report the RMSE scores and the correlations for the different feature categories. Table 3 summarizes these results; each row of the table corresponds to one of the feature categories described earlier. The baseline features achieved the best results, with an RMSE of 0.98 and a correlation of 0.82. This category is composed of word number, sentence length, and syllables, which indicates that these features reflect how the majority of experts design learning materials.
Table 3 Results of RMSE and correlation among different feature categories.

Categories      Features              RMSE   r
Baseline        baseline-only         0.98   0.82
Word (AOA)      gept-only             1.24   0.68
Word (AOA)      vq-only               1.25   0.67
Relation        coreference-only      1.48   0.48
Parse           parse-only            1.50   0.46
Grammar (AOA)   grammar-only          1.61   0.31
Semantic        wordnet-only          1.60   0.33
Frequency       bnc_frequency         2.22   0.12
Frequency       google_search_count   4.50   -0.04
The second and third feature categories were GEPT and VQ, both of which are word acquisition grade distribution features. They represent when non-native readers learn words at different ages. It is highly likely that the word acquisition grade distribution is also an important factor in analyzing reading difficulty for non-native readers. Even though the GEPT and VQ features are derived from different resources and classified at coarse and fine grade granularity, respectively, both were highly correlated with the ground truth, and their RMSE values were less than or equal to 1.25. Surprisingly, the GEPT word list, which is divided into only three levels, performs better than VQ, which categorizes words acquired by non-native readers from elementary school to university.
The fourth feature category consisted of the coreference features, which capture the inter-relationships between noun phrases. While the baseline, GEPT, and VQ features only take explicit words into consideration, the coreference features consider not only noun phrase types but also implicit interactions between noun phrases. Feng, Jansche, Huenerfauth, and Elhadad (2010) first proposed coreference and discourse features; in their investigation, noun phrase features and coreference inference features slightly improved performance. Our result was fairly consistent with theirs on second-language materials.
The fifth and sixth feature categories were the parse and grammar features. Both analyze the sentence structures in a document. While the parse features were automatically extracted by a parser, the grammar features were identified by a tree structure search tool based on manually defined grammatical patterns. The results of our grammar features were consistent with Heilman et al. (2007), who found that vocabulary-based features produced more accurate results than grammar-based features alone, but that complex grammar features performed better than simple ones. Even though we collected more than forty grammatical patterns across the six textbook grades (more than the Heilman method), the results indicated that the parse features were slightly better than the grammar features. It is possible that parse features are more robust than grammar features.
The next category consisted of the semantic features. Unfortunately, calculating the number of senses of each word seemed to have little impact on reading difficulty estimation. No matter how many senses a word has, the most important factor is whether the reader understands the specific sense used in a document. A better solution might first determine the sense of each word in a document, and then assess when readers learned those meanings. This may represent a new research problem for future studies.
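For reference, a sense-count feature of this kind can be extracted with NLTK's WordNet interface; whether this matches the exact extraction used in this study is an assumption.

```python
from nltk.corpus import wordnet as wn  # requires nltk.download('wordnet')

def mean_sense_count(tokens):
    """Average number of WordNet senses per word in a document."""
    counts = [len(wn.synsets(t)) for t in tokens]
    counts = [c for c in counts if c > 0]   # skip words WordNet lacks
    return sum(counts) / len(counts) if counts else 0.0

# 'promise' has many senses, yet what matters to a reader is which
# single sense the document actually uses.
print(len(wn.synsets("promise")))
```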
The last remaining features were the frequency features derived from the BNC corpus and the Google search engine. Surprisingly, frequency features were not good indicators for estimating reading difficulty. This contradicts Tanaka-Ishii, Tezuka, and Terada (2010), who used the log frequency obtained from corpora as a feature to predict document reading difficulty. One explanation is that the format of the features and the method of Tanaka-Ishii et al. (2010) may be very different from those in this study. Another explanation is the possibility that lower and higher word frequencies counteract each other when word frequencies are aggregated over a document.
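A toy example illustrates this counteraction effect: a document mixing very common and very rare words can end up with nearly the same average log frequency as a document of uniformly mid-frequency words (the frequencies below are invented).

```python
import numpy as np

# Invented per-million frequencies for the words of two documents.
easy_and_rare = np.log([5000, 4500, 2, 3])    # very common + very rare
mid_frequency = np.log([100, 120, 90, 110])   # uniformly mid-range

print(round(easy_and_rare.mean(), 2))   # ~4.68
print(round(mid_frequency.mean(), 2))   # ~4.65: nearly indistinguishable
```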
The results indicated that the lexical feature categories (the baseline and the word acquisition grade distribution features) produced more accurate results than the structure-based feature categories (the coreference, parse, and grammar features) alone.
5.4 Optimal model selection
To investigate how combining features improves reading difficulty estimation, forward selection was used to select the best subset of features for linear regression, and the Bayesian information criterion (BIC; Schwarz, 1978) was applied to decide the best regression model. If a regression model employs every available feature (47 in total), it becomes overly sensitive to the training data; such a poorly designed model will perform poorly on the testing data. This section examines how this study identified an appropriate model with features that play important roles in determining reading difficulty.
Forward selection was employed to find the optimal model; it starts with the intercept and, at each step, adds the feature that most improves the fit. The detailed rules for this process are as follows:
Step 1. The feature with the highest Pearson correlation coefficient was selected as the first feature in the model.
Step 2. The candidate feature with the highest semi-partial correlation was selected and added to the model.
Step 3. After adding the new feature in Step 2, the squared multiple correlation coefficient (R²) of the new model was calculated.
Step 4. To test whether the new feature contributes significantly to the model, the difference between the new and old R² values was evaluated using analysis of variance (ANOVA).
Step 5. If the increment in Step 4 was statistically significant, the new feature stayed in the model; otherwise, it was removed.
Steps 2 through 5 are repeated until the addition of further features produces no significant improvement; a sketch of this procedure is shown below.
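The following sketch implements the five steps with scikit-learn and an F-test on the R² increment; it is an illustration of the procedure, not the authors' code, and the significance level is an assumed default.

```python
import numpy as np
from scipy.stats import f as f_dist
from sklearn.linear_model import LinearRegression

def forward_select(X, y, alpha=0.05):
    """Greedy forward selection guarded by an ANOVA F-test on the R^2
    increment (Steps 1-5 above)."""
    def r2(cols):
        return LinearRegression().fit(X[:, cols], y).score(X[:, cols], y)

    n = X.shape[0]
    selected, remaining, r2_old = [], list(range(X.shape[1])), 0.0
    while remaining:
        # Maximizing the new R^2 is equivalent to choosing the candidate
        # with the highest (semi-partial) correlation (Steps 1 and 2).
        best = max(remaining, key=lambda j: r2(selected + [j]))
        r2_new = r2(selected + [best])                          # Step 3
        k = len(selected) + 1
        F = (r2_new - r2_old) / ((1.0 - r2_new) / (n - k - 1))  # Step 4
        if f_dist.sf(F, 1, n - k - 1) >= alpha:                 # Step 5
            break
        selected.append(best)
        remaining.remove(best)
        r2_old = r2_new
    return selected
```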
The Bayesian information criterion selects the best model by estimating the Kullback-Leibler divergence between a true model and a proposed model while incorporating the sample size. It also introduces a penalty term for the number of parameters in a model. BIC is denoted as:
BIC = n × ln(RSS / n) + k × ln(n)    (18)
where RSS is the residual sum of squares from the regression model, k denotes the number of features in the model, and n is the sample size. The second term serves to penalize complex models, giving preference to simpler models in selection.
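As a concreteness check, Equation (18) can be evaluated directly against Table 4; it reproduces the reported BIC values when n = 342 and k counts the selected features (excluding the intercept), an interpretation verified numerically below.

```python
import numpy as np

def bic(rss, n, k):
    """Equation (18): BIC = n * ln(RSS / n) + k * ln(n); lower is better."""
    return n * np.log(rss / n) + k * np.log(n)

# Cross-checked against Table 4 with n = 342 documents:
print(round(bic(532.87, n=342, k=1), 2))   # ->  157.50 (first model)
print(round(bic(229.99, n=342, k=5), 2))   # -> -106.52 (fifth model)
```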
Table 4 summarizes the top-performing selective models and the full model. As shown in the second row of the table, the number of words in a document was the feature with the highest correlation, and thus it entered the model first. The remaining features were added to the model in turn, according to their significant individual contributions, as described in the second column of Table 4. As shown, when gept1 was added to the model, the results greatly improved: the RMSE of the second model dropped to 0.96 and the correlation rose to 0.82. This represents a positive contribution from the word acquisition grade distributions. These results imply that, given a document, its level can be accurately estimated by combining the number of words and the proportion of gept1 word-list words in the regression model.
Table 4 Results of the optimal model selection.

Model   Added Feature   RMSE   r      BIC       RSS
1       word_number     1.25   0.67   157.50    532.87
2       gept1           0.96   0.82   -20.19    311.58
3       tree_height     0.87   0.86   -87.77    251.39
4       vq12            0.85   0.86   -99.52    238.79
5       vq13            0.84   0.87   -106.52   229.99
6       proper_noun     0.84   0.87   -104.46   227.47
7       vq15            0.84   0.87   -101.14   225.80
8       vq5             0.85   0.86   -97.86    224.12
9       antecedent      0.85   0.86   -94.88    222.26
10      vq11            0.85   0.86   -91.40    220.74
all                     1.51   0.64   121.91    208.15
Based on the BIC values, the best model combined the following features: the number of words, gept1, tree height, vq12, and vq13. Our results show that this fifth model in Table 4 had the smallest difference between the gold standard and the estimated levels: the RMSE was as small as 0.84 and the correlation, at 0.87, was closest to 1. As other features were added, the performance remained steady until the seventh model, after which it began to decrease. This finding suggests that the word acquisition grade distributions and the average complexity of sentence structures are important factors in reading difficulty and should be taken into consideration. Apart from the average tree height, the gept1 word list refers to words that readers have already learned, while the vq12 and vq13 word lists contain vocabulary that is currently being acquired, corresponding to the specific readers' ability. From these results, we conclude that for non-native readers, previously learned vocabulary, currently acquired vocabulary, and the complexity of sentence structures lead to successful reading difficulty estimation.
To better compare these potential models, the performance of the selected models is presented in Figure 5. The upper half of the figure illustrates the RMSE of the models as the number of features increases, while the lower half shows the correlation between the models and the ground truth. Initially, the RMSE decreases and the correlation rises sharply as features are added. After the most accurate model is identified, the performance on both measures levels off. This can be seen as a great advantage of model selection, since a small number of identified features achieves a satisfying outcome. When the Google search count variable was added to the model (the 29th model), the RMSE rapidly increased and the correlation significantly declined. This implies that frequency in a large corpus, such as Google, might not be as useful as the word acquisition grade distributions in reading difficulty estimation. These results reinforce the findings of previous studies (Huang, Chang, Sun and Chen, 2011), in which the word acquisition grade distributions had greater relative importance than frequency within corpora. After the 29th model, the performance fluctuated and worsened on both measurements compared to the previous models. This indicates that performance becomes unstable when the model over-fits; the error may be due to the fact that these models capture idiosyncrasies of the training data rather than generalities.
Figure 5 The performance of the selected models.
To understand the impact of the feature sets, we also investigated the performance of several regression algorithms with the two different feature sets. These algorithms are available in the WEKA package (Bouckaert, Frank, Hall, Holmes, Pfahringer, Reutemann, … and Sonnenburg, 2010), and include Support Vector Regression (SVR; EL-Manzalawy and Honavar, 2005), Sequential Minimal Optimization for Regression (SMOreg; Shevade, Keerthi, Bhattacharyya and Murthy, 2000), Pace Regression (Wang and Witten, 2002), and linear regression. All parameters were left at their default settings. Adopting regression as a reading difficulty model assumes that the output scores are continuous and related to each other. Table 5 shows the results of these regressions. All methods with only the optimal features outperformed those with all features, which indicates that the optimal feature set helps regression estimates. In addition, the results of the linear regression outperformed those of SVR. This finding contrasts with Kate et al. (2010), who noted similar performance among regression algorithms. SMOreg, an SVR trained by Sequential Minimal Optimization, had the best performance among the models with all features; however, linear regression with the optimal features matched the results of SMOreg and Pace Regression. This supports our contribution: once the optimal feature set is identified, the performance of various regression techniques is similar.
Table 5 Results of the regression algorithms with all features and with the optimal features.

Method                                    All Features    Optimal Features
                                          RMSE    r       RMSE    r
Support Vector Regression (nu-SVR)        1.66    0.22    1.24    0.60
Support Vector Regression (epsilon-SVR)   1.65    0.25    1.42    0.59
SMO for Regression (SMOreg)               1.25    0.73    0.84    0.87
Pace Regression                           1.95    0.52    0.84    0.87
Linear Regression (Proposed)              1.51    0.64    0.84    0.87
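For readers without a WEKA setup, a rough Python analogue of this comparison is sketched below using scikit-learn; Pace Regression and SMOreg have no direct scikit-learn equivalents, and the data here are random placeholders.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.svm import NuSVR, SVR
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
X = rng.normal(size=(342, 5))    # placeholder optimal-feature matrix
y = rng.uniform(1, 6, size=342)  # placeholder grade levels

for name, est in [("nu-SVR", NuSVR()), ("epsilon-SVR", SVR()),
                  ("Linear Regression", LinearRegression())]:
    pred = cross_val_predict(est, X, y, cv=5)  # five-fold estimates
    rmse = np.sqrt(np.mean((y - pred) ** 2))
    r, _ = pearsonr(y, pred)
    print(f"{name}: RMSE={rmse:.2f}, r={r:.2f}")
```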
5.5 Reading difficulty estimation as classification
The proposed model can also be cast as a multiclass classification. The labels were determined by eight levels ranging from zero to seven: zero represents documents below the specific readers' ability, such as elementary school textbooks; seven represents documents above the specific readers' ability, such as college textbooks; the remaining levels correspond to the semester grades of senior high school. First, thresholds were found, and each estimated score was assigned to the closest level according to these thresholds. The minimum threshold was set to the minimum value in the training dataset; likewise, the maximum threshold was set to the maximum value in the training dataset.
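A sketch of this thresholding step follows. The outermost thresholds come from the training minimum and maximum as described above; deriving the inner boundaries as midpoints between the mean scores of adjacent levels is an assumption, since that detail is not spelled out here.

```python
import numpy as np

def fit_thresholds(train_scores, train_levels):
    """Outer thresholds from the training min/max; inner boundaries as
    midpoints between mean scores of adjacent levels (assumed scheme)."""
    levels = np.unique(train_levels)
    means = [train_scores[train_levels == lv].mean() for lv in levels]
    inner = [(a + b) / 2 for a, b in zip(means[:-1], means[1:])]
    return np.array([train_scores.min(), *inner, train_scores.max()])

def assign_level(score, thresholds):
    """Map a continuous estimate to the eight-level scheme: below the
    minimum -> 0, above the maximum -> 7, otherwise one of the six
    semester grades."""
    if score < thresholds[0]:
        return 0
    if score > thresholds[-1]:
        return 7
    return int(np.searchsorted(thresholds[1:-1], score)) + 1
```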
To compare the performance of the proposed estimation with other studies, our experiment also compared the estimated levels with the Flesch Reading Ease (Flesch 1948), Flesch–Kincaid Grade Level (Kincaid et al. 1975), Coleman–Liau (Coleman and Liau 1975), Lexile (Stenner 1996), and the Heilman method (Heilman et al. 2007). The Flesch–Kincaid Grade Level, Flesch Reading Ease, and Coleman–Liau formulas were re-implemented, while Lexile and the Heilman method are available online. All of these methods are designed for native readers. In the training phase, the output score that each estimation generates for each document was computed and, following the same procedure as the proposed method, the thresholds between the levels of each estimation were derived. During the testing phase, the estimated levels of the testing documents were determined using these thresholds.
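The two Flesch formulas that were re-implemented are standard and can be written directly; the document counts in the example are invented, and a syllable counter is needed in practice.

```python
def flesch_reading_ease(words, sentences, syllables):
    """Flesch (1948): higher scores indicate easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)

def flesch_kincaid_grade(words, sentences, syllables):
    """Kincaid et al. (1975): the output is a U.S. school grade level."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

# An invented document with 120 words, 8 sentences, and 170 syllables.
print(round(flesch_reading_ease(120, 8, 170), 1))   # -> 71.8
print(round(flesch_kincaid_grade(120, 8, 170), 1))  # -> 7.0
```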
With respect to accuracy and RMSE, we expected the proposed estimation to produce a clearly more accurate reading difficulty prediction than the other estimations. With respect to TAD, we expected the proposed estimation to be consistent with the ground truth, although it might tend to predict easy documents with a lower grade and difficult documents with a higher grade. With respect to the correlation coefficient, we expected the proposed estimation to report a markedly higher correlation than the other estimations, suggesting that its relationship with the ground truth is stronger. In summary, we expected existing difficulty estimation methods to perform poorly for second-language learners, possibly due to the different and insufficient features they use.
Table 6 shows the results of the proposed estimation and the other estimations. In terms of accuracy and RMSE, the proposed estimation clearly produced a more accurate reading difficulty prediction than the other estimations. When the proposed estimation fails to predict the correct reading difficulty, its errors almost always fall within one grade; by comparison, the errors of the Flesch–Kincaid Grade Level, Lexile, and the Heilman method ranged between one and two grades, and the Flesch Reading Ease and Coleman–Liau had even wider error ranges. In terms of TAD, the proposed estimation was consistent with the ground truth, although it tended to predict easy documents with a lower grade and difficult documents with a higher grade; in contrast, the results of the other methods fluctuated. In terms of the correlation coefficient, all estimations were positively correlated, but the proposed estimation reported a particularly high correlation of 0.87, whereas the others were below 0.5. This suggests that the relationship between the proposed estimation and the ground truth is stronger than for the others. In summary, existing difficulty estimation methods perform poorly for non-native readers, which may be due to the different and insufficient features they use.
Table 6 Comparison between the estimations.

Estimations                                          RMSE   r      Accuracy   TAD
Flesch Reading Ease                                  2.17   0.27   0.28       0.40
Flesch–Kincaid Grade Level                           1.85   0.48   0.26       0.49
Coleman–Liau                                         2.16   0.31   0.24       0.41
Heilman                                              1.84   0.41   0.26       0.43
Lexile                                               1.76   0.46   0.33       0.49
Model 5 (word_number+gept1+tree_height+vq12+vq13)    1.01   0.87   0.42       0.68