Chapter 5 Token-level Chinese WUE Detection
5.7 Results and Analysis
Table 5-2 Performance of the LSTM/Bi-LSTM sequence labeling models with different sets of features.
5.7.1 Overall Results
Table 5-2 shows the performance of our WUE detection models with different input features. The random baseline is a system that randomly chooses one token as the incorrect position. The LSTM model using only randomly initialized word embeddings substantially outperforms the random baseline. The pre-trained CBOW/SG word embeddings do not appear to help, leading to detection performance slightly lower than that of the model with randomly initialized word embeddings. For both CBOW and SG, introducing the POS sequence improves the detection accuracy by about 2% and also improves all other measurements. The n-gram features further increase the accuracy by about 1%.
On the other hand, the CWIN and Struct-SG embeddings are very powerful by themselves. Incorporating the POS and n-gram features leads to only slight improvements in accuracy. Despite their small impact on accuracy, the n-gram features bring clear improvements in the hit@2 and hit@20% rates, indicating that they help the model promote the rank of the ground-truth position. Under the same set of features, all models with CWIN/Struct-SG significantly outperform their CBOW/SG counterparts (𝑝 < 0.05).
Bidirectional LSTM (Bi-LSTM) further improves on LSTM. Bi-LSTM with CWIN+POS features achieves the best accuracy and MRR, and significantly outperforms its LSTM counterpart (𝑝 < 0.005). Bi-LSTM with CWIN+POS+n-gram features achieves the best hit@2 and hit@20%.
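The ranking measurements discussed above can be sketched as follows. This is a minimal illustration, assuming that accuracy, MRR, and hit@k are all computed from the 1-based rank that the model assigns to the ground-truth error position in each segment; the function names and the toy ranks are hypothetical and are not taken from the experiments.

```python
from typing import Sequence


def accuracy(gold_ranks: Sequence[int]) -> float:
    """Fraction of segments whose ground-truth position is ranked first."""
    return sum(r == 1 for r in gold_ranks) / len(gold_ranks)


def mrr(gold_ranks: Sequence[int]) -> float:
    """Mean reciprocal rank of the ground-truth position."""
    return sum(1.0 / r for r in gold_ranks) / len(gold_ranks)


def hit_at_k(gold_ranks: Sequence[int], k: int) -> float:
    """Fraction of segments whose ground-truth position is in the top k."""
    return sum(r <= k for r in gold_ranks) / len(gold_ranks)


# Hypothetical 1-based ranks of the ground-truth position in four segments.
toy_ranks = [1, 2, 1, 4]
print(accuracy(toy_ranks), mrr(toy_ranks), hit_at_k(toy_ranks, 2))
```

Under these definitions a feature can leave accuracy almost unchanged while still raising hit@2, as long as it moves the ground-truth position from a lower rank up to second place.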
5.7.2 LSTM vs. Bi-LSTM on Segments with Different Lengths
Length (# tests)    # proposed    LSTM      Bi-LSTM
< 10 (645)          1             0.7426    0.7659
10 ~ 14 (317)       2             0.6908    0.7319
15+ (89)            3+            0.7416    0.7079
Table 5-3 Hit@20% rates of LSTM and Bi-LSTM on segments with different lengths.
To take a closer look at what is gained by moving from LSTM to bidirectional LSTM, we compare the two models on segments of different lengths in Table 5-3. We use the versions with the full feature set and report hit@20% rates. Bi-LSTM yields some improvement on short (< 10 tokens) segments and a larger improvement on mid-length (10 ~ 14 tokens) ones. On longer segments (15 tokens or more) Bi-LSTM scores lower than LSTM, but such segments are relatively rare, since foreign learners seldom construct complex sentences.
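The relation between segment length and the number of proposed positions in Table 5-3 can be sketched as below. The rule used here (20% of the segment length, rounded down, with at least one candidate) is an assumption inferred from the "# proposed" column, and the function names and example scores are hypothetical.

```python
from typing import Sequence


def num_proposed(segment_length: int) -> int:
    """Number of candidate positions allowed by hit@20%.

    Assumption: 20% of the length rounded down, but never fewer than one.
    Integer division avoids floating-point rounding issues.
    """
    return max(1, segment_length // 5)


def hit_at_20_percent(scores: Sequence[float], gold_index: int) -> bool:
    """True if the ground-truth position is among the proposed candidates."""
    k = num_proposed(len(scores))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return gold_index in ranked[:k]


# Hypothetical incorrectness scores for a 10-token segment (two proposals).
toy_scores = [0.05, 0.30, 0.02, 0.25, 0.03, 0.04, 0.10, 0.08, 0.06, 0.07]
print(num_proposed(len(toy_scores)), hit_at_20_percent(toy_scores, gold_index=3))
```

With this rule, segments shorter than 10 tokens are evaluated like plain accuracy, which is why hit@20% becomes more lenient only on mid-length and long segments.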
5.7.3 Relationship between Top Two Candidates
We previously justified the use of the hit@2 metric by pointing out that a WUE usually involves a pair of words that depend on each other. We can verify whether the top two candidates proposed by our model are closely related by examining their dependency distance. We take the output of the Bi-LSTM model with CWIN+POS+n-gram features and analyze the error cases where the model ranks the ground-truth error position second.
We use the dependency parsing result to construct an undirected graph, where each dependency corresponds to an edge, and calculate the shortest distance between the top two candidates in these cases. Figure 5-3 shows the undirected dependency graph of segment (S5-2). The shortest path distance between the tokens “知識” and “差” is 1.
Dependency Parsing Result: relcl(知識, 學習), mark(學習, 的), nsubj(差, 知識), advmod(差, 也), advmod(差, 很)
Undirected Dependency Graph: one edge per dependency above (知識-學習, 學習-的, 差-知識, 差-也, 差-很)
Figure 5-3 An example of undirected dependency graph.
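The distance computation on the graph of Figure 5-3 can be sketched with a breadth-first search. This is a minimal illustration rather than the implementation used in the experiments: the function names are hypothetical, and nodes are keyed by token strings for readability, which assumes that no token occurs twice in the segment (token positions would be the safer key in practice).

```python
from collections import defaultdict, deque

# Dependencies listed in Figure 5-3, as (head, dependent) pairs.
DEPENDENCIES = [
    ("知識", "學習"),  # relcl
    ("學習", "的"),    # mark
    ("差", "知識"),    # nsubj
    ("差", "也"),      # advmod
    ("差", "很"),      # advmod
]


def build_graph(dependencies):
    """Turn each dependency into an undirected edge."""
    graph = defaultdict(set)
    for head, dependent in dependencies:
        graph[head].add(dependent)
        graph[dependent].add(head)
    return graph


def dis(graph, source, target):
    """Shortest path length between two tokens, or None if unreachable."""
    if source == target:
        return 0
    visited = {source}
    queue = deque([(source, 0)])
    while queue:
        node, distance = queue.popleft()
        for neighbour in graph[node]:
            if neighbour == target:
                return distance + 1
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append((neighbour, distance + 1))
    return None


graph = build_graph(DEPENDENCIES)
print(dis(graph, "知識", "差"))
```

Because every edge has unit weight, breadth-first search is sufficient and no weighted shortest-path algorithm is needed.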
The dependency analysis results are summarized in Table 5-4. 𝑎 denotes the ground-truth error position. 𝑐1 and 𝑐2 denote the first and the second candidate positions proposed by the model. 𝑑𝑖𝑠(𝑐1, 𝑐2) is the distance between 𝑐1 and 𝑐2 on the dependency graph. The average distance (2.07) is small compared to the average length of the segments (9.24), indicating that our model can consider the dependencies among words when ranking the candidate positions. Moreover, among the 𝑐2 = 𝑎 cases, more than one third of them have 𝑑𝑖𝑠(𝑐1, 𝑐2) = 1. This means that the top two candidates are closely related as in the case of “知識” and “差” in (S5-2).
# correct (𝑐1 = 𝑎)                            520 (49.48%)
# tests where 𝑐2 = 𝑎                         339 (32.25%)
Average 𝑑𝑖𝑠(𝑐1, 𝑐2) when 𝑐2 = 𝑎            2.07
# tests where 𝑐2 = 𝑎 and 𝑑𝑖𝑠(𝑐1, 𝑐2) = 1   129 (12.27%)
Table 5-4 Summary of the analysis of the dependency between the top two candidates.
5.7.4 Effectiveness of POS Features
POS (# tests)    CWIN      CWIN+POS
VV (325)         0.8123    0.8185
NN (282)         0.6879    0.7447
AD (134)         0.6194    0.7015
Table 5-5 Hit@20% rates of Bi-LSTM models with or without POS features on the three most frequent POS tags of the erroneous token.
A factor that might limit the effectiveness of POS features is that a POS tagger trained on well-formed text may not perform well on noisy learner data. In fact, for 26.7% of the test data, the POS tag of the original erroneous token differs from that of its corrected version. We compare the performance of Bi-LSTM with and without POS features on the three most frequent POS tags in Table 5-5. As can be seen, the POS information of the erroneous segment, although it potentially contains errors itself, can still help detect anomalies in the segment. In example (S5-6) we show the incorrectness scores predicted by the models with and without POS features. The erroneous token 盡力 receives the highest score only when the "DEC + AD" POS information is available.
(S5-6)     應該     有       別人     的       *盡力
POS tag    VV       VE       NN       DEC      AD
w/o POS    0.048    0.226    0.030    0.016    0.042
w/ POS     0.010    0.066    0.031    0.071    0.077
(There should be someone else’s *utmost.)
5.7.5 Effectiveness of N-gram Features
Although very powerful in the segment-level detection task, the Google n-gram features lead to only a small improvement in accuracy and MRR according to Table 5-2. In the case of Bi-LSTM with CWIN+POS features, including n-gram features even results in slightly lower accuracy and MRR.
We find that a segmentation error somewhere in the segment can affect the model’s decision. For example, in (S5-7) the token “就在這時候” is an incorrect merge of several words. The Bi-LSTM model with CWIN+POS selects the token “起來”, which is the ground-truth error position, but the model with CWIN+POS+n-gram selects the OOV token “就在這時候”. Note that “起來” is still given the second highest incorrectness score, which may explain the improvement in the hit rates after incorporating n-gram features.
(S5-7)      他        就在這時候    想        *起來     一        個        辦法
OOV         0         1             0         0         0         0         0
CWIN+POS    0.0228    0.0860        0.0466    0.5440    0.0105    0.0712    0.0885
+ n-gram    0.0488    0.2677        0.1427    0.2540    0.0150    0.1158    0.0319
(At this moment, he thought *of a way.)
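The effect described for (S5-7) can be checked directly from the scores listed in the example. The sketch below ranks the ground-truth position under both feature sets; the helper name `rank_of` and the variable names are hypothetical, while the tokens and scores are copied from (S5-7).

```python
TOKENS = ["他", "就在這時候", "想", "起來", "一", "個", "辦法"]
GOLD_INDEX = 3  # position of the erroneous token 起來

# Incorrectness scores copied from example (S5-7).
CWIN_POS = [0.0228, 0.0860, 0.0466, 0.5440, 0.0105, 0.0712, 0.0885]
CWIN_POS_NGRAM = [0.0488, 0.2677, 0.1427, 0.2540, 0.0150, 0.1158, 0.0319]


def rank_of(scores, index):
    """1-based rank of a position when sorting by descending score."""
    return 1 + sum(score > scores[index] for score in scores)


def top_token(scores):
    """Token that the model would select as the error position."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return TOKENS[best]


for name, scores in [("CWIN+POS", CWIN_POS), ("+ n-gram", CWIN_POS_NGRAM)]:
    print(name, top_token(scores), rank_of(scores, GOLD_INDEX))
```

In this example the segment counts as an accuracy miss for the model with n-gram features but still as a hit under hit@2.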
Even when there is no segmentation error, making a decision based on n-gram probabilities is still rather complex. We use (S5-8) as an example.
(S5-8)    哪個    城市      都        有        特色      的        氣氛
2gram     -1      0.0139    0.0037    0.0544    0.0013    0.1020    0.0002
3gram     -1      -1        0.0317    0.2462    0.0002    0.3981    0.000039
(Every city has *specialty atmosphere.)
Regarding bigram probabilities, the erroneous token “特色” is not the one given the lowest probability. Instead, the bigram “的 氣氛” has the lowest probability, since many words can follow “的”. Thus, the probability of any particular successor, “氣氛” in this case, would be low. A low probability does not indicate a wrong usage when such a “general” word as “的” is involved, so making a decision simply based on a probability threshold would not work. One might expect that the WUE in (S5-8) can be recognized if we consider trigrams, since “特色 的 氣氛” is a local wrong usage. However, the trigram probability does not drop until the last token “氣氛”, which makes it difficult to trace the problem to the inappropriate usage of “特色”.
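The difficulty can be illustrated with the values listed in (S5-8). The sketch below picks the token with the lowest n-gram probability, which is what a naive lowest-probability rule would flag. It assumes that -1 marks positions where no n-gram is available and that each value belongs to the n-gram ending at that token; the function name is hypothetical.

```python
TOKENS = ["哪個", "城市", "都", "有", "特色", "的", "氣氛"]
ERRONEOUS_TOKEN = "特色"

# Probabilities copied from example (S5-8); -1 is assumed to mean "not available".
BIGRAM = [-1, 0.0139, 0.0037, 0.0544, 0.0013, 0.1020, 0.0002]
TRIGRAM = [-1, -1, 0.0317, 0.2462, 0.0002, 0.3981, 0.000039]


def lowest_probability_token(tokens, probabilities):
    """Token whose n-gram has the lowest available probability."""
    candidates = [(p, token) for token, p in zip(tokens, probabilities) if p >= 0]
    return min(candidates)[1]


for name, probabilities in [("2gram", BIGRAM), ("3gram", TRIGRAM)]:
    flagged = lowest_probability_token(TOKENS, probabilities)
    print(name, flagged, flagged == ERRONEOUS_TOKEN)
```

In both rows the minimum falls on “氣氛” rather than on “特色”, so a rule that simply selects the least probable n-gram would point at the wrong token in this segment.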
Although the model has access to word and POS information, which can be used for determining different thresholds for different situations, it seems that the model is better at matching word and POS patterns where the Chinese learners are more likely to make mistakes.
In sum, if a segment contains some WUEs, it is more likely to have low overall n-gram probabilities, which makes the n-gram features useful in the segment-level detection task. Nevertheless, when the model needs to determine the error position, the noise from segmentation errors may lead to wrong results. Moreover, it can be complex to interpret the n-gram probabilities and trace which token causes the WUE.
5.7.6 Performance on Commonly Misused Words
Word                 Error rate       Precision        Recall
產生 (generate)      0.571 (8/14)     0.700 (7/10)     0.875 (7/8)
經驗 (experience)    0.500 (5/10)     0.667 (4/6)      0.800 (4/5)
發生 (happen)        0.455 (5/11)     0.571 (4/7)      0.800 (4/5)
而 (so)              0.417 (20/48)    0.550 (11/20)    0.550 (11/20)
Table 5-6 Precision/recall on the four most commonly misused words.
In Table 5-6 we show the precision/recall of the Bi-LSTM model with CWIN+POS features on the four most commonly misused words (error rate above 0.4). The error rate of a word 𝑤 is calculated on the test set by
$$\mathrm{err\_rate}(w) = \frac{\#\ \text{segments in which } w \text{ is misused}}{\#\ \text{segments containing } w}$$
We exclude words that occur in fewer than 10 segments regardless of their error rates. In general, our model achieves high recall and fair precision. Discriminating between correct and wrong usage of the conjunction 而 (so), which often connects more than one segment, seems to be the most difficult. For example, in (S5-9) the inappropriateness of 而 cannot be recognized unless we consider the wider context of this segment.
(S5-9) (*而,並) 當成此生做人的道理
( ..., (*so,and) take it as a lifelong way to behave around others. )
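The word selection and the figures in Table 5-6 can be reproduced from the raw counts with a short sketch. The thresholds (error rate above 0.4, at least 10 segments) are those stated in this section; the function names and the tuple layout are hypothetical, and the counts are copied from the table.

```python
def err_rate(misused: int, containing: int) -> float:
    """Share of the segments containing a word in which the word is misused."""
    return misused / containing


def is_commonly_misused(misused: int, containing: int,
                        min_segments: int = 10, threshold: float = 0.4) -> bool:
    """Selection rule of this section: frequent enough and error rate above 0.4."""
    return containing >= min_segments and err_rate(misused, containing) > threshold


# (word, misused, containing, correctly flagged, flagged) from Table 5-6.
ROWS = [
    ("產生", 8, 14, 7, 10),
    ("經驗", 5, 10, 4, 6),
    ("發生", 5, 11, 4, 7),
    ("而", 20, 48, 11, 20),
]

for word, misused, containing, correct, flagged in ROWS:
    precision = correct / flagged  # correct detections among flagged uses
    recall = correct / misused     # correct detections among misused uses
    print(word, round(err_rate(misused, containing), 3),
          round(precision, 3), round(recall, 3))
```

For 而 the numbers of flagged and misused occurrences coincide (20 each), which is why its precision and recall are identical in the table.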