
6.2 Multivariate Analysis by Language Models


In order to reduce possible sampling errors, whenever the dataset needs to be divided into a training and a test set, we select a simple random sample (SRS) of a given ratio 1/r as the test set; the remainder is used for training.

Figure 14 illustrates the overall flow of dataset preparation. At this stage, we collect all clauses that pass the univariate and bivariate rules. They are, of course, composed of both true and false positives. We then divide them into a question set S′Q and a non-question set S′NQ. Each is further divided into a training set and a test set by SRS. Our training process then focuses on the two training sets: the question training set S′Q,tr and the non-question training set S′NQ,tr.
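As a minimal sketch of this division step in Python: the names questions and non_questions stand for S′Q and S′NQ, and r = 5 is an arbitrary illustrative ratio, not a value taken from the experiments.

    import random

    def srs_split(clauses, r, seed=None):
        """Take a simple random sample of ratio 1/r as the test set;
        the remainder becomes the training set."""
        rng = random.Random(seed)
        test_idx = set(rng.sample(range(len(clauses)), len(clauses) // r))
        train = [c for i, c in enumerate(clauses) if i not in test_idx]
        test = [c for i, c in enumerate(clauses) if i in test_idx]
        return train, test

    # S'Q and S'NQ are divided independently.
    sq_tr, sq_te = srs_split(questions, r=5)
    snq_tr, snq_te = srs_split(non_questions, r=5)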

Next, at the training stage, we train a pair of competing language models on S′Q,tr and S′NQ,tr; let us call them LMQ and LMNQ, respectively.

Figure 14: Preparing training and test sets by simple random sampling. (The corpus S passes through the univariate and bivariate rules to form the sub-corpus S′, which is split into S′Q and S′NQ; SRS then divides each into a training portion of 1 − 1/r and a test portion of 1/r.)

Finally, let’s take a look at the detection stage.

Traditionally, perplexity (or more precisely, cross-perplexity) is used as a measure of how close a language model is to its theoretically perfect model. Let two candidate language models LM1 and LM2 be constructed from the same training set and then evaluated on the same test set. We say that LM1 is, with regard to the perfect model, better at modeling the dataset than LM2 if their perplexity values satisfy p1 < p2, and vice versa. The concept is illustrated in Figure 15a.
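In the standard formulation, the perplexity of a model LM on a test sequence w1 ... wN is

    \mathrm{PP}_{\mathrm{LM}}(w_1 \dots w_N)
      = P_{\mathrm{LM}}(w_1 \dots w_N)^{-1/N}
      = \exp\Bigl(-\frac{1}{N}\sum_{i=1}^{N}
          \log P_{\mathrm{LM}}(w_i \mid w_{i-n+1},\dots,w_{i-1})\Bigr),

so a lower value means the model assigns higher probability to the data, i.e., fits it more closely.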

Now let us reverse the evaluation direction. Given a sentence s, it is evaluated by both LMQ and LMNQ, yielding two perplexity values p1 and p2, respectively. Assume that both LMQ and LMNQ are good approximations to their perfect models. Since perplexity can be considered a measure of how well a language model fits s, it follows that if p1 < p2, then LMQ is a better match for s than LMNQ, and vice versa. Therefore we use perplexity as a criterion to classify s as either a question (modeled by LMQ) or a non-question (modeled by LMNQ). The concept is illustrated in Figure 15b.
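As a sketch, the decision rule is simply a perplexity comparison. Here lm_q and lm_nq are assumed to be model objects exposing a perplexity method over the n-grams of s (a convention matched by the training sketch below); ties default to non-question.

    def classify(sentence_ngrams, lm_q, lm_nq):
        """Label s by whichever model finds it less perplexing."""
        p1 = lm_q.perplexity(sentence_ngrams)
        p2 = lm_nq.perplexity(sentence_ngrams)
        return "question" if p1 < p2 else "non-question"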

This approach works under the assumption that both LMQ and LMNQ are good approximations to their perfect models. It follows that its performance relies on how good the language models are and how well they discriminate between question and non-question cases. Here we consider two types of language modeling techniques. The first is a trigram model with Good-Turing discounting and Katz backoff for smoothing (see [41, Chapter 6] and [28, Chapter 6] for more details). The second is an interpolated smoothing model, since it has been reported in [6, 25] that interpolated Kneser-Ney smoothing (including higher-order n-gram models, especially 5-gram) performs better than many others in every situation the authors examined. Whenever possible, we experiment with three configurations: trigram, 4-gram, and 5-gram.
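The thesis does not name a toolkit, but as one possible realization, an interpolated Kneser-Ney model of a given order can be fitted with NLTK's nltk.lm module; sq_tr and snq_tr are the tokenized training sets from the earlier sketch.

    from nltk.lm import KneserNeyInterpolated
    from nltk.lm.preprocessing import padded_everygram_pipeline

    def train_lm(tokenized_sents, order=5):
        """Fit an interpolated Kneser-Ney n-gram model of the given order."""
        ngrams, vocab = padded_everygram_pipeline(order, tokenized_sents)
        lm = KneserNeyInterpolated(order)
        lm.fit(ngrams, vocab)
        return lm

    lm_q = train_lm(sq_tr)    # LMQ, trained on S'Q,tr
    lm_nq = train_lm(snq_tr)  # LMNQ, trained on S'NQ,tr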

There are still some variations in detail that need consideration when constructing the language models. Here we consider two: tag vs. word, and tag unification.

Figure 15: Using language models to discriminate questions. (a) Traditional use of language models: LM1 and LM2 are trained on the same data and compared by the perplexities p1 and p2 they assign to the same test sentences s = w1 w2 w3 .... (b) Our approach: LMQ is trained with the question set and LMNQ with the non-question set; a sentence s is scored by both models, yielding p1 and p2.

Table 12: Different configurations used in our language modeling experiments

  Dataset           Good-Turing/Katz    Interpolated Kneser-Ney
                    trigram             trigram    4-gram     5-gram
  word              GT-w                IKN3-w     IKN4-w     IKN5-w
  tag               GT-t                IKN3-t     IKN4-t     IKN5-t
  tag unification   GT-tx               IKN3-tx    IKN4-tx    IKN5-tx

Data sparseness causes problems in nearly every language modeling technique. Since a training set is sparser when composed of a series of words than when composed of a series of POS tags, we suspect that a language model constructed over the POS tags of words may be better than one constructed over the words themselves. Therefore, both approaches are used for comparison.
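Concretely, the two settings differ only in the token stream fed to the model. Assuming each clause is represented as a list of (word, POS tag) pairs (a hypothetical representation; the tags shown are merely illustrative):

    # One annotated clause as (word, POS tag) pairs.
    clause = [("你", "Nh"), ("來", "VA"), ("了", "T"), ("嗎", "T")]

    word_stream = [w for w, t in clause]  # word-level training tokens
    tag_stream = [t for w, t in clause]   # tag-level training tokens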

In addition, the Sinica corpus at times assigns different POS tags to the same type of univariate feature. Taking A-not-A words as an example, one such word is assigned a D tag while another is assigned VH. As a consequence, we suspect that it may be inappropriate to train the language models in terms of the original tagset assignment of the corpus. To verify this, we conduct a pair of experiments to see whether performance differs when such varied tags are unified into a single one (named the “XXX” tag for convenience).
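A minimal sketch of this unification, reusing the (word, tag) representation above; is_feature_word is a hypothetical predicate that recognizes the univariate feature words in question:

    UNIFIED_TAG = "XXX"  # the single artificial tag

    def unify_tags(clause, is_feature_word):
        """Replace the POS tag of every recognized feature word with XXX."""
        return [(w, UNIFIED_TAG if is_feature_word(w) else t)
                for w, t in clause]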

Putting these together, we experiment with several configurations, as summarized in Table 12.

Finally, the performance of language models depends on the selection of training and test sets. Therefore, the whole SRS-division/training/evaluation process is repeated n times (e.g., n = 20) to assess the stability of this approach across different training/test configurations.
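Under the same assumptions as the sketches above (evaluate is a hypothetical function returning the precision of one complete division/training/evaluation round), the repetition amounts to:

    import statistics

    precisions = []
    for run in range(20):  # n = 20 repetitions
        sq_tr, sq_te = srs_split(questions, r=5, seed=run)
        snq_tr, snq_te = srs_split(non_questions, r=5, seed=run)
        lm_q, lm_nq = train_lm(sq_tr), train_lm(snq_tr)
        precisions.append(evaluate(lm_q, lm_nq, sq_te, snq_te))

    print(statistics.mean(precisions), statistics.stdev(precisions))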

6.2.1 Particles and Interjections

As stated in Section 5.2.1, some sentence-final particles and interjections perform not only questioning but also euphemism, irony, exclamation, or other illocutionary acts. Since linguists disagree on a qualitative analysis and explanation of the precise way to distinguish among them, we try another, quantitative route.


Figure 16: The result of using language models to discriminate the case of sentence-final particles. Since all IKNn-w runs produce the same outcome, only IKN5-w is shown in this figure; the same for IKN5-t. In the “Before” cases, average precision = 46.69% and standard deviation = 1.09. In the GT-t and IKN5-t cases, average precision = 66.56% and standard deviation = 1.14. In the GT-w and IKN5-w cases, average precision = 77.40% and standard deviation = 0.80.


The outcome of 20 experiments is shown in Figure 16. On average, precision increases from 46.69% to 66.56% when applying any language modeling technique at the tag level, and to 77.40% at the word level. All language modeling techniques we use at the same level give identical performance over the 20 runs, though interpolated Kneser-Ney smoothing yields lower average perplexity.

6.2.2 A-not-A Questions and Simplified Forms

The Sinica corpus assigns a variety of POS tags to different A-not-A words; for instance, one A-not-A word receives a D tag while another receives VH. Our treatment of A-not-A forms differs from that in the corpus (see Section 5.2.2), and our definition of A-not-A forms is also broader (see Section 5.2.3). As a consequence, it may be inappropriate to train the language models in terms of the original tagset of the corpus. To test this suspicion, we conduct a pair of experiments to see whether there is any performance improvement from unifying the various A-not-A tags into a single one.


Figure 17: The result of using language models to discriminate the case of A-not-A questions. Since all IKNn-w runs produce the same outcome, only IKN5-w is shown in this figure; the same for IKN5-t and IKN5-tx. In the “Before” cases, average precision = 35.08% and standard deviation = 1.60. In the IKN5-w cases, average precision = 53.88% and standard deviation = 3.48. In the IKN5-t cases, average precision = 65.40% and standard deviation = 2.97. In the IKN5-tx cases, average precision = 67.19% and standard deviation = 2.81. A pairwise Student’s t-test on IKN5-t and IKN5-tx produces p = 0.00051 < 0.001, implying a statistically significant improvement.


The outcome of 20 experiments is shown in Figure 17. Since all language modeling techniques used here at the same word or tag level produce the same outcome, we show only interpolated Kneser-Ney smoothing of order 5 (IKN5) for brevity. On average, precision increases from 35.08% to 53.88% when applying IKN5-w, to 65.40% when applying IKN5-t, and up to 67.19% when applying IKN5-tx. A pairwise Student’s t-test on the two language models IKN5-t and IKN5-tx produces p = 0.00051 < 0.001, implying a statistically significant improvement from unifying the various A-not-A tags into a fixed artificial one.
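Such a paired comparison can be computed with, e.g., SciPy; prec_t and prec_tx are assumed to hold the 20 per-run precision values of IKN5-t and IKN5-tx (note that ttest_rel is two-sided by default):

    from scipy.stats import ttest_rel

    # Pairwise (paired) Student's t-test over the 20 matched runs.
    t_stat, p_value = ttest_rel(prec_tx, prec_t)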

6.2.3 WH Questions

As we have seen in Table 4 and Section 5.2.4, words of this type receive a variety of POS tags in the Sinica corpus, e.g., 什麼 (Nep), 為什麼 (D), and 怎麼樣 (VH). As a consequence, it may be inappropriate to train the language models in terms of the tagset of the corpus. To test this suspicion, we conduct a pair of experiments to see whether there is any performance improvement from unifying the various WH tags into a single one.

The outcome of 20 experiments is shown in Figure 18. Since all language modeling techniques used here at the same word or tag level produce the same outcome, we show only interpolated Kneser-Ney smoothing of order 5 (IKN5) for brevity. On average, precision increases from 41.23% to 69.52% when applying IKN5-w, to 72.93% when applying IKN5-t, and up to 73.97% when applying IKN5-tx. A pairwise Student’s t-test on the two language models IKN5-t and IKN5-tx produces p = 1.51 × 10⁻⁵ < 0.001, implying a statistically significant, though very small, improvement from changing the POS tags.

Figure 18: The result of using language models to discriminate the case of WH questions. Since all IKNn-w runs produce the same outcome, only IKN5-w is shown in this figure; the same for IKN5-t and IKN5-tx. In the “Before” cases, average precision = 41.23% and standard deviation = 0.61. In the IKN5-w cases, average precision = 69.52% and standard deviation = 1.05. In the IKN5-t cases, average precision = 72.93% and standard deviation = 0.92. In the IKN5-tx cases, average precision = 73.97% and standard deviation = 1.05. A pairwise Student’s t-test on IKN5-t and IKN5-tx produces p = 1.51 × 10⁻⁵ < 0.001, implying a statistically significant improvement.

6.2.4 Evaluative Adverbs and Rhetorical Questions

As we have seen in Section 5.2.7, most, if not all, words of this type are adverbs. However, not all adverbs that appear in a sentence belong to this type. Again we wonder whether it is better to train the language models with the POS tags of these words unified into a single tag distinct from other types of adverbs. To verify this, we conduct a pair of experiments to see whether there is any performance improvement.

The outcome of 20 experiments is shown in Figure 19. Since all language modeling techniques used here at the same word or tag level produce the same outcome, we show only interpolated Kneser-Ney smoothing of order 5 (IKN5) for brevity. On average, precision increases from 45.59% to 61.61% when applying IKN5-w, to 64.46% when applying IKN5-t, and up to 64.64% when applying IKN5-tx. A pairwise Student’s t-test on the two language models IKN5-t and IKN5-tx produces p = 0.357, implying that there is no statistically significant improvement from unifying the POS tags. Therefore, it may be unnecessary to unify the POS tags in such cases. In addition, the standard deviations are so large in all cases that there is still room for a deeper study.

Figure 19: The result of using language models to discriminate the case of evaluative adverbs and rhetorical questions. Since all IKNn-w runs produce the same outcome, only IKN5-w is shown in this figure; the same for IKN5-t and IKN5-tx. In the “Before” cases, average precision = 45.59% and standard deviation = 2.29. In the IKN5-w cases, average precision = 61.61% and standard deviation = 3.15. In the IKN5-t cases, average precision = 64.46% and standard deviation = 2.11. In the IKN5-tx cases, average precision = 64.64% and standard deviation = 2.27. A pairwise Student’s t-test on IKN5-t and IKN5-tx produces p = 0.357, implying no statistically significant improvement.






CHAPTER VII

CONCLUDING REMARKS
