5.2 QRW Classification and Exploration

5.2.9 Roles

Thematic roles and discourse functions also appear to determine whether the grammatical person has a notable influence. For instance, [7] suggested that if the subject of a main clause is one of the second person pronouns, the sentence tends to be a real interrogative; if it is a first person pronoun, the sentence tends to be non-interrogative. To see whether there really is such a tendency for the second person pronouns, we investigate in the Sinica treebank which roles are played more frequently by which pronouns.

Detailed description of the role scheme adopted in the treebank can be found in [31].

Since the Sinica treebank removes all punctuation (as mentioned in Section 4.2), our investigation is performed in three stages. The first stage is to find all trees containing the 4 second person pronouns using the keyword-based search on the Sinica treebank Web site; a total of 1,227 trees are found. The trees look like the following:

(25) S(evaluation:Dbb: ˝§|agent:NP(Head:Nhaa: 5)|

epistemics:Dbaa: u|reason:Dj: 5ó|

Head:VD1: }º|particle:Ta: í)

The second stage is to extract the semantic roles associated with each pronoun. To simplify the task of tree parsing, we issue the pattern “Head:Nh%” with the “process again” and then the “filtering” command, so that the semantic role is highlighted in red, as in the following HTML code:

(26) S(evaluation:Dbb: ˝§|

<font color="#FF0000">agent:NP(Head:Nhaa: 5)</font>|

epistemics:Dbaa: u|reason:Dj: 5ó|

Head:VD1: }º|particle:Ta: í)
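The role lookup in this stage can also be sketched offline. The following Python fragment is only an illustration, not the Web site's own procedure: it assumes the bracketed tree has been copied as a plain string, the function name `pronoun_roles` is hypothetical, and the word forms are replaced by placeholder letters. It walks the brackets and reports the role label of the phrase that directly contains a head whose part of speech starts with “Nh”.

```python
import re
from typing import List, Optional


def pronoun_roles(tree: str) -> List[Optional[str]]:
    """Role label of every phrase whose head has a POS tag starting with 'Nh'."""
    roles: List[Optional[str]] = []
    stack: List[Optional[str]] = []  # role labels of the currently open phrases
    # Tokens of interest: "label(" opens a phrase, ")" closes one,
    # and "Head:Nh" marks a pronoun head inside the innermost open phrase.
    for match in re.finditer(r"([^()|]*)\(|\)|Head:Nh", tree):
        token = match.group(0)
        if token == ")":
            stack.pop()
        elif token == "Head:Nh":
            roles.append(stack[-1] if stack else None)
        else:
            label = match.group(1).strip()
            # "agent:NP" -> role "agent"; a root such as "S" carries no role
            stack.append(label.split(":")[0] if ":" in label else None)
    return roles


# Placeholder trees shaped like the treebank output (word forms are made up).
simple_tree = "S(evaluation:Dbb:w1|agent:NP(Head:Nhaa:w2)|Head:VD1:w3|particle:Ta:w4)"
nested_tree = "VP(Head:VL4:w1|goal:NP(predication:VP(head:VP(Head:VC2:w2)|Head:DE:w3)|Head:Nhaa:w4))"

print(pronoun_roles(simple_tree))
print(pronoun_roles(nested_tree))
```

Because the role is taken from the innermost open phrase, a pronoun head that follows a long embedded modifier is still attributed to the enclosing noun phrase.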

Even if some complicated trees are not explicitly highlighted in the same way, for example,

(27) VP(Head:VL4: é|

goal:NP(predication:VP *í (head:VP(quantity:Daa: É|

Head:V 2: |range:NP(property:A: GT|Head:Nab: W†))|

Head:DE: í)|Head:Nhaa: 5))

the beginning of the result page still tells us that the role is “goal.”

Table 11: Distribution of 2nd person pronouns and their respective semantic roles in the Sinica treebank. The “Q” columns list the numbers of question clauses for the corresponding pronoun/role pairs, while the “¬Q” columns list the numbers of non-question clauses.

5 5b g ]

The last stage is to trace these trees back to their origins in the Sinica corpus, as presented in Figure 4. In total, 1,178 trees are successfully backtracked (success rate = 96%) and have their punctuation assigned accordingly. The remaining 4% of the trees are discarded for the sake of objectivity. Finally, the distribution of the 4 second person pronouns and their respective semantic roles is listed in Table 11.

Since the sample size of the pronoun “5” is quite large, it is safe to perform a χ² test on its Q and ¬Q columns. The p-value is 1.752 × 10⁻⁶ ≪ 0.01, indicating that in the “5” case there is a statistically significant relationship between the semantic roles and questions. On closer inspection, the largest component of χ² is for the “Experiencer-Q” cell (16.996; see Figure 10), i.e., this combination contributes the most to the overall distance χ². Even if we regard the “experiencer” row as exceptional (an outlier or contaminated), redoing the χ² test on the same table without the “experiencer” row produces the p-value = 0.0219 < 0.05, still indicating significant evidence. Therefore, it is safe to conclude that the different roles of “5” do have a notable influence on predicting questions.
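The full-table test can be re-derived from the observed counts printed in Figure 10. The sketch below is not the Minitab procedure itself: it is a plain Python re-computation of Pearson's χ² statistic, with hypothetical function names, and it uses the closed-form upper tail of the χ² distribution that holds only for an even number of degrees of freedom (here df = 4).

```python
import math

# Observed counts from Figure 10: rows are Agent, Experiencer, Goal, Others,
# Theme; columns are Q and not-Q.
ROLES = ["Agent", "Experiencer", "Goal", "Others", "Theme"]
OBSERVED = [(52, 196), (32, 55), (12, 105), (18, 121), (17, 119)]


def chi2_upper_tail_even_df(x, df):
    """P(X > x) for a chi-square variable; closed form valid for even df only."""
    if df % 2 != 0:
        raise ValueError("this closed form only covers even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    for k in range(1, df // 2):
        term *= half / k
        total += term
    return math.exp(-half) * total


def chi2_independence(table):
    """Pearson chi-square statistic, df, and per-cell contributions."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    contributions = [
        [(obs - rt * ct / n) ** 2 / (rt * ct / n) for obs, ct in zip(row, col_totals)]
        for row, rt in zip(table, row_totals)
    ]
    chi2 = sum(sum(row) for row in contributions)
    df = (len(table) - 1) * (len(col_totals) - 1)
    return chi2, df, contributions


chi2, df, contributions = chi2_independence(OBSERVED)
p_value = chi2_upper_tail_even_df(chi2, df)
largest = max(
    (value, ROLES[i], ("Q", "not Q")[j])
    for i, row in enumerate(contributions)
    for j, value in enumerate(row)
)
print(f"chi2 = {chi2:.3f}, df = {df}, p = {p_value:.3e}")
print("largest contribution:", largest)
```

Run on these counts, the statistic and the p-value agree with the values quoted above (χ² ≈ 32.187, p ≈ 1.752 × 10⁻⁶), and the largest contribution comes from the Experiencer/Q cell.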

How about the other 3 pronouns? The raw counts are too small for the same χ² test, but we can instead calculate the Q/(Q + ¬Q) ratio (see Table 11 for the statistics and Figure 12 for the boxplot) and then perform an ANOVA test on the ratios. The one-way ANOVA test on the four Q/(Q + ¬Q) columns generates the p-value = 0.31353,

Chi-Square Test: Q, not Q

Expected counts are printed below observed counts
Chi-Square contributions are printed below expected counts

            Q   not Q   Total
    1      52     196     248
        44.69  203.31
        1.196   0.263

    2      32      55      87
        15.68   71.32
       16.996   3.736

    3      12     105     117
        21.08   95.92
        3.913   0.860

    4      18     121     139
        25.05  113.95
        1.983   0.436

    5      17     119     136
        24.51  111.49
        2.299   0.505

Total     131     596     727

Chi-Sq = 32.187, DF = 4, P-Value = 0.000

Figure 10: Minitab χ² output for comparing the 5 roles of the 2nd person pronoun “5” for Table 11. The five rows are for Agent, Experiencer, Goal, Others, and Theme, respectively. The p-value is given as 0.000 here because the Minitab software rounds it to 3 decimal places, implying that p < 0.0005.

One-way ANOVA: Ratio versus Role

Source   DF      SS      MS     F      P
Role      4  0.0582  0.0145  1.30  0.314
Error    15  0.1675  0.0112
Total    19  0.2256

S = 0.1057   R-Sq = 25.78%   R-Sq(adj) = 5.99%

Level         N    Mean   StDev
Agent         4  0.2451  0.0780
Experiencer   4  0.2197  0.1765
Goal          4  0.1031  0.1208
Others        4  0.1254  0.0185
Theme         4  0.1702  0.0605

Pooled StDev = 0.1057

Figure 11: Minitab ANOVA output for comparing the 5 roles of all 2nd person pronouns for Table 11

implying no significant effect as a whole (see Figure 11). On closer inspection of the result, however, the standard deviation of the “experiencer” group is larger than that of every other role. Figure 12 illustrates the same phenomenon, since the range (or the “spread”) of experiencer is larger than that of all the others. The large variance increases the overall within-group sum of squares (WSS), and hence the mean square for error (MSE), which in turn decreases the ANOVA statistic, since

$$F_{4,15} = \frac{\text{mean square for group}}{\text{mean square for error}} = \frac{\text{between-group SS}/4}{\text{within-group SS}/15},$$

where 4 and 15 are the respective degrees of freedom. Again, if we regard the “experiencer” group as exceptional (an outlier or contaminated), redoing the ANOVA test on the same table without the “experiencer” group produces the p-value = 0.105. This is still above the conventional 0.05 level, so the evidence is at best suggestive and far less remarkable than in the sole “5” case.
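The dependence of F on the within-group spread can be checked numerically. The sketch below first recomputes F from the sums of squares printed in Figure 11, and then applies a hand-written one-way ANOVA to two made-up data sets (the per-pronoun ratios are not reproduced here, so `tight` and `spread` are purely hypothetical) whose group means are identical but in which one group is much more dispersed.

```python
def one_way_anova_f(groups):
    """F statistic of a one-way ANOVA: (between SS / df) / (within SS / df)."""
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


# F recomputed from the sums of squares reported in Figure 11.
f_from_figure = (0.0582 / 4) / (0.1675 / 15)

# Hypothetical ratios: same group means, but the first group of `spread`
# is far more dispersed than the first group of `tight`.
tight = [[0.24, 0.25, 0.26, 0.25], [0.10, 0.11, 0.09, 0.10], [0.17, 0.18, 0.16, 0.17]]
spread = [[0.05, 0.45, 0.15, 0.35], [0.10, 0.11, 0.09, 0.10], [0.17, 0.18, 0.16, 0.17]]

print(round(f_from_figure, 2))
print(one_way_anova_f(tight), one_way_anova_f(spread))
```

The between-group sum of squares is the same for both hypothetical data sets, so the drop in F for `spread` comes entirely from the larger within-group sum of squares, which is the mechanism described above.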

In conclusion, different roles of the second person pronouns (especially “5”) have some influence on predicting questions.

[Boxplot of Ratio by Role. Horizontal axis: Role (Agent, Experiencer, Goal, Others, Theme); vertical axis: Ratio, from 0.0 to 0.4.]

Figure 12: Boxplot of the Q/(Q + ¬Q) ratio for the different roles played by the 4 second person pronouns. The rectangle (“box”) spans the inter-quartile range, from the first quartile through the median to the third quartile, and the whiskers extend to the maximum and minimum values, since no data point lies more than 1.5 inter-quartile ranges beyond the box to be considered an outlier here. The circle shows the mean of all data.

