4.3 Meta-Explainability Modeling
4.3.2 Meta-Explainability and Entailment Relations
I was nearly charged with petty theft for pilfering coffee at the illustrious Hippodrome Building. But lest I be judged too quickly I must convey the sublimity of the fourth floor's coffee machine. Harry Houdini performed at the Hippodrome at 1120 Avenue of the Americas near 44th Street. Many of the best and most famous performers of the time appeared there. It was one of the biggest and most successful theaters of its time capable of accommodating 5200 people.
Hypothesis 1: Harry Houdini was a magician.
Attention Span 1: Harry Houdini performed at the Hippodrome
Relation 1: neutral
In the case in Table 4-9, the relation between the two is neutral, but through the key span one can [...]
Speaking after he discovered that he would not face criminal charges Mr Green disclosed that the officers who arrested him last November warned him that he could be given the longest possible sentence. "They said 'Do you realise that this offence could lead to life imprisonment?'" Mr Green told BBC Newsnight. I just thought this was absurd. "I assume it's because it's a common law offence therefore because there is no law on the statute book which I was alleged to have broken then presumably there is no set sentence for it."
Hypothesis 2: Mr. Green is the shadow immigration minister of the UK.
Attention Span 2: Mr Green disclosed that the officers who arrested him last November warned him
Model | Accurate | F1 | Span
3MTL@SD | | 1.0 | Harry Houdini performed at the Hippodrome
Single@SD | | 0.74 | Harry Houdini performed
Pretrained@AS | F | 0.40 | Harry Houdini performed at the Hippodrome at 1120 Avenue of the Americas near 44th Street. Many of the best and most famous performer
Single RTE@AS | F | 0.40 | Harry Houdini performed at the Hippodrome at 1120 Avenue of the Americas near 44th Street. Many of the best and most famous performer
2MTL@AS | F | 0.55 | Harry Houdini performed at the Hippodrome at 1120 Avenue of the Americas near 44th Street.
3MTL@AS | T | 0.74 | Harry Houdini performed
Relation 2: neutral
In Table 4-10, the relation between the two is neutral, but there is no specific span that can [...]
Model | Accurate | F1 | Span
3MTL@SD | | 0.85 | Mr Green disclosed that the officers who arrested him last November
Single@SD | | 0.66 | Green disclosed that the officers who arrested
Pretrained@AS | T | 0.51 | Mr Green disclosed that the officers who arrested him last November warned him that he could be given the longest possible sentence. "They said 'Do you realise that this offence could lead to life imprisonment?'" Mr Green
Single RTE@AS | F | 0.53 | Green disclosed that the officers who arrested him last November warned him that he could be given the longest possible sentence. "They said 'Do you realise that this offence could lead to life imprisonment?'
2MTL@AS | T | 0.55 | he discovered that he would not face criminal charges Mr Green disclosed that the officers who arrested him
3MTL@AS | T | 0.61 | he would not face criminal charges Mr Green disclosed that the officers who arrested him
Rain is pelting down on Dona Porcela's treatment room in Puerto Cabezas the main town on Nicaragua's Northern Caribbean coast. The room is barren except for a few plastic chairs a wooden table and some old plastic bottles balanced precariously on timber beams. Dona Porcela is a respected traditional healer here and the bottles are filled with her secret medicinal potions. Her patient today is a teenage girl asleep on a piece of cardboard serving as a mattress on the dirt floor.
Hypothesis 3: Dona Porcela is a healer.
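Attention Span 3: Dona Porcela is a respected traditional healer
Relation 3: entail
Table 4-11 Case analysis of model meta-explainability, case 3
Table 4-11 is an entail case. In this case, the methods that extract spans from attention scores perform better. Their entailment classification results are also more consistent, but [...]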
Model | Accurate | F1 | Span
3MTL@SD | | 0.71 | Porcela is a respected traditional healer here and the bottles
Single@SD | | 0.83 | Porcela is a respected traditional healer here
Pretrained@AS | F | 0.83 | Porcela is a respected traditional healer here
Single RTE@AS | T | 0.90 | Porcela is a respected traditional healer
2MTL@AS | T | 1.0 | Dona Porcela is a respected traditional healer
3MTL@AS | T | 1.0 | Dona Porcela is a respected traditional healer
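The F1 column scores how well an extracted span overlaps the annotated attention span. The sketch below shows one common way to compute such a score, a token-overlap F1 over whitespace tokens. It is an illustration only: the function name `span_f1` is hypothetical, and the tokenization and exact F1 definition used for the table are not shown in this section, so the values it produces need not match the reported ones.

```python
from collections import Counter


def span_f1(pred: str, gold: str) -> float:
    """Token-overlap F1 between a predicted span and a gold span.

    Assumption: whitespace tokenization; the thesis may tokenize differently.
    """
    pred_tokens, gold_tokens = pred.split(), gold.split()
    # Multiset intersection counts each shared token at most as often
    # as it appears in both spans.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


gold = "Dona Porcela is a respected traditional healer"
exact = span_f1("Dona Porcela is a respected traditional healer", gold)
shorter = span_f1("Porcela is a respected traditional healer", gold)
```

An exact match scores 1.0, while a prediction that drops a gold token keeps full precision but loses recall, which is the pattern visible in the rows above.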
SNLI Dataset (Dev Accuracy %)
Model | 0.1 % | 1 % | 10 % | 100 %
#Training Data | 549 | 5,493 | 54,936 | 549,367
SNLI | 65.0 | 80.9 | 87.2 | 90.7
SNLI@Explainable | 69.1 | 82.9 | 87.7 | 91.0
National Chengchi University (國立政治大學)
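We randomly sampled 100 %, 10 %, 1 %, and 0.1 % of the original training examples and trained with five different random seeds. This yields five results for each of the two training methods on 549, 5,493, 54,936, and 549,367 examples. Table 4-12 shows that on the full SNLI dataset, adding semantic phenomenon classification improves accuracy by 0.3 points over natural language inference alone. As the training set and the overall accuracy shrink, the gap widens to 0.5, 2.0, and 4.1 points. This suggests that knowledge from the semantic phenomenon classification task helps the main task on other natural language inference datasets to some degree. To test whether the differences are significant, we again use McNemar's test: at 100 % of the data, t = 6.43, p < 0.05; at 10 %, t = 21.82, p < 0.05; at 1 %, t = 7.40, p < 0.05; and at 0.1 %, t = 131.65, p < 0.05.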
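To make the significance test concrete, here is a minimal sketch of McNemar's test on paired predictions, using the continuity-corrected chi-square form with one degree of freedom. The function name `mcnemar` and the disagreement counts `b` and `c` are hypothetical; the section does not give the per-example counts or the exact variant of the test, so this does not reproduce the reported statistics. The accuracy gaps, by contrast, are recomputed from the table values.

```python
import math


def mcnemar(b: int, c: int) -> tuple[float, float]:
    """Continuity-corrected McNemar test.

    b: examples model A classifies correctly and model B incorrectly
    c: examples model A classifies incorrectly and model B correctly
    Returns (statistic, p_value).
    """
    if b + c == 0:
        return 0.0, 1.0
    stat = (abs(b - c) - 1) ** 2 / (b + c)
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Hypothetical disagreement counts, for illustration only.
stat, p_value = mcnemar(b=40, c=70)

# Accuracy gaps (SNLI@Explainable minus SNLI) from the table, 0.1 % to 100 %.
snli = [65.0, 80.9, 87.2, 90.7]
explainable = [69.1, 82.9, 87.7, 91.0]
gaps = [round(e - s, 1) for s, e in zip(snli, explainable)]
```

Only the examples on which the two models disagree enter the statistic, which is why the test suits comparisons of two classifiers on the same evaluation set.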
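4.4 Human Cognition and Model Explainability
This section first reports the results of having multiple annotators perform the attention-span annotation described in Section 3.5.1 and of evaluating explainability against those annotations. It then describes the experiment from Section 3.5.4, which presents the spans extracted by each method in a competition-style setting to assess how participants explain and how much they trust the models.
4.4.1 Similarities and Differences Between Human Explanation Mechanisms and the Model
This subsection reports the evaluation results for the annotation data we collected from MTurk. Of the 90 annotation tasks (HITs), 13 annotators each completed 2 HITs, so the 90 HITs came from 90 - 13 = 77 distinct crowd annotators. In addition to the original expert annotations, we therefore have data on how 77 crowd annotators explain their entailment judgments.
While collecting the data, we found clear differences in how annotators reason. The two examples below show the spans marked by four different annotators among the 77: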
"I'm not ruling out staying active in causes and issues but at this point I'm looking for a new line of work" Blagojevich said. "I'm trying to pursue a way to earn a living and do some interesting things. This is a chance to do something new and entertaining." Blagojevich first compared himself to Arnold Schwarzenegger whom he described as "my favorite governor" — well now that he's no longer governor of Illinois. "(I'm pursuing) the reverse career path of Arnold's" he said.
Later in explaining his desire to join "I'm A Celebrity" he also invoked the 26th president of the United States Theodore Roosevelt.
Hypothesis: Blagojevich is the ex-governor of Illinois.
Relation: entailment
Annotator 1: Blagojevich first compared himself to Arnold Schwarzenegger whom he described as "my favorite governor" — well now that he's no longer governor of Illinois
Annotator 2: I'm trying to pursue a way to earn a living and do some interesting things
Annotator 3: Blagojevich first compared himself to Arnold Schwarzenegger whom he described as "my favorite governor" — well now that he's no longer governor of Illinois.
Annotator 4: This is a chance to do something new and entertaining."
Blagojevich first compared himself to Arnold Schwarzenegger whom he described as "my favorite governor
The alleged mastermind behind the London bombings was reported captured in Cairo, Egypt last week. Police believe that a U.S. trained chemist Magdi Asdi Nashar, 33 helped build the bombs that killed over 50 people. Mr. el-Nashar, who has a PhD from Leeds University, left England two weeks before the bombings. After the London bombings British authorities initiated a worldwide manhunt that found him in Cairo. State security officials reported they have begun questioning el-Nashar with British agents in attendance.
Hypothesis: Cairo is situated in Egypt.
Relation: entailment
Annotator 1: Cairo, Egypt
Annotator 2: The alleged mastermind behind the London bombings was reported captured in Cairo Egypt last week
Annotator 3: The alleged mastermind behind the London bombings was reported captured in Cairo Egypt last week.
Annotator 4: reported captured in Cairo Egypt last week.
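In the example above, annotator 1 tries to infer the relation between Cairo and Egypt from just two words, whereas annotators 2 and 3 prefer complete sentences that carry more contextual information. Annotator 4 selects a span whose length falls between the two. All four annotators nevertheless include "Cairo Egypt", the key information that determines the relation.
Below we describe how the explainability metrics change under our method on the two datasets (crowd and expert).
Table 4-13 shows that the length of the annotated span varies with the annotator's reasoning style, and that span length affects the metric values. Despite this effect, the expert and crowd datasets show similar trends.
The dataset used to add the third task, span detection, follows the reasoning style of the expert dataset, whose spans are shorter. We therefore hypothesize that the model prefers short, key information. Figure 4-1 shows that the expert and crowd datasets follow similar trends, but the expert-annotated test set benefits more from the knowledge in the training data: after the span detection task is added, performance improves more markedly, whereas the gain on long, rich explanations is comparatively small. This supports the view that the explanation style annotators prefer or hold during span annotation also shapes the way the model explains.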
[Figure 4-1: Expert and crowd datasets. Series: Expert, Crowd. Categories: Pretrained, 1 Task, 2 Tasks, 3 Tasks. Axis ticks: 0.38 to 0.58.]
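[...] the data annotated in the crowd dataset, compared with the expert dataset, make it harder to identify a pattern between the entailment judgment and whether the important information is attended to.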
Entailment - Crowd
Model | Accuracy (%) | MAX@F1 | SUM@F1 | MAX@MRR | SUM@MRR
Overview
Pretrained | 20.0 | 0.4066 | 0.4274 | 0.1197 | 0.1251
RTE | 71.1 | 0.4199 | 0.4323 | 0.1214 | 0.1252
RTE+SP | 73.4 | 0.4153 | 0.4373 | 0.1211 | 0.1249
RTE+SP+SD | 75.1 | 0.4333 | 0.4540 | 0.1205 | 0.1272
Correct
Pretrained | | 0.4061 | 0.4285 | 0.1169 | 0.1258
RTE | | 0.4155 | 0.4273 | 0.1212 | 0.1249
RTE+SP | | 0.4211 | 0.4446 | 0.1236 | 0.1277
RTE+SP+SD | | 0.4342 | 0.4533 | 0.1231 | 0.1296
Incorrect
Pretrained | | 0.4068 | 0.4272 | 0.1204 | 0.1249
RTE | | 0.4305 | 0.4448 | 0.1218 | 0.1259
RTE+SP | | 0.3994 | 0.4176 | 0.1143 | 0.1174
RTE+SP+SD | | 0.4308 | 0.4564 | 0.1129 | 0.1200
Table 4-14 Explainability evaluation on the crowd dataset for the entailment relation
Neutral - Crowd
Model | Accuracy (%) | MAX@F1 | SUM@F1 | MAX@MRR | SUM@MRR
Overview
Pretrained | 37.0 | 0.3420 | 0.3578 | 0.1032 | 0.1103
RTE | 58.1 | 0.3517 | 0.3591 | 0.1052 | 0.1085
RTE+SP | 44.3 | 0.3464 | 0.3317 | 0.1045 | 0.1096
RTE+SP+SD | 70.8 | 0.3653 | 0.3840 | 0.1040 | 0.1106
Correct
Pretrained | | 0.3513 | 0.3580 | 0.1062 | 0.1155
RTE | | 0.3280 | 0.3396 | 0.1051 | 0.1081
RTE+SP | | 0.3144 | 0.3854 | 0.0949 | 0.1002
RTE+SP+SD | | 0.3566 | 0.3736 | 0.1027 | 0.1089
Incorrect
Pretrained | | 0.3366 | 0.3577 | 0.1014 | 0.1079
RTE | | 0.3839 | 0.3855 | 0.1054 | 0.1090
RTE+SP | | 0.3712 | 0.4053 | 0.1120 | 0.1169
RTE+SP+SD | | 0.3860 | 0.4089 | 0.1073 | 0.1148
Table 4-15 Explainability evaluation on the crowd dataset for the neutral relation
Contradiction - Crowd
Model | Accuracy (%) | MAX@F1 | SUM@F1 | MAX@MRR | SUM@MRR
Pretrained | | 0.3798 | 0.3989 | 0.1140 | 0.1185
RTE | | 0.3831 | 0.3900 | 0.1116 | 0.1217
RTE+SP | | 0.3722 | 0.3776 | 0.1054 | 0.1144
RTE+SP+SD | | 0.4173 | 0.4286 | 0.1127 | 0.1199
Incorrect
Pretrained | | 0.3742 | 0.3941 | 0.1104 | 0.1198
RTE | | 0.3905 | 0.4003 | 0.1129 | 0.1161
Metric/Accuracy Correlation (SUM@F1)
Model | Accuracy (%) | Correct | Incorrect | Correlation
Entailment
[...] a span for the relation; Pretrained@AS likewise gives a span from which the relation can be inferred, but it also includes a large amount of redundant, contextual information.
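- Contradiction: same characteristics as entailment.
In the model trust experiment, we want participants to correctly recognize the characteristics of the three explanation methods (that is, which robot an explanation comes from) and then, according to their preference among the explanation methods, choose the one they consider more trustworthy.
We collected answers from 100 participants, each of whom answered 9 questions that required telling 3 models apart, for 100 × 9 × 3 = 2,700 answers in total. Figure 4-2 shows the resulting confusion matrix. The diagonal accounts for 74.14 %: of the 2,700 answers, 2,002 (74.14 %) explanations were correctly attributed to the explanation method that produced them.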
Figure 4-2 Confusion matrix of participants' answers when identifying machine explanations
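The diagonal share quoted above is the fraction of answers that fall on the main diagonal of the confusion matrix. The sketch below shows the computation on a small toy matrix. The matrix values and the function name `diagonal_share` are hypothetical, since the cell counts of Figure 4-2 are not reproduced in this text; only the totals 2,002 and 2,700 come from the section.

```python
def diagonal_share(matrix: list[list[int]]) -> float:
    """Fraction of all answers that lie on the main diagonal.

    Rows are the true explanation method, columns the method the
    participant chose; diagonal cells are correct identifications.
    """
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total


# Toy 3x3 matrix, for illustration only (not the data behind Figure 4-2).
toy = [
    [8, 1, 1],
    [2, 7, 1],
    [1, 2, 7],
]
toy_share = diagonal_share(toy)

# Totals stated in the text: 100 participants x 9 questions x 3 explanations.
total_answers = 100 * 9 * 3
reported_share = 2002 / total_answers
```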
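To determine whether participants can correctly identify the characteristics of the explanation methods, we first compute the expected distribution of the number of participants at each number of correct answers under random guessing. We compute this with Equation 16. In our trust experiment, each participant answers 9 questions that ask which explanation method an explanation comes from [...]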
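As a rough illustration of a chance baseline of this kind, the sketch below computes the expected number of participants at each score if every question were guessed independently with success probability 1/3. This is an assumption made for illustration: Equation 16 is not reproduced in this section and may model chance differently (for example, if the three explanations in a question must be matched to three distinct robots). The function name and default parameters are hypothetical.

```python
from math import comb


def chance_distribution(n_questions: int = 9,
                        p_correct: float = 1 / 3,
                        n_participants: int = 100) -> list[float]:
    """Expected number of participants with k correct answers, k = 0..n.

    Assumes independent guesses with a fixed success probability,
    i.e. a Binomial(n_questions, p_correct) score per participant.
    """
    return [
        n_participants
        * comb(n_questions, k)
        * p_correct ** k
        * (1 - p_correct) ** (n_questions - k)
        for k in range(n_questions + 1)
    ]


expected = chance_distribution()
```

Comparing the observed histogram of scores with such a baseline indicates whether participants identify the methods better than guessing would.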
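[...] Table 4-18 show that, for the three explanation methods 3MTL@SD, 3MTL@AS, and Pretrained@AS, the amount wagered, that is, the trust placed in the model, shows a significant [...] as the rounds progress.
One-way MANOVA
Multivariate Test
Univariate Test
Source | Measure | F | Sig. | Post-hoc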
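[...] is Pretrained@AS (37), and the least trusted method is 3MTL@AS (12). 3MTL@AS changes little from the first round to the fourth, but over the four rounds of wagering we observe that, although [...]
[Figure: Change in human trust. x-axis: Round (Round1, Round2, Round3, Round4); y-axis: Trust level; series: 3MTL@SD, 3MTL@AS, Pretrained@AS.]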
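[...] (12.66) is significantly lower than the other two. 3MTL@SD (49.39) is higher than Pretrained@AS (37.95), a difference that approaches significance (p = .052).
Table 4-19 Round1 One-way ANOVA
Table 4-20 shows that in the second round the explanation method (F(2, 156) = 25.736, p < .001) still has a significant effect on trust, but this effect comes mainly from the difference between 3MTL@AS (14.04) and the other two. The difference between 3MTL@SD (48.16) and Pretrained@AS (37.08) is less significant than in the first round (p = .092).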
Round1 One-way ANOVA
Method | N | Mean | Std. Deviation
1: 3MTL@SD | 79 | 49.39 | 28.54
2: 3MTL@AS | 79 | 12.66 | 14.86
3: Pretrained@AS | 79 | 37.95 | 24.88
Test of Within-Subjects Effect
Source | df | F | Sig.
Method | 2 | 33.732 | .000
Error (Method) | 156 | |
Pairwise Comparisons
Method (I) | Method (J) | Mean Difference | Std. Error | Sig.
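The degrees of freedom reported above are consistent with a one-way repeated-measures design: with 79 participants and 3 methods, the Method effect has 3 - 1 = 2 degrees of freedom and the error term has (79 - 1)(3 - 1) = 156. The sketch below computes the F statistic for such a design from raw scores. It is an illustration under that assumption, not the analysis script behind the table; the function name `rm_anova` and the toy data are hypothetical.

```python
def rm_anova(data: list[list[float]]) -> tuple[float, int, int]:
    """One-way repeated-measures ANOVA.

    data: one row per participant, one column per condition.
    Returns (F, df_condition, df_error).
    """
    n, k = len(data), len(data[0])
    grand = sum(sum(row) for row in data) / (n * k)
    cond_means = [sum(row[j] for row in data) / n for j in range(k)]
    subj_means = [sum(row) / k for row in data]

    ss_cond = n * sum((m - grand) ** 2 for m in cond_means)
    ss_subj = k * sum((m - grand) ** 2 for m in subj_means)
    ss_total = sum((x - grand) ** 2 for row in data for x in row)
    # Error is what remains after removing condition and participant effects.
    ss_error = ss_total - ss_cond - ss_subj

    df_cond, df_error = k - 1, (n - 1) * (k - 1)
    f_stat = (ss_cond / df_cond) / (ss_error / df_error)
    return f_stat, df_cond, df_error


# Toy data: 3 participants x 3 conditions, for illustration only.
f_stat, df_cond, df_error = rm_anova([[1, 2, 3], [2, 3, 5], [3, 4, 7]])

# Degrees of freedom for the design in the table: 79 participants, 3 methods.
table_df = (3 - 1, (79 - 1) * (3 - 1))
```

Removing the between-participant variance from the error term is what distinguishes this design from a between-subjects one-way ANOVA, and it matters here because each participant wagers on all three methods.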
The difference between 3MTL@SD (42.23) and Pretrained@AS (43.54) has essentially disappeared.
Round2 One-way ANOVA
Method | N | Mean | Std. Deviation
1: 3MTL@SD | 79 | 48.16 | 28.63
2: 3MTL@AS | 79 | 14.04 | 16.72
3: Pretrained@AS | 79 | 37.08 | 27.92
Test of Within-Subjects Effect
Source | df | F | Sig.
Method | 2 | 25.736 | .000
Error (Method) | 156 | |
Pairwise Comparisons
Method (I) | Method (J) | Mean Difference | Std. Error | Sig.
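In the fourth round (Table 4-22), the wagers on 3MTL@SD (39.81) and Pretrained@AS (49.23) do not differ significantly (p = .119), but a clear reversal has emerged. 3MTL@AS remains the least trusted: its mean (10.96) is still significantly lower than those of the other two models.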
Round3 One-way ANOVA
Method | N | Mean | Std. Deviation
1: 3MTL@SD | 79 | 42.23 | 29.33
2: 3MTL@AS | 79 | 14.23 | 17.94
3: Pretrained@AS | 79 | 43.54 | 29.24
Test of Within-Subjects Effect
Source | df | F | Sig.
Method | 2 | 21.258 | .000
Error (Method) | 156 | |
Pairwise Comparisons
Method (I) | Method (J) | Mean Difference | Std. Error | Sig.
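Pretrained@AS gains more trust from the first round to the fourth (+11.27, p < .001), while 3MTL@AS changes little (-1.69, p = .291). Thus, over the course of the experiment, 3MTL@SD gradually loses trust and Pretrained@AS gradually gains it.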
Round4 One-way ANOVA
Method | N | Mean | Std. Deviation
1: 3MTL@SD | 79 | 39.81 | 26.78
2: 3MTL@AS | 79 | 10.96 | 12.60
3: Pretrained@AS | 79 | 49.23 | 27.73
Test of Within-Subjects Effect
Source | df | F | Sig.
Method | 2 | 38.174 | .000
Error (Method) | 156 | |
Pairwise Comparisons
Method (I) | Method (J) | Mean Difference | Std. Error | Sig.
I was nearly charged with petty theft for pilfering coffee at the illustrious Hippodrome Building. But lest I be judged too quickly I must convey the sublimity of the fourth floor's coffee machine. Harry Houdini performed at the Hippodrome at 1120 Avenue of the Americas near 44th Street. Many of the best and most famous performers of the time appeared there. It was one of the biggest and most successful theaters of its time capable of accommodating 5200 people.
Hypothesis: Harry Houdini was a magician.
Relation: neutral
3MTL@SD: " Harry Houdini performed at the Hippodrome "
3MTL@AS: " Harry Houdini performed "
Pretrained@AS: " Harry Houdini performed at the Hippodrome at 1120 Avenue of the Americas near 44th Street. Many of the best and most famous performers "
Wager amount, Round 4 minus Round 1
Method | Difference | t | df | Sig.
3MTL@SD | -9.58 | 3.33 | 78 | .001
Pretrained@AS | +11.27 | -3.70 | 78 | .000
3MTL@AS | -1.69 | 1.06 | 78 | .291
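The round-to-round comparison is a paired design: the same 79 participants wager in Round 1 and in Round 4, which gives 79 - 1 = 78 degrees of freedom. The sketch below recomputes the mean differences from the reported round means and shows a paired t statistic on toy data. The function name `paired_t` and the toy wagers are hypothetical, and the per-participant wagers are not available here, so the reported t values are not reproduced. Small discrepancies such as 11.28 versus 11.27 would be expected if the table was computed from unrounded means.

```python
import math


def paired_t(first: list[float], last: list[float]) -> tuple[float, float, int]:
    """Paired t-test on last - first. Returns (mean_diff, t, df)."""
    diffs = [b - a for a, b in zip(first, last)]
    n = len(diffs)
    mean_diff = sum(diffs) / n
    # Sample variance of the differences (n - 1 in the denominator).
    var = sum((d - mean_diff) ** 2 for d in diffs) / (n - 1)
    t_stat = mean_diff / math.sqrt(var / n)
    return mean_diff, t_stat, n - 1


# Round means reported in the Round1 and Round4 tables.
round1 = {"3MTL@SD": 49.39, "3MTL@AS": 12.66, "Pretrained@AS": 37.95}
round4 = {"3MTL@SD": 39.81, "3MTL@AS": 10.96, "Pretrained@AS": 49.23}
mean_change = {m: round(round4[m] - round1[m], 2) for m in round1}

# Toy wagers for four participants, for illustration only.
toy_mean, toy_t, toy_df = paired_t([10, 20, 30, 40], [12, 25, 31, 46])
```

In this sketch the sign of t follows the sign of the Round 4 minus Round 1 difference. The reported t values carry the opposite sign to their differences, which suggests they were computed as Round 1 minus Round 4.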