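Similarities and Differences Between Human Explanation Mechanisms and the Model - Adaptability to the SNLI Dataset - Meta-Interpretability Modeling for Natural Language Inference (自然語言推理之後設可解釋性建模)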

4.3 Adaptability to the SNLI Dataset

國立政治大學 (National Chengchi University)

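We randomly sampled 100%, 10%, 1%, and 0.1% of the original training samples and trained with five different random seeds at each size. This gives five results for each of the two training methods on 549,367, 54,936, 5,493, and 549 samples, respectively. Table 4-12 shows that on the full SNLI dataset, adding semantic phenomenon classification outperforms natural language inference alone by 0.3. As the sample size and overall performance decrease, the gap widens to 0.5, 2.0, and 4.1. This indicates that on other natural language inference tasks, knowledge from the semantic phenomenon classification task helps the main task to some degree. To test whether the performance differences are significant, we again use McNemar's test: with 100% of the samples, t = 6.43, p < 0.05; with 10%, t = 21.82, p < 0.05; with 1%, t = 7.40, p < 0.05; and with 0.1%, t = 131.65, p < 0.05.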
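The significance test in the paragraph above can be illustrated with a minimal sketch. It assumes the continuity-corrected chi-square form of McNemar's test and per-example correctness flags for the two training methods; the thesis does not state which variant was used, and the function name `mcnemar_test` and the toy inputs are hypothetical.

```python
import math


def mcnemar_test(correct_a, correct_b, continuity=True):
    """Return (statistic, p_value) for McNemar's test on paired outcomes.

    correct_a / correct_b are equal-length sequences of booleans saying whether
    each test example was classified correctly by method A / method B.
    """
    if len(correct_a) != len(correct_b):
        raise ValueError("both methods must be evaluated on the same examples")

    # Discordant pairs: only examples where the two methods disagree matter.
    only_a = sum(1 for a, b in zip(correct_a, correct_b) if a and not b)
    only_b = sum(1 for a, b in zip(correct_a, correct_b) if b and not a)
    n = only_a + only_b
    if n == 0:
        return 0.0, 1.0  # the methods never disagree

    diff = abs(only_a - only_b)
    if continuity:
        diff = max(diff - 1, 0)  # Edwards' continuity correction (an assumption)
    statistic = diff ** 2 / n

    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Toy data (hypothetical): 10 examples only A gets right, 2 only B gets right.
a = [True] * 10 + [False] * 2 + [True] * 5
b = [False] * 10 + [True] * 2 + [True] * 5
stat, p = mcnemar_test(a, b)
print(f"statistic = {stat:.2f}, significant at 0.05: {p < 0.05}")
```

Only the discordant pairs enter the statistic, which is why the test suits comparing two models evaluated on the same test set.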

4.4 Human Cognition and Model Interpretability

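This section first reports the results of asking multiple annotators to perform the attention span annotation described in Section 3.5.1 and of evaluating interpretability against those annotations. It then describes the experiment from Section 3.5.4, in which spans extracted by the different methods are presented in a competition format to assess how participants explain their judgments and how much they trust the model.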

4.4.1 Similarities and Differences Between Human Explanation Mechanisms and the Model

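This subsection reports the evaluation results on the annotation data we collected from MTurk. Of the 90 annotation tasks (HITs), 13 annotators each completed 2 HITs, so in addition to the original expert annotations we have data from 77 crowd annotators on how they explain their entailment judgments.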

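While collecting the data, we found clear differences in the annotators' ways of thinking. Below are two examples, each showing the spans selected by 4 different annotators among the 77: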

‧

"I'm not ruling out staying active in causes and issues but at this point I'm looking for a new line of work" Blagojevich said. "I'm trying to pursue a way to earn a living and do some interesting things. This is a chance to do something new and entertaining." Blagojevich first compared himself to Arnold Schwarzenegger whom he described as "my favorite governor" — well now that he's no longer governor of Illinois. "(I'm pursuing) the reverse career path of Arnold's" he said.

Later in explaining his desire to join "I'm A Celebrity" he also invoked the 26th president of the United States Theodore Roosevelt.

Hypothesis: Blagojevich is the ex-governor of Illinois.

Relation: entailment

Annotator 1: Blagojevich first compared himself to Arnold Schwarzenegger whom he described as "my favorite governor" — well now that he's no longer governor of Illinois

Annotator 2: I'm trying to pursue a way to earn a living and do some interesting things

Annotator 3: Blagojevich first compared himself to Arnold Schwarzenegger whom he described as "my favorite governor" — well now that he's no longer governor of Illinois.

Annotator 4: This is a chance to do something new and entertaining." Blagojevich first compared himself to Arnold Schwarzenegger whom he described as "my favorite governor

‧

The alleged mastermind behind the London bombings was reported captured in Cairo, Egypt last week. Police believe that a U.S. trained chemist Magdi Asdi Nashar, 33 helped build the bombs that killed over 50 people. Mr. el-Nashar, who has a PhD from Leeds University, left England two weeks before the bombings. After the London bombings British authorities initiated a worldwide manhunt that found him in Cairo. State security officials reported they have begun questioning el-Nashar with British agents in attendance.

Hypothesis: Cairo is situated in Egypt.

Relation: entailment

Annotator 1: Cairo, Egypt

Annotator 2: The alleged mastermind behind the London bombings was reported captured in Cairo Egypt last week

Annotator 3: The alleged mastermind behind the London bombings was reported captured in Cairo Egypt last week.

Annotator 4: reported captured in Cairo Egypt last week.

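In the second example above, Annotator 1 tries to infer the relation between Cairo and Egypt from just two words, whereas Annotators 2 and 3 prefer to select a complete sentence that carries more contextual information. Annotator 4 selects a span whose length falls between the two. All four annotators nevertheless include "Cairo Egypt", the key information that determines the relation.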
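To make the effect of span length concrete, the annotators' spans in the Cairo example can be compared with a token-overlap F1. This is a minimal sketch of a generic token-level F1, not necessarily the MAX@F1 or SUM@F1 metrics reported in the tables, whose definitions lie outside this section; the helper names `tokens` and `span_f1` are hypothetical.

```python
import re
from collections import Counter


def tokens(text):
    """Lowercase the text and split it into word tokens, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())


def span_f1(predicted, reference):
    """Token-level F1 between two text spans, counting repeated tokens."""
    pred, ref = Counter(tokens(predicted)), Counter(tokens(reference))
    overlap = sum((pred & ref).values())  # multiset intersection size
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Spans copied from the Cairo example above.
spans = {
    "Annotator 1": "Cairo, Egypt",
    "Annotator 2": "The alleged mastermind behind the London bombings "
                   "was reported captured in Cairo Egypt last week",
    "Annotator 4": "reported captured in Cairo Egypt last week.",
}
for name, span in spans.items():
    # Score Annotator 1's short span against each annotator's span.
    print(name, round(span_f1(spans["Annotator 1"], span), 3))
```

Although every span contains the decisive words, the short span scores lower against longer references because recall falls as the reference grows, which is one way span length can move an overlap-based metric.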

Below we describe how the interpretability evaluation metrics change under our method on the two datasets (crowd and expert).

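Table 4-13 shows that the length of the annotated spans varies with the annotators' way of thinking, and that span length affects the values of the evaluation metrics. Even so, the expert and crowd datasets show similar trends.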

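The dataset used for the added third task, span identification, was annotated in the expert style, which favors shorter spans, so we hypothesize that the model develops a preference for short, key information. Figure 4-1 shows that the expert and crowd datasets follow similar trends, but the expert-annotated test set benefits more from the knowledge in the training set: after the span identification task is added, performance improves more markedly, whereas on long, rich explanations the improvement is comparatively small. This confirms that the explanation style annotators prefer or hold during span annotation also shapes how the model explains its decisions.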

[Figure 4-1: Expert and crowd datasets. Series: Expert, Crowd; models: Pretrained, 1 Task, 2 Tasks, 3 Tasks; axis ticks from 0.38 to 0.58]

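[…] compared with the expert dataset, the data annotated in the crowd dataset makes it harder to identify a pattern between the entailment judgment and whether the important information was attended to.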

Entailment - Crowd

Model        Accuracy (%)  MAX@F1  SUM@F1  MAX@MRR  SUM@MRR
Overview
Pretrained   20.0          0.4066  0.4274  0.1197   0.1251
RTE          71.1          0.4199  0.4323  0.1214   0.1252
RTE+SP       73.4          0.4153  0.4373  0.1211   0.1249
RTE+SP+SD    75.1          0.4333  0.4540  0.1205   0.1272
Correct
Pretrained   -             0.4061  0.4285  0.1169   0.1258
RTE          -             0.4155  0.4273  0.1212   0.1249
RTE+SP       -             0.4211  0.4446  0.1236   0.1277
RTE+SP+SD    -             0.4342  0.4533  0.1231   0.1296
Incorrect
Pretrained   -             0.4068  0.4272  0.1204   0.1249
RTE          -             0.4305  0.4448  0.1218   0.1259
RTE+SP       -             0.3994  0.4176  0.1143   0.1174
RTE+SP+SD    -             0.4308  0.4564  0.1129   0.1200

Table 4-14: Interpretability evaluation on the crowd dataset for examples whose relation is entailment

Neutral - Crowd

Model        Accuracy (%)  MAX@F1  SUM@F1  MAX@MRR  SUM@MRR
Overview
Pretrained   37.0          0.3420  0.3578  0.1032   0.1103
RTE          58.1          0.3517  0.3591  0.1052   0.1085
RTE+SP       44.3          0.3464  0.3317  0.1045   0.1096
RTE+SP+SD    70.8          0.3653  0.3840  0.1040   0.1106
Correct
Pretrained   -             0.3513  0.3580  0.1062   0.1155
RTE          -             0.3280  0.3396  0.1051   0.1081
RTE+SP       -             0.3144  0.3854  0.0949   0.1002
RTE+SP+SD    -             0.3566  0.3736  0.1027   0.1089
Incorrect
Pretrained   -             0.3366  0.3577  0.1014   0.1079
RTE          -             0.3839  0.3855  0.1054   0.1090
RTE+SP       -             0.3712  0.4053  0.1120   0.1169
RTE+SP+SD    -             0.3860  0.4089  0.1073   0.1148

Table 4-15: Interpretability evaluation on the crowd dataset for examples whose relation is neutral

Contradiction - Crowd

Model        Accuracy (%)  MAX@F1  SUM@F1  MAX@MRR  SUM@MRR
[…]
Pretrained   -             0.3798  0.3989  0.1140   0.1185
RTE          -             0.3831  0.3900  0.1116   0.1217
RTE+SP       -             0.3722  0.3776  0.1054   0.1144
RTE+SP+SD    -             0.4173  0.4286  0.1127   0.1199
Incorrect
Pretrained   -             0.3742  0.3941  0.1104   0.1198
RTE          -             0.3905  0.4003  0.1129   0.1161
[…]

Metric/Accuracy Correlation (SUM@F1)

Model        Accuracy (%)  Correct  Incorrect  Correlation
Entailment
[…]

Source: Meta-Interpretability Modeling for Natural Language Inference (自然語言推理之後設可解釋性建模), 政大學術集成 (NCCU Academic Hub), pp. 85-93
