
4.4 Result and Discussion

4.4.2 System Evaluation

According to the figures in Section 4.4.1, stMIL usually achieves the top performance among all algorithms and parameter settings, and its performance is less sensitive to parameter changes. Hence, we took stMIL as the training algorithm. Given stMIL, we set the parameters as follows: text+synt features, regularization C = 1, quadratic kernel, and bag size 10. The amount of testing data varies with the relation and the bag size, as listed in Table 4.8. At each iteration, we choose the top 50 positive pairs as the seeds of the next iteration.
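The configuration above can be summarized as a short sketch. The function and variable names here (select_seed_pairs, config) are illustrative only, not from the thesis implementation; the sketch shows how the top 50 pairs by classifier confidence would be kept as seeds for the next iteration.

```python
def select_seed_pairs(scored_pairs, k=50):
    """Keep the top-k pairs by classifier confidence as positive
    seeds for the next iteration."""
    ranked = sorted(scored_pairs, key=lambda p: p[1], reverse=True)
    return [pair for pair, score in ranked[:k]]

# Parameter settings chosen in Section 4.4.2 (names are illustrative).
config = {
    "algorithm": "stMIL",
    "features": "text+synt",
    "C": 1,                     # SVM regularization
    "kernel": "quadratic",
    "bag_size": 10,
    "seeds_per_iteration": 50,
}

# Toy scored candidates: (pair, confidence).
scored = [(("範圍", "廣"), 0.92), (("者", "多"), 0.31), (("情形", "普遍"), 0.85)]
print(select_seed_pairs(scored, k=2))
```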

New Pairs Extracted from Corpus

In Table 4.9, we show the new pairs extracted for the four relations, AtLocation, HasProperty, CapableOf, and IsA, at the first iteration. The performance differs across relations. Each relation is discussed below:

Figure 4.5: Regularization comparison of HasProperty for MIL algorithms.

Figure 4.6: Kernel comparison of HasProperty for MIL algorithms.

Bag size    | 5     | 10    | 20   | 30
HasProperty | 10653 | 3434  | 1075 | 520
CapableOf   | 23583 | 3781  | 1102 | 529
AtLocation  | 11659 | 3930  | 1286 | 657
IsA         | 48604 | 16060 | 4962 | 2496

Table 4.8: Number of testing data with different bag sizes for the 4 relations

Figure 4.7: Bag size comparison of HasProperty for MIL algorithms.

HasProperty The first column in Table 4.9 shows the relation HasProperty. HasProperty(A,B) expresses that A has B as its property. The results for HasProperty make sense.

For example, (範圍, 廣) means the scope is wide; (可能性, 大) means the possibility is high; and (情形, 普遍) means the situation is common. The only nonsense pair of HasProperty is the 9th pair (者, 多), where "者" is not a complete word; the noise might come from preprocessing.

CapableOf The second column in Table 4.9, which expresses the relation CapableOf, also performs well. CapableOf(A,B) indicates that A is able to do B. A common type of noise occurs in this relation: sometimes the relation between A and B is a passive action.

For example, (消息, 傳來) should be modified to (消息, 被傳來) because news is conveyed by some means but does not convey itself.

AtLocation AtLocation and IsA have another problem: in some cases the extracted pairs do not express the target relation but instead form a compound noun. For example, AtLocation(A,B) suggests that B is the location of A, but the top three candidates (人文, 研究所), (學生, 輔導組), and (生物, 研究所) are compound words, meaning "Graduate Institute of Humanities", "Student Assistance Division", and "Graduate Institute of Biology".

IsA The compound-noun problem is more serious for IsA. IsA(A,B) indicates that A is a subtype or specific instance of B, but the extracted pairs are words frequently used together. For example, combining (兄弟, 姊妹) into the term 兄弟姊妹 yields the word for siblings, and (孩子, 們) could be combined as 孩子們, the plural of child.

The problem is particularly severe for IsA and could not be fixed within a few iterations.
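Purely as an illustration of this failure mode (not a remedy used in the thesis): a candidate pair whose concatenation is itself a single corpus token is likely a compound noun rather than a relation instance. A minimal, hypothetical check:

```python
def looks_like_compound(pair, corpus_vocab):
    """Flag candidate pairs whose concatenation is itself a single
    corpus token, suggesting a compound noun rather than a relation.
    corpus_vocab is a hypothetical set of known corpus tokens."""
    a, b = pair
    return (a + b) in corpus_vocab

# Toy vocabulary containing the compounds discussed above.
vocab = {"兄弟姊妹", "孩子們", "範圍"}
print(looks_like_compound(("兄弟", "姊妹"), vocab))  # compound: True
print(looks_like_compound(("範圍", "廣"), vocab))    # genuine pair: False
```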

Iterative Learning

The mechanism of iterative learning is designed to address the insufficiency of the initial seeds. After extracting pairs at the first iteration, we analyze the result and identify the inappropriate pairs. These wrong cases can be added to the seed set of the next iteration, helping the machine learn from wrong examples. The improvement for AtLocation in the second iteration is shown in Table 4.10. Considering the top 30 extracted pairs, the precision increases from 50% to 60%, which is not a very obvious improvement. However, the precision of the top 10 results grows from 20% to 90%, which shows that the extraction result can be improved by human validation.
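The seed-update step described above can be sketched as follows. The structure (a dict of positive and negative seed sets) and the function name are illustrative assumptions, not the thesis implementation; the point is that human-labeled extractions are folded back into the seeds, with wrong cases serving as negative examples.

```python
def next_seed_set(seeds, labeled_candidates):
    """Fold human-labeled extractions back into the seed set:
    correct pairs become positive seeds, wrong ones negative seeds."""
    positives = set(seeds["pos"])
    negatives = set(seeds["neg"])
    for pair, is_correct in labeled_candidates:
        (positives if is_correct else negatives).add(pair)
    return {"pos": positives, "neg": negatives}

# Toy example with AtLocation candidates from Table 4.9.
seeds = {"pos": {("記者", "新竹")}, "neg": set()}
labeled = [(("體育", "美國"), True), (("人文", "研究所"), False)]
updated = next_seed_set(seeds, labeled)
```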

As for the results over more iterations, Figure 4.8 shows the precision from the 1st to the 6th iteration. For each relation, we evaluate the precision of the top 50 candidates; after being labeled, the candidates are fed back as the seeds of the next iteration. HasProperty improves greatly in the 2nd and 3rd iterations. AtLocation starts at only 50% but grows steadily afterwards. Although CapableOf is not outstanding, it reaches 72% at best, gaining 14% over the first iteration. The results show that iterative learning helps relation extraction.
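The precision figures quoted above are precision-at-k over a confidence-ranked candidate list. A minimal sketch of that metric (the function name is illustrative):

```python
def precision_at_k(labels, k):
    """Precision over the top-k candidates, where labels is a
    confidence-ranked list of True/False human judgments."""
    top = labels[:k]
    return sum(top) / len(top)

# Toy ranked judgments for illustration only; the real numbers
# come from labeling the top candidates of each relation.
ranked = [True, False, True, True, False]
print(precision_at_k(ranked, 4))  # 3 correct out of 4 -> 0.75
```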

No. | HasProperty | CapableOf    | AtLocation     | IsA
1   | 範圍, 廣    | 消息, 傳來   | 人文, 研究所   | 觀眾, 朋友
2   | 可能性, 大  | 雞, 殺死     | 學生, 輔導組   | 經濟, 證券
3   | 風氣, 盛    | 問題, 追     | 生物, 研究所   | 兄弟, 姊妹
4   | 情形, 普遍  | 全民, 辦     | 政府, 中心     | 群眾, 運動
5   | 比率, 高    | 蛇, 咬       | 體育, 美國     | 高層, 人士
6   | 比例, 低    | 功課, 做完   | 財團, 基金會   | 經濟, 建設
7   | 力氣, 大    | 陽光, 普照   | 記者, 大陸     | 孩子, 們
8   | 情況, 嚴重  | 人, 批評     | 記者, 股市     | 教務, 主任
9   | 者, 多      | 業者, 推出   | 資訊, 市場     | 體育, 新聞
10  | 問題, 嚴重  | 警方, 逮捕   | 台北訊, 台北市 | 金融, 機構
11  | 速度, 慢    | 手, 持       | 記者, 新竹     | 遊戲, 規則
12  | 機率, 高    | 風, 吹       | 政府, 學校     | 專家, 們
13  | 頻率, 高    | 面, 露       | 人, 世界       | 朋友, 們
14  | 意願, 高    | 手, 執       | 體育, 台北     | 報章, 雜誌
15  | 能力, 強    | 焦點, 放     | 歷史, 研究所   | 經濟, 證券
16  | 範圍, 廣    | 矛頭, 指向   | 國家, 委員會   | 中樞, 神經
17  | 可能性, 高  | 老師, 帶     | 廠商, 市場     | 工人, 們
18  | 變化, 快    | 人, 穿       | 資訊, 世界     | 政府, 官員
19  | 知名度, 高  | 政府, 遷     | 記者, 公司     | 生命, 週期
20  | 影響, 深遠  | 門, 打開     | 歷史, 研究所   | 宗教, 信仰
21  | 意願, 高    | 業者, 分析   | 資訊, 市場     | 命運, 共同體
22  | 成本, 高    | 事情, 做對   | 都市, 中心     | 人民, 代表
23  | 功課, 好    | 媽媽, 看     | 政府, 臺       | 小朋友, 們
24  | 關係, 密切  | 老師, 教     | 記者, 基隆     | 交通, 工具
25  | 感情, 好    | 重心, 放     | 智財權, 國內   | 風土, 人情
26  | 意願, 高    | 人, 提出     | 產品, 市場     | 爸, 媽
27  | 負擔, 重    | 小朋友, 學習 | 記者, 台北市   | 專家, 們
28  | 層面, 廣泛  | 頭, 戴       | 電腦, 市場     | 動作, 動詞
29  | 效果, 不錯  | 人, 騎       | 教師, 國小     | 婚紗, 攝影
30  | 可能性, 大  | 兵, 分       | 記者, 基隆     | 言行, 舉止

Table 4.9: Top 30 pairs extracted from the corpus at the first iteration, sorted by confidence value.

First iteration (1-15) | First iteration (16-30) | Second iteration (1-15) | Second iteration (16-30)
X 人文, 研究所   | X 國家, 委員會 | O 主席, 大陸   | X 小組, 台灣
X 學生, 輔導組   | O 廠商, 市場   | O 女士, 研究所 | X 社會, 研究所
X 生物, 研究所   | O 資訊, 世界   | X 人文, 研究所 | X 兒童, 中心
X 政府, 中心     | O 記者, 公司   | O 義工, 醫院   | O 資料庫, 圖書館
O 體育, 美國     | X 歷史, 研究所 | O 專利, 美國   | X 社會, 研究所
X 財團, 基金會   | X 資訊, 市場   | O 工作, 公司   | X 經濟, 臺灣
O 記者, 大陸     | X 都市, 中心   | O 教授, 大學   | O 記者, 基隆
X 記者, 股市     | O 政府, 臺     | O 活動, 戶外   | X 貿易, 公司
X 資訊, 市場     | O 記者, 基隆   | O 教授, 美國   | X 國家, 委員會
X 台北訊, 台北市 | O 智財權, 國內 | O 先生, 大學   | X 小組, 地區
O 記者, 新竹     | O 產品, 市場   | X 科學, 委員會 | O 經理, 台灣
X 政府, 學校     | O 記者, 台北市 | O 技術, 國內   | X 團體, 基金會
O 人, 世界       | X 電腦, 市場   | X 財團, 基金會 | O 妹妹, 家
O 體育, 台北     | O 教師, 國小   | O 產品, 國外   | O 院士, 院
X 歷史, 研究所   | O 記者, 基隆   | O 律師, 美國   | O 工人, 工會

Table 4.10: Results of AtLocation at the first and second iteration. O marks a correct pair; X marks an incorrect one.

Figure 4.8: Comparison of the precision by iterations

Chapter 5 Conclusion

This chapter summarizes the work done in this thesis along with its contributions. Then we discuss the future work of our relation extraction system, which could further help improve the existing knowledge base.

5.1 Summary and Contribution

This thesis is motivated by the problem that the Chinese ConceptNet lacks reliable data.

To solve this problem, we develop a framework for extracting concept pairs from a corpus based on the existing pairs in the knowledge base. The extracted pairs can be used for knowledge expansion. The idea of distant supervision is adopted to avoid costly human labeling by assuming that if two entities participate in a relation, all sentences containing the two entities convey the relation. To reduce the errors caused by this strong assumption, multiple instance learning (MIL) is adopted to relax it: we assume instead that at least one sentence mentioning the entity pair expresses the relation.

Because multiple instance learning for distant supervision is introduced into the relation extraction problem, the system learns each relation with bags. The constraint on bags guarantees that a positive bag contains at least one positive instance while a negative bag contains only negative instances. Each pair is expressed as a bag whose instances are sentences, stored in natural-language-processing (NLP) feature formats.
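The bag structure and its constraint can be sketched as follows. The class and field names are illustrative assumptions, not the thesis implementation; the sketch only encodes the MIL constraint stated above.

```python
from dataclasses import dataclass, field

@dataclass
class Bag:
    """A concept pair together with the corpus sentences mentioning it.
    Field names are illustrative, not from the thesis implementation."""
    pair: tuple
    label: int                                     # +1 positive, -1 negative
    instances: list = field(default_factory=list)  # sentence feature vectors

def satisfies_mil_constraint(bag, instance_labels):
    """Positive bags must contain at least one positive instance;
    negative bags must contain only negative instances."""
    if bag.label > 0:
        return any(l > 0 for l in instance_labels)
    return all(l <= 0 for l in instance_labels)

# Toy bags: a positive pair with one supporting sentence, and a
# negative pair where no sentence expresses the relation.
pos_bag = Bag(pair=("範圍", "廣"), label=+1)
neg_bag = Bag(pair=("者", "多"), label=-1)
```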

In the experiments, concept pairs in the Chinese ConceptNet, labeled for correctness, are applied as seeds, while sentences in the Sinica Corpus serve as the source. Although not all relations could be efficiently extracted at the first iteration, the wrong output could be revised by manual labeling and fed back as seeds in the next iteration. Mistakes could thus be corrected promptly, enhancing the performance of relation extraction.

We also provide a brief comparison of different MIL algorithms. The stMIL algorithm models the situation in which positive bags contain few positive instances, which fits our data and problem. Among all the MIL algorithms we have mentioned, stMIL performs the best on our problem.

To sum up, we proposed an iterative learning framework that requires little human effort and generates largely correct new pairs for several relations from a corpus.
