
國立臺灣大學電機資訊學院資訊工程學系 碩士論文

Department of Computer Science and Information Engineering College of Electrical Engineering and Computer Science

National Taiwan University Master Thesis

半監督式學習於中文關係抽取以擴充知識庫之研究 Chinese Relation Extraction by Semi-Supervised Learning

for Knowledge Base Expansion

陳昱儒 Yu-Ju Chen

指導教授:許永真博士

Advisor: Jane Yung-jen Hsu, Ph.D.

中華民國 104 年 10 月 October, 2015


誌謝

三年的研究生生活即將告終,這一路走來,最感謝我的指導教授

——許永真老師。從大學專題開始跟著老師做研究,從一開始的不知 所措,到之後漸入佳境。每一次在研究上迷失方向時,老師都會提出 很好的想法與建議,讓我在挫折之時能夠打起精神繼續努力。除了研 究上與課業上的指導,老師的生活體驗與小故事也讓我大開眼界,和 老師相處就像讀一本好書,能夠引發思考與激勵人心。

論文的完成也要謝謝論文口試委員,在口試時給予專業的評論與建 議。尤其感謝蔡宗翰老師,在口試前替我指出許多需要注意的問題。

研究的路上並不孤獨。謝謝 iAgent 實驗室的同學、學長姐及學弟 妹,無論是生活上的陪伴,或是研究上的互助,都是研究生活中重要 的精神糧食。

最後,謝謝家人與朋友的支持與關心,讓我能夠心無旁騖地完成我 的論文。特別感謝張雅軒,無時無刻地給予我許多協助,幫助我度過 難關。


摘要

「關係抽取」(Relation Extraction) 意指從文本中學習有語意關係的詞對(Concept Pair),例如(台北,台灣)的關係是「... 位於...」。此論文探討藉由關係抽取以擴增常識知識庫的方法。監督式學習是目前發展完整的方法之一,但是必須要有大量的標記資料才能達到好的效果。取而代之的是疏離監督式學習。疏離監督式學習是半監督式學習的一種,過去被用在無標記資料的關係抽取。針對知識庫中的某個關係,找出相關的詞對作為基礎,以此對大量未標記的文本自動做弱標記(Weakly Label),並作為訓練資料。這些詞對被預先標記關係,文本中提及這些詞對的句子會被自動標記與詞對相同的關係。此方法可以快速標記大量資料,但是當文本與知識庫的來源沒有關聯性時,標記的結果會很不可靠。

為了減少錯誤標記造成的學習錯誤,我們在疏離監督式學習中加入多實例學習的假設。多實例學習的訓練資料必須為袋裝形式,用於學習二元分類。每一袋訓練資料都會有 +1 或 -1 標記。標記為 +1 的袋子中包含至少一個 +1 的實例;標記為 -1 的袋子只會有 -1 的實例。我們將提及同一種詞對的句子裝進袋中,並使用多實例學習對未知的袋子做分類。

我們以語意網(ConceptNet)作為標記基礎,中研院平衡語料庫的文本當作訓練資料,實作中文關係抽取的實驗,並比較單實例學習與多種多實例學習演算法的實驗結果。該實驗從文中抽取下列四種關係的詞對:AtLocation,CapableOf,HasProperty,及 IsA。這個研究證實了我們的方法能夠藉由其他語料改進知識庫。


Abstract

This thesis investigates relation extraction, which learns semantic relations of concept pairs from text, as an approach to mining commonsense knowledge. To achieve good performance, state-of-the-art supervised learning requires a large labeled training set, which is often expensive to prepare. As an alternative, distant supervision, a semi-supervised learning method, was adopted to extract relations from unlabeled corpora. A training set consisting of a large number of sentences can be weakly labeled automatically based on a set of concept pairs for any given relation in a knowledge base.

Labels generated with heuristics can be quite noisy. When the sources of sentences in the training set are not correlated with the knowledge base, the automatic labeling mechanism is unreliable. Instead of assuming all sentences are labeled correctly in the training set, multiple instance learning learns from bags of instances, provided that each positive bag contains at least one positive instance while negative bags contain only negative instances.

We conducted experiments on relation extraction in Chinese using concept pairs in ConceptNet, a commonsense knowledge base, as the seeds for labeling a set of predefined relations. The training bags were generated from the Sinica Corpus. The performance of multiple instance learning is compared with single-instance learning and a few other learning algorithms. Our experiments extracted new pairs for the relations “AtLocation”, “CapableOf”, “HasProperty”, and “IsA”. This study showed that a knowledge base can be improved by another corpus using the proposed approach.


Contents

口試委員會審定書 iii

誌謝 v

摘要 vii

Abstract ix

1 Introduction 1

1.1 Motivation . . . 1

1.1.1 Knowledge in Natural Language . . . 2

1.1.2 Situation of Chinese ConceptNet . . . 2

1.2 Problem Description . . . 2

1.3 Proposed Solution . . . 3

1.4 Thesis Organization . . . 4

2 Background 5

2.1 Knowledge . . . 5

2.1.1 Knowledge Representation . . . 5

2.1.2 Knowledge Extraction . . . 6

2.2 Related Work of Relation Extraction . . . 6

2.2.1 Supervised Learning . . . 7

2.2.2 Distant Supervision . . . 9

2.2.3 Multiple Instance Learning . . . 9


2.3 Relation Extraction in Chinese . . . 10

2.3.1 Characteristics in Chinese Relation Extraction . . . 10

2.3.2 Related Work in Chinese . . . 12

3 Methodology 13

3.1 Problem Definition . . . 13

3.1.1 Notations . . . 13

3.1.2 Relation Extraction Problem . . . 14

3.2 Framework . . . 14

3.2.1 Bag Generator . . . 16

3.2.2 Relation Predictor . . . 16

3.2.3 Pair Evaluator . . . 18

3.3 Features of Data . . . 18

3.4 Assistant Labeling . . . 21

3.5 Multiple Instance Learning . . . 22

3.5.1 Single Instance Learning: a Naive Approach . . . 22

3.5.2 Semi-Supervised Approach . . . 22

3.5.3 Multiple-Instance Classification Algorithm . . . 23

3.5.4 Support Vector Machine for Multiple Instance Learning . . . 24

3.5.5 Multiple Instance Learning for Sparse Positive Bags . . . 25

3.6 Evaluation Process . . . 26

4 Experiment and Result 27

4.1 Dataset . . . 27

4.1.1 ConceptNet . . . 27

4.1.2 Sinica Corpus . . . 29

4.2 Experiment Setting . . . 30

4.2.1 Multiple Relations . . . 30

4.2.2 Bag Size Selection . . . 31

4.2.3 Multiple Instance Learning . . . 34


4.2.4 Experiment Description . . . 35

4.3 Evaluation . . . 36

4.4 Result and Discussion . . . 37

4.4.1 Parameter Selection . . . 37

4.4.2 System Evaluation . . . 39

5 Conclusion 45

5.1 Summary and Contribution . . . 45

5.2 Future Work . . . 46

Bibliography 47

A List of CKIP Part-Of-Speech Tag 51

B List of Relations in Chinese ConceptNet 55


List of Figures

1.1 Graph structure in ConceptNet . . . 3

1.2 Simple illustration of problem . . . 3

3.1 Framework of the relation extraction system . . . 15

3.2 Example of bag . . . 17

3.3 Example of feature - parsing tree . . . 20

4.1 Distribution of seed occurrence . . . 32

4.2 Process for generating bags for a seed . . . 33

4.3 Relationship between bag size and seed number . . . 33

4.4 Feature comparison of HasProperty for MIL algorithms . . . . 39

4.5 Regularization comparison of HasProperty for MIL algorithms . . . . 40

4.6 Kernel comparison of HasProperty for MIL algorithms . . . . 40

4.7 Bag size comparison of HasProperty for MIL algorithms . . . . 41

4.8 Comparison of the precision by iterations . . . 44


List of Tables

2.1 Relation types and subtypes defined in ACE 2005 . . . 7

3.1 List of features . . . 19

3.2 Example of feature - part-of-speech tag . . . 20

3.3 Example of feature - dependency . . . 20

4.1 List of relations of Chinese ConceptNet . . . 28

4.2 Example of correct and incorrect pairs in Chinese ConceptNet . . . 29

4.3 Example of sentences in Sinica Corpus . . . 30

4.4 Four relations used in experiment . . . 31

4.5 The most frequent pairs of part-of-speech tags of four relations . . . 31

4.6 Description of algorithms used in experiment . . . 34

4.7 List of feature types used in experiment . . . 38

4.8 Number of testing data with different bag sizes for 4 relations . . . 40

4.9 Top 30 pairs extracted from the corpus at the first iteration . . . 43

4.10 Results of AtLocation at the first and second iteration . . . . 44


Chapter 1 Introduction

This thesis begins with an overview. Chapter 1 includes the motivation and problem description in the first two sections. The proposed solution to the problem is given in the following section. The last section of this chapter outlines the organization of this thesis.

1.1 Motivation

In the age of information explosion, the amount of knowledge grows rapidly. Whether newly created or restated, such knowledge is raw information that machines cannot use directly. The gap between raw and machine-usable information should be filled. Extracting knowledge from raw data and representing it in a machine-readable structure are essential for computers to reason about the world. To this end, knowledge bases are built to store and provide information for computers.

Constructing a knowledge base takes effort and often relies on human intelligence. In addition, completeness and reliability are major concerns when knowledge bases are put into practice. An existing knowledge base can be enriched by adding new sources, and the huge amount of knowledge expressed in natural language is one such source for improving both completeness and reliability. In this thesis, we take the Chinese part of ConceptNet[19], a multilingual commonsense knowledge base, as the target to be improved.


1.1.1 Knowledge in Natural Language

Knowledge can be expressed in many forms, including sentences. Many sentences and articles are created on the web in the form of news, blogs, technical articles, and so on. It remains challenging for machines to understand the meaning of articles written in natural language, so transforming the knowledge they contain into machine-readable forms would remove this barrier. For example, the sentence “National Taiwan University (NTU) is a university located in Taipei” contains two statements: “NTU is a university” and “NTU is located in Taipei”. If these two statements are stored in a knowledge base, a computer can know that NTU is a university in Taipei.

1.1.2 Situation of Chinese ConceptNet

ConceptNet[19] is a multilingual knowledge base containing human commonsense knowledge. Its sources include other knowledge bases, dictionaries, and crowdsourcing, mostly in English. For the Chinese data in ConceptNet, the only source is crowdsourcing from an online game[16]. Crowdsourcing may introduce noisy data when the verification mechanism is not strict enough. Adding another source to Chinese ConceptNet can strengthen the reliability of the data and also helps mitigate the problem of biased knowledge in ConceptNet.

1.2 Problem Description

Given the huge amount of information created as articles, this thesis aims at extracting information from Chinese sentences in order to add a new source to an existing Chinese knowledge base. We choose a graph-structured knowledge base as the target to improve. Taking ConceptNet as an example, a node represents a concept, and an edge between two nodes represents the relation between the two concepts. Two nodes and the edge between them express a proposition, which is defined as an assertion. In Figure 1.1, the statement “a car is used for driving” is kept in the graph structure, where “car” and “drive” are two concepts and “UsedFor” is the relation between them.


Figure 1.1: Illustration of the graphical structure in ConceptNet. Image source: https://github.com/commonsense/conceptnet5/wiki/Graph-structure

Figure 1.2: Relation extraction from a collection of documents.

The objective of this thesis is to create more assertions to improve the quality and quantity of ConceptNet. We fix a target relation and generate new concept pairs for that relation from sentences. That is, the task completed in this thesis is relation extraction from documents. Concretely, the problem can be viewed as transforming articles into pairs of concepts labeled with relations. As Figure 1.2 shows, for each relation, a list of concept pairs is generated from the articles.

1.3 Proposed Solution

To solve the problem, we propose a process that transforms sentences into pairs related to specific relations. For each relation, a set of seed pairs is given for training a model that can predict the relation of new pairs. Each sentence is preprocessed into a set of entities, and candidate pairs are then generated so that the relation between them can be predicted. The algorithm for training the model is a multiple instance learning method, which classifies the candidate pairs into different relations. The training process runs iteratively to discover more new pairs. The detailed solution is explained in Chapter 3.


1.4 Thesis Organization

After the overview in Chapter 1, the background on knowledge and the related work on relation extraction are provided in Chapter 2. The related work is categorized by approach, followed by work specializing in Chinese. Chapter 3 presents the problem definition and the framework, along with the learning algorithms. Experiment settings and results are presented in Chapter 4, together with a discussion of the methods. Finally, Chapter 5 concludes this work and brings up future work.


Chapter 2 Background

In this chapter, the first section introduces knowledge representation and extraction. The second section reviews the related work on relation extraction, including supervised learning, distant supervision, and multiple instance learning methods. The last section describes the characteristics of Chinese relation extraction together with related work in Chinese.

2.1 Knowledge

In the real world, knowledge exists in different forms, including text, pictures, and so on. Humans can understand the meaning and retrieve knowledge from free text. Computers, however, require machine-readable representations for reasoning. Before knowledge can be represented for computers, knowledge in unstructured form must first be extracted.

2.1.1 Knowledge Representation

Knowledge is the understanding of information in specific domains and is used for solving problems or reasoning about the world. Thus, representation schemes are required to express knowledge [23]. In AI Magazine 1999, Randall Davis[12] identified five roles of knowledge representation: (1) a surrogate for reasoning about the world, (2) a set of ontological commitments used for answering questions, (3) a fragmentary theory of intelligent reasoning used for inference, (4) a medium for efficient computation, and (5) a medium of human expression.


To fulfill these requirements, several schemes have been created, such as frames[21], the Semantic Web[5], and ontologies. The Semantic Web, proposed by Tim Berners-Lee[5], is a common form of knowledge representation. It is a directed graph with labeled edges, where nodes are concepts and the edges between nodes are the semantic relations of two concepts. In Chapter 1, we introduced ConceptNet, which is a semantic graph representing commonsense knowledge.

2.1.2 Knowledge Extraction

Knowledge extraction is the process of transferring knowledge from one source into another, machine-readable, target. The source can be not only structured data such as databases and ontologies, but also unstructured data such as text, images, or video. Knowledge extraction from natural language sources aims at deriving information from human language text, and the semantic relations in text are one of the targets.

Automatic Content Extraction (ACE)1, a track of the Text Analysis Conference (TAC) since 2009, aims at developing novel methods to extract information from natural language text. In this program, entities, relations, and events are extracted. Since 2003, the program has contained multilingual tracks, including English, Arabic, and Chinese. The released data, created by the Linguistic Data Consortium (LDC), have been used for supervised learning. Work related to this thesis is discussed in Section 2.2.1.

The relation extraction track in ACE defined several relations in multiple languages. In ACE 2005, the relations were categorized into several types2 (shown in Table 2.1), and the task aimed at extracting entity pairs of these relations.

2.2 Related Work of Relation Extraction

In 1999, Brin [7] extracted book information, namely the relation between author and title, with a pattern matching method. Since then, much more work on relation extraction has appeared.

1Automatic Content Extraction: https://www.ldc.upenn.edu/collaborations/past-projects/ace

2ACE 2005 Chinese Annotation Guidelines for Entities: https://www.ldc.upenn.edu/sites/www.ldc.upenn.edu/files/chinese-entities-guidelines-v5.5.pdf


Relation Type         Subtypes
Physical              located, near
Part-Whole            geographical, subsidiary, artifact
Personal-Social       business, family, lasting-personal
ORG-Affiliation       employment, ownership, founder, student-alum, sports-affiliation, investor-shareholder, membership
Agent-Artifact        user-owner-inventor-manufacturer
General-Affiliation   citizen-resident-religion-ethnicity, org-location-origin

Table 2.1: Relation types defined in ACE 2005.

Supervised learning was often used after the ACE program introduced the relation extraction track. With the growing amount of information, however, supervised learning became too costly for such large data, so semi-supervised learning methods were employed. One semi-supervised method, distant supervision[22], was proposed in 2009; it deals with a large amount of unlabeled data with the assistance of an external knowledge base. However, if the data and the knowledge base are barely correlated, the automatic labeling mechanism may produce wrong labels. To mitigate this problem, multiple instance learning algorithms have been combined with distant supervision. More details on relation extraction with supervised learning, distant supervision, and multiple instance learning are given in this section.

2.2.1 Supervised Learning

When applying supervised learning to the relation extraction problem, a set of training data is required and the problem is formulated as a classification task. When a single relation is considered, the problem can be viewed as a binary classification task that decides whether the relation holds for a given entity pair. Given a sentence, i.e., a sequence of words s = (w1, w2, ..., wn), and a relation r, the function fr decides whether the relation r exists between the word pair (wi, wj), 1 ≤ i, j ≤ n. fr is formulated as follows:

f_r(T(s, w_i, w_j)) = \begin{cases} 1, & \text{if } w_i \text{ and } w_j \text{ are linked by } r, \\ -1, & \text{otherwise,} \end{cases} \qquad (2.1)

where T is a function transforming the data into feature form.
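As an illustration of Equation (2.1), the following is a minimal Python sketch of a feature-based binary relation classifier, assuming scikit-learn is available; the toy feature dictionaries stand in for the output of T(s, wi, wj) and are hypothetical, not the feature set used in the cited work.

from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import LinearSVC

# Toy training data: each item is (feature dict produced by T, label in {+1, -1}).
train = [
    ({"W1": "Taipei", "W2": "Taiwan", "WBF": "is", "WBL": "of"}, +1),
    ({"W1": "Alice", "W2": "Taipei", "WBF": "went", "WBL": "to"}, -1),
]

vectorizer = DictVectorizer()
X = vectorizer.fit_transform([features for features, _ in train])
y = [label for _, label in train]

clf = LinearSVC()                     # plays the role of f_r in Equation (2.1)
clf.fit(X, y)

test = {"W1": "Tokyo", "W2": "Japan", "WBF": "is", "WBL": "of"}
print(clf.predict(vectorizer.transform([test])))   # -> [1] or [-1]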

In a survey by Bach[4], supervised relation extraction methods are separated into two categories: feature-based and kernel-based methods. In terms of the supervised learning process, the former focuses on the data while the latter focuses on the algorithm.

Feature-Based Methods

Feature-based methods focus on the features used for training a classifier. These methods usually employ a state-of-the-art supervised learning algorithm and optimize performance by choosing a better combination of data features. Zhou[32] trained a relation classifier with SVM on ACE data, which are labeled with 24 ACE relation types. The work concentrated on the features extracted for training, including lexical, syntactic, and semantic features; more than 30 features were considered. Besides the NLP features, entity types defined by ACE, tags from WordNet, and a name list of countries were also utilized as features.

Kernel-Based Methods

Kernel-based methods learn a classification function without transforming the data into an explicit feature representation. A kernel function K is a similarity function defined on an object space X, such that K : X × X → [0, 1]; K(x, y) is the similarity value of x and y, where x, y ∈ X. An object in X could be a sentence (a sequence of words), a bag of words, or a dependency path representation. For a specific relation r, let x+ be a positive instance conveying r and x− be a negative instance. Then K(x+, y) ≥ K(x−, y) implies that y is more similar to x+ than to x−, that is, y is more likely to convey r.

Zelenko[31] proposed a kernel-based method dealing with two relations: person-affiliation and organization-location. In this work, shallow parsing, a tree-structured analysis of sentences, is used as the data representation, and a sparse subtree kernel is adopted to define the similarity between two sentences.
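As a rough sketch of the kernel-based idea (a simple word-overlap kernel, not the sparse subtree kernel of the cited work), a similarity function can be plugged into an SVM through a precomputed Gram matrix:

import numpy as np
from sklearn.svm import SVC

def jaccard_kernel(a, b):
    # Similarity in [0, 1] between two sentences given as word lists.
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def gram(X, Y):
    return np.array([[jaccard_kernel(x, y) for y in Y] for x in X])

train = [["Taipei", "is", "the", "capital", "of", "Taiwan"],
         ["Alice", "went", "to", "Taipei"]]
labels = [+1, -1]

clf = SVC(kernel="precomputed")
clf.fit(gram(train, train), labels)

test = [["Tokyo", "is", "the", "capital", "of", "Japan"]]
print(clf.predict(gram(test, train)))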

2.2.2 Distant Supervision

Since supervised learning requires a huge amount of labeling, which is usually costly, distant supervision was developed to deal with vast data with little manual labeling. Instead of labeling each example by hand, heuristics such as a knowledge base help label the training data.

Snow[28] took WordNet, an ontology expressing hypernym and hyponym relations, as the knowledge base and learned the hypernym (is-a) relation from text. Given the pairs of words in WordNet, sentences that contain both words of a pair are transformed into dependency path features. With these features, a hypernym classifier is trained to test novel pairs.

Instead of dealing with only the hypernym relation, Mintz[22] discovered pairs for multiple relations, using Wikipedia as the corpus and Freebase[6] as the heuristic knowledge base for training a classification model. The work relies on a strong distant supervision assumption: if two entities participate in a relation in Freebase, then any sentence containing the two entities can express that relation. The sentences are transformed into feature form, including syntactic, semantic, and lexical features, and a classifier is trained with logistic regression. The system generates new instances from a large corpus and can deal with hundreds of relations.

2.2.3 Multiple Instance Learning

With distant supervision, the labeling effort is heavily reduced, but noise becomes a problem when the sources of the corpus and the knowledge base are not correlated. For example, consider the two sentences “Alice was born in Taipei” and “Alice went to Taipei on Saturday”; both contain the two entities “Alice” and “Taipei”. The former indicates the relation BornIn while the latter expresses the relation WentTo. The example shows that different sentences containing the same pair of words can convey different relations. Riedel[26] examined the cases in which the distant supervision assumption is violated. Taking Freebase as the knowledge base for labeling sentences in two corpora, Wikipedia and the New York Times Corpus, Riedel found that 31% of the labels for the New York Times Corpus violate the assumption, while only 13% do for Wikipedia. To avoid relying on the unreasonably strong assumption, multiple instance learning is applied to this problem.

Multiple instance learning (MIL) learns a classifier from a set of training bags, into which data are grouped according to some policy[2]. MIL has been applied to several tasks such as drug discovery, text classification, and image classification.

Bunescu[8] applied multiple instance learning to the relation extraction problem, where bags are indexed by pairs of entities. Yao[30] and Riedel[26] considered entities and mentions at the same time: the entity pairs and the sentences mentioning both entities are modeled with a conditional probability distribution, and each unlabeled mention is given a probability that the relation is expressed in the sentence. Surdeanu[29] extended this to a multiple-instance multiple-label problem, which models the mentions of pairs together with the labels of relations; one model can handle multiple labels and hence deals with multiple relations simultaneously.

2.3 Relation Extraction in Chinese

Most work on relation extraction deals with English data. With the growing amount of Chinese information, more researchers take Chinese data as the research target. This section introduces the characteristics of Chinese relation extraction along with related work.

2.3.1 Characteristics in Chinese Relation Extraction

Relation extraction is a language-dependent task. Although the methods may not differ across languages, performance can be influenced by the language of the data. Chinese has some linguistic features that differ from English:


• The order of concepts and relation[24]: Given a specific relation and two concepts linked by the relation, the order of the concepts and the relation in a sentence may differ. Take the sentence “歐巴馬總統畢業於哈佛法學院 (President Obama graduated from Harvard Law School)” as an example:

– 歐巴馬總統從哈佛法學院畢業
(President Obama, from, Harvard Law School, graduate)

– 歐巴馬總統畢業於哈佛法學院
(President Obama, graduate, from, Harvard Law School)

The two sentences have the same meaning, but the statement order of GraduateFrom(President Obama, Harvard Law School) could be (President Obama, GraduateFrom, Harvard Law School) or (President Obama, Harvard Law School, GraduateFrom).

• No word boundary[17]: For feature-based methods, the selection of features may have an impact on the result. When the sequence of words is taken as one of the features, the resulting word sequence may differ depending on the segmenter. For example, the sentence “歐巴馬總統畢業於哈佛法學院” (President Obama graduated from Harvard Law School) segmented with different tools is listed as follows:

– CKIP3 Word Segmenter4: 歐巴馬 總統 畢業 於 哈佛 法學院
– Stanford Word Segmenter5: 歐巴馬 總統 畢業於 哈佛 法學院

The correct segmentation is 歐巴馬 (Obama), 總統 (president), 畢業 (graduate), 於 (from), 哈佛 (Harvard), 法學院 (law school). There is a slight difference between the results of the CKIP and Stanford word segmenters. The longer the sentence, the more ambiguity and difficulty there is in segmenting it. Since it is difficult to segment Chinese sentences into words accurately, the feature-based method proposed by Li[17] adopted character-based words as features for relation extraction.

3Chinese Knowledge and Information Process

4CKIP Chinese Segmenter (中研院中文斷詞系統): http://ckipsvr.iis.sinica.edu.tw

5Stanford Word Segmenter: http://nlp.stanford.edu/software/segmenter.shtml


• No morphology and lack of function words[24]: In Chinese, nouns and verbs have only one form. For example, both “cat” and “cats” are “貓”, and both “eat” and “ate” are “吃”. In addition, Chinese lacks function words: English words such as “at”, “on”, and “in” may in some cases all map to the single Chinese word “在”, which can cause ambiguity when analysing Chinese sentences.

2.3.2 Related Work in Chinese

Feature-Based Supervised Learning

Li[17] proposed a feature-based approach to Chinese relation extraction and evaluated it on the ACE 2005 data. To avoid the problem of wrong segmentation, character uni-grams and bi-grams substitute for words. The features of the training data include entity types, entity context (characters inside or outside the two entities), and entity position structures (the positional relation of the two entities, e.g., nested, separated, adjacent).
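The character-based idea can be sketched as follows (an illustration only, not Li's exact feature set); it builds uni-gram and bi-gram features from the raw characters between and around the two entities, so no word segmentation is required.

def char_ngram_features(sentence, e1, e2, window=3):
    # Character uni-/bi-gram features around two entities; no word segmentation needed.
    i, j = sentence.index(e1), sentence.index(e2)
    between = sentence[i + len(e1):j] if i < j else sentence[j + len(e2):i]
    left = sentence[max(0, min(i, j) - window):min(i, j)]
    feats = {}
    for text, prefix in [(between, "mid"), (left, "left")]:
        for n in (1, 2):
            for k in range(len(text) - n + 1):
                feats["%s_%dgram_%s" % (prefix, n, text[k:k + n])] = 1
    return feats

print(char_ngram_features("歐巴馬總統畢業於哈佛法學院", "歐巴馬", "哈佛"))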

Kernel-Based Supervised Learning

Kernel-based methods were used by Huang[15], Liu[18], and Che[10]. Huang and Liu used tree kernels to extract Chinese relations. Che modified the edit distance into an improved string kernel, which considers the context of words instead of only the strings themselves.

Open Relation Extraction

Instead of training one model per relation, open relation extraction (ORE) learns multiple relations at the same time. Qiu[24] established a Chinese ORE system featuring iterative semantic tagging. Chen[11] learns Chinese relations by defining soft constraints and adopts a maximum-entropy method in the relation extraction system.


Chapter 3 Methodology

The first section defines the problem of relation extraction. The framework for solving this problem is presented in the next section. The framework includes four key components: data labeling, feature extraction, model learning, and iterative training, which are explained in the following sections.

3.1 Problem Definition

In our relation extraction scenario, given a set of entity pairs as seeds indicating a relation, we extract new pairs representing that relation from a corpus. The details are described in this section, beginning with the notation used in this thesis.

3.1.1 Notations

First, let C denote a corpus. Each s ∈ C is a sentence, which consists of words. Given a corpus C, an entity set is defined as E = {e | e is a word in C}. Let R denote a relation set. Each r ∈ R is a relation, corresponding to a seed set Sr = {(ei, ej) | ei, ej ∈ E}. A tuple (ei, ej) ∈ Sr indicates that the two entities ei and ej are semantically connected by the relation r. For this problem, a new pair set is defined as Nr = {(ei, ej) | ei, ej ∈ E; (ei, ej) ∉ Sr}.


3.1.2 Relation Extraction Problem

Given a corpus C and a seed set Sr, the relation extraction system creates a new pair set Nr. The pairs in Nr are extracted from C and are not already in Sr.

• Input: a corpus C, a seed set Sr = {(ei, ej) | ei, ej ∈ E}, r ∈ R

• Output: a set of new pairs Nr = {(ei, ej) | ei, ej ∈ E; (ei, ej) ∉ Sr}, r ∈ R

For example, to extract new pairs for the relation AtLocation from a corpus C, the corpus C and a seed set SAtLocation are given as follows. The sentences in C are selected from Wikipedia.

• Seed set SAtLocation

(Taipei, Taiwan)
(Tokyo, Japan)

• Corpus C

Taipei City is the capital city and a special municipality of Taiwan.

Tokyo is the capital and largest city of Japan.

Seoul Special City is the capital and largest metropolis of South Korea.

Beijing is the capital of the People’s Republic of China.

The new seed pairs related to AtLocation are extracted from C, as shown in NAtLocation.

• New seed set NAtLocation
(Seoul, South Korea)
(Beijing, China)
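To make the input/output contract concrete, here is a toy Python sketch in which a naive surface pattern stands in for the learned model (the pattern and names are illustrative only; the actual system is described in Section 3.2):

import re

corpus = [
    "Taipei City is the capital city and a special municipality of Taiwan.",
    "Tokyo is the capital and largest city of Japan.",
    "Seoul Special City is the capital and largest metropolis of South Korea.",
    "Beijing is the capital of the People's Republic of China.",
]
seeds = {("Taipei", "Taiwan"), ("Tokyo", "Japan")}          # S_AtLocation

# Naive "X ... is the capital ... of Y" pattern standing in for the trained predictor.
pattern = re.compile(r"^(\w+).* is the capital .* of (.+?)\.$")

new_pairs = set()
for sentence in corpus:
    match = pattern.match(sentence)
    if match and (match.group(1), match.group(2)) not in seeds:
        new_pairs.add((match.group(1), match.group(2)))

print(new_pairs)    # {('Seoul', 'South Korea'), ('Beijing', 'China')}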

3.2 Framework

The overall framework of the relation extraction system is shown in Figure 3.1, and the process is defined in Algorithm 1. The framework is separated into three parts: the bag generator, the relation predictor, and the pair evaluator.


Algorithm 1 Overall process of relation extraction
Input: a seed set Sr(1), a corpus C, an entity set E, a maximal iteration number M
Output: a set of new pairs Nr
1: generate an unlabeled pair set U = {(ei, ej) | ei, ej ∈ E} from C
2: for t = 1 to M do
3:   generate a labeled bag set Blabel(t) from C and Sr(t) with the Bag Generator
4:   generate an unlabeled bag set Bunlabel(t) from C and U with the Bag Generator
5:   train a Relation Predictor model with Blabel(t)
6:   with the Relation Predictor, predict labels for all bags in Bunlabel(t)
7:   select positive pairs from Bunlabel(t) as Nr(t)
8:   generate the new seed set Sr(t+1) from Nr(t) with the Pair Evaluator
9: end for
10: return Nr(1) ∪ Nr(2) ∪ ... ∪ Nr(M)
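A compact Python rendering of Algorithm 1 is sketched below; bag_generator, train_predictor, and pair_evaluator are placeholders for the components of Sections 3.2.1-3.2.3, not real library calls.

def relation_extraction(seeds, corpus, entities, max_iter,
                        bag_generator, train_predictor, pair_evaluator):
    # Step 1: candidate entity pairs taken from the corpus vocabulary.
    unlabeled_pairs = {(a, b) for a in entities for b in entities if a != b}
    new_pairs = set()
    for t in range(max_iter):                                      # Steps 2-9
        labeled_bags = bag_generator(corpus, seeds)                 # Step 3
        unlabeled_bags = bag_generator(corpus, unlabeled_pairs)     # Step 4
        predictor = train_predictor(labeled_bags)                   # Step 5
        positives = {pair for pair, bag in unlabeled_bags.items()   # Steps 6-7
                     if predictor(bag) == +1}
        new_pairs |= positives                                      # accumulate N_r(t)
        seeds = pair_evaluator(positives)                           # Step 8
    return new_pairs                                                # Step 10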

Figure 3.1: Framework of the relation extraction system


3.2.1 Bag Generator

The bag generator maps pairs to bags. As in the example in Figure 3.2, the bag of the pair (Taipei, Taiwan) consists of sentences from the corpus mentioning both Taipei and Taiwan. The bag generator not only groups sentences into a bag, but also transforms the sentences into feature vectors. Given any pair (ei, ej), ei, ej ∈ E, and a corpus C, a bag b is generated from the sentences s mentioning ei and ej. Thus, b = {v | v = f(ei, ej, s)}, where f is the function transforming a sentence and an entity pair into a feature vector. The details of the features are described in Section 3.3.

The input and output of the bag generator are defined as follows:

• Input: a corpus C, an entity pair (ei, ej), ei, ej ∈ E

• Output: a bag b associated with (ei, ej)

With the bag generator, a set of seeds is mapped to a set of bags. The labels of the seeds are propagated to the corresponding bags. We call this process “Automatic Labeling”; more details are given in Section 3.4.
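A minimal sketch of the bag generator, assuming a pre-tokenized corpus and a feature function f such as the one described in Section 3.3:

def generate_bag(corpus, pair, feature_fn):
    # Collect every sentence mentioning both entities and map it to a feature vector.
    e1, e2 = pair
    return [feature_fn(e1, e2, sentence)
            for sentence in corpus
            if e1 in sentence and e2 in sentence]

# Usage with tokenized sentences and a trivial stand-in feature function.
corpus = [["Taipei", "is", "the", "capital", "of", "Taiwan"],
          ["Alice", "went", "to", "Taipei"]]
print(generate_bag(corpus, ("Taipei", "Taiwan"), lambda e1, e2, s: {"len": len(s)}))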

3.2.2 Relation Predictor

The relation predictor is used for generating new pairs from the corpus through a standard machine learning process. With a labeled bag set Blabel and a learning algorithm A, the predictor is created to predict the label of each bag b ∈ Bunlabel.

In this work, the algorithm A is a multiple instance learning algorithm because of the constraints of the problem. More details of the multiple instance learning process are described in Section 3.5.

The input and output of the relation predictor are defined as follows:

• Input: a labeled bag set Blabel, an unlabeled bag set Bunlabel, a learning algorithm A

• Output: a set of new pairs N


Figure 3.2: Example of bags. The corpus at the top contains sentences selected from Wikipedia. Four bags represent four pairs: (Taipei, Taiwan), (Tokyo, Japan), (Seoul, Korea), and (Beijing, China). The sentences in each bag come from the corpus and are shown to the right of the bag.


3.2.3 Pair Evaluator

To iteratively learn new pairs from the corpus, we update the seed set for each iteration.

To avoid using the false positive pairs as seeds in the next iteration, the result should be evaluated by another mechanism. Here we use human intelligence as the evaluator.

Given the new pair set Nr(t) generated in the tth iteration, we ask human to evaluate the correctness and generate another set Sr(t+1), which is the seed set in the next iteration.

The details of this process are illustrated in Section 3.6.

The input and output of the pair evaluator is defined as following:

• Input: a set of candidate pairs Nr(t)

• Output: a set of confident pairs Sr(t+1)

3.3 Features of Data

When generating the training data, we transform plain text into features. Zhou[32] described several NLP features for feature-based relation extraction, including bag of words, parse tree, entity type, and others. For distant supervision relation extraction, Mintz[22] considered lexical and syntactic features. Following the features used by Zhou and Mintz, we consider the properties of our data and choose the features listed in Table 3.1.

The examples in Table 3.1 refer to the following sentence and entity pair:

• Sentence

昨天下午台灣選手周天成在台北羽球公開賽擊敗林丹

(Yesterday afternoon, the Taiwanese athlete Tien-Chen Chou defeated Dan Lin at the Yonex Open Chinese Taipei)

• Entity pair

台北 (Taipei), 台灣 (Taiwan)


Feature Name   Explanation                                                              Example

Textual Features
W1       the first entity                                                               台灣
W2       the second entity                                                              台北
WBNULL   whether there is no word in between                                            False
WBFL     the only word in between, if exactly one word in between                       Null
WBF      first word in between, when at least two words in between                      選手
WBL      last word in between, when at least two words in between                       在
WBO      other words in between, except the first and last word                         {周天成}
BW1F     first word before the first entity                                             下午
BW1L     second word before the first entity                                            昨天
AW2F     first word after the second entity                                             羽球
AW2L     second word after the second entity                                            公開賽

POS-Tag Features
POSFL    POS tag of the only word, if exactly one word in between                       Null
POSF     POS tag of the first word in between, if at least two words in between         Na
POSL     POS tag of the last word in between, if at least two words in between          P
POSO     POS tags of other words in between, except the first and last word             {Nb}
POS1F    POS tag of the first word before the first entity                              Nd
POS1L    POS tag of the second word before the first entity                             Nd
POS2F    POS tag of the first word after the second entity                              Na
POS2L    POS tag of the second word after the second entity                             Na

Syntactic Features
TREE12   parse tree path between the two entities                                       NR ← NP ← IP → VP → PP → NP → NR
DEP1     bag of dependency tags related to the first entity                             {nn}
DEP2     bag of dependency tags related to the second entity                            {nn}
DEP12    dependency tag between the two entities                                        Null

Others
ORDER    order of the entity pair (A, B)                                                B → A

Table 3.1: List of features, considering the example sentence “昨天下午台灣選手周天成在台北羽球公開賽擊敗林丹” and the entity pair (“台北”, “台灣”).
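The textual features of Table 3.1 can be computed from a segmented sentence roughly as follows (a sketch; the POS-tag and syntactic features would additionally require a tagger and a parser and are omitted):

def textual_features(words, pair):
    # Textual features of Table 3.1 for a segmented sentence (list of words).
    a, b = pair
    ia, ib = words.index(a), words.index(b)
    lo, hi = min(ia, ib), max(ia, ib)
    between = words[lo + 1:hi]
    feats = {"W1": words[lo], "W2": words[hi],            # entities in sentence order
             "WBNULL": len(between) == 0,
             "ORDER": "A->B" if ia < ib else "B->A"}      # order of the pair (A, B)
    if len(between) == 1:
        feats["WBFL"] = between[0]
    elif len(between) >= 2:
        feats["WBF"], feats["WBL"] = between[0], between[-1]
        feats["WBO"] = set(between[1:-1])
    if lo >= 1:
        feats["BW1F"] = words[lo - 1]
    if lo >= 2:
        feats["BW1L"] = words[lo - 2]
    if hi + 1 < len(words):
        feats["AW2F"] = words[hi + 1]
    if hi + 2 < len(words):
        feats["AW2L"] = words[hi + 2]
    return feats

words = "昨天 下午 台灣 選手 周天成 在 台北 羽球 公開賽 擊敗 林丹".split()
print(textual_features(words, ("台北", "台灣")))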


昨天 下午 台灣 選手 周天成 在 台北 羽球 公開賽 擊敗 林丹

Nd Nd Nc Na Nb P Nc Na Na VC Nb

Table 3.2: Example of the part-of-speech tag features. The red words are the entity pair.

Dependency Entity 1 Entity 2

nn 下午 昨天

tmod 擊敗 下午

nn 選手 台灣

nsubj 擊敗 選手

advmod 擊敗 周天成

case 公開賽 在

nn 公開賽 台北

nn 公開賽 羽球

prep 擊敗 公開賽

root ROOT 擊敗

dobj 擊敗 林丹

Table 3.3: Example of the dependency features. The red words are the entity pair.

The part-of-speech tags1, dependencies2, and parse tree3 of this example are illustrated in Table 3.2, Table 3.3, and Figure 3.3, respectively.

1CKIP Word Segmenter: http://ckipsvr.iis.sinica.edu.tw/

2Stanford Parser: http://nlp.stanford.edu/software/lex-parser.shtml

3parsed by Stanford Parser; visualized with Syntax Tree Generator: http://mshang.ca/syntree/

Figure 3.3: Example of the parse tree feature. The words in red circles are the entity pair.


3.4 Assistant Labeling

Since the number of bags is usually very large, it is difficult to label every bag manually. In this section, we propose a process to label the bags automatically. We follow the distant supervision assumption for relation extraction given by Mintz[22]:

Assumption 1. If two entities participate in a relation, all sentences that mention these two entities express that relation.

With Assumption 1, we can match any seed (ei, ej) ∈ Sr to the sentences mentioning ei and ej, and assign those sentences the corresponding label r. But this assumption is too strong, because not all sentences containing (ei, ej) express the relation r. Hoffmann[14] therefore proposed a modified assumption by recasting the problem as a multiple instance learning problem.

Assumption 2. If two entities participate in a relation, at least one sentence that mentions these two entities might express that relation.

According to Assumption 2, given any seed (ei, ej) ∈ Sr and a bag of sentences mentioning ei and ej, at least one sentence in the bag might express r. Instead of describing a relation with a single sentence, here we use a bag to represent a relation. Assumption 2 is modified as follows:

Assumption 3. If two entities participate in a relation, given a bag of sentences that mention these two entities, at least one sentence in the bag might express that relation.

A seed set Sr = Sr+ ∪ Sr−, where Sr+ contains entity pairs with the relation r and Sr− contains entity pairs without the relation r. Any entity pair (ei, ej) in the seed set Sr corresponds to a bag of sentences b ⊂ C, where each sentence s ∈ b contains the two entities ei and ej. If (ei, ej) ∈ Sr+, then y = +1; otherwise, if (ei, ej) ∈ Sr−, then y = −1.

The process of automatic labeling for an entity pair (ei, ej) and its corresponding bag b is shown in Algorithm 2.


Algorithm 2 Process of automatic labeling for training data
Input: a corpus C = {s}; a seed set Sr = {(ei, ej) | ei, ej ∈ E} = Sr+ ∪ Sr−, r ∈ R; a function f(ei, ej, s) transforming a sentence and its entities into features
Output: a labeled bag set Blabel = {(y, b)}
1: initialize an empty bag set B
2: for all p = (ei, ej) ∈ Sr do
3:   generate a sentence set: bp = {v | v = f(ei, ej, s); ei, ej are words in s}
4:   assign the label for bp: yp = +1 if (ei, ej) ∈ Sr+; yp = −1 if (ei, ej) ∈ Sr−
5:   add (yp, bp) to B
6: end for
7: return B
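A minimal Python sketch of Algorithm 2 (the helper names are hypothetical; feature_fn plays the role of f(ei, ej, s)):

def automatic_labeling(corpus, positive_seeds, negative_seeds, feature_fn):
    # Weak labeling: the bag built for a seed pair inherits the seed's label.
    labeled_bags = []
    for label, seed_set in ((+1, positive_seeds), (-1, negative_seeds)):
        for e1, e2 in seed_set:
            bag = [feature_fn(e1, e2, s)
                   for s in corpus if e1 in s and e2 in s]
            if bag:                      # skip pairs never mentioned together
                labeled_bags.append((label, bag))
    return labeled_bags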

3.5 Multiple Instance Learning

Since we adopt the multiple instance learning (MIL) assumption for this problem, we choose multiple instance classification algorithms as the core of the system. This section describes the MIL methods applied in the experiments.

3.5.1 Single Instance Learning: a Naive Approach

A naive approach to MIL learns directly from the instances in the bags: each instance used for training is labeled according to the bag it belongs to. In contrast with multiple instance learning, this approach is called “Single Instance Learning” (SIL). Consistent with the MIL assumption, instances in negative bags are correctly labeled as negative. However, all instances in positive bags are regarded as positive, so the negative instances among them are mislabeled as positive. Even though it suffers from mislabeling, SIL performs well on some problems; Soumya[25] provided an empirical comparison showing that SIL is superior to some MIL algorithms.
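A small sketch of SIL, assuming scikit-learn and bags stored as (label, list of feature dicts) tuples like those produced by the labeling sketch above:

from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import LinearSVC

def train_sil(labeled_bags):
    # Unroll bags into instances; every instance inherits its bag label.
    feats = [v for _, bag in labeled_bags for v in bag]
    labels = [y for y, bag in labeled_bags for _ in bag]
    vectorizer = DictVectorizer()
    clf = LinearSVC().fit(vectorizer.fit_transform(feats), labels)
    return vectorizer, clf

def predict_bag(vectorizer, clf, bag):
    # A bag is predicted positive if any of its instances is predicted positive.
    return +1 if (clf.predict(vectorizer.transform(bag)) == 1).any() else -1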

3.5.2 Semi-Supervised Approach

MissSVM (Multi-Instance learning by Semi-Supervised Support Vector Machine) is an approach proposed by Zhou[33] that combines multiple instance learning with semi-supervised learning. MIL learns from bags, which contain instances with uncertain labels, while semi-supervised learning learns from both labeled and unlabeled data. Hence, MIL can be viewed as a special case of semi-supervised learning.

Zhou[33] defines MIL by unfolding the instances from the bags. X is the instance space. A bag is defined as Xi = {xi,1, xi,2, ..., xi,ni}, xi,j ∈ X, where ni is the size of Xi. The training bags are concatenated by placing the negative bags before the positive bags: the bag set is {X1, X2, ..., Xq, Xq+1, ..., Xm}, where X1 to Xq are negative bags and Xq+1 to Xm are positive bags. All bags are unfolded without changing the order, yielding the instance set {x1,1, x1,2, ..., x1,n1, x2,1, ..., xm,1, xm,2, ..., xm,nm}, which is re-indexed as {x1, x2, ..., xTL, ..., xT}, where T = n1 + n2 + ... + nm is the total number of instances and TL = n1 + ... + nq is the number of instances in negative bags. The original bag set is thus transformed into an instance set, and the problem is defined as Definition 1.

Definition 1. Given a set of labeled negative instances {(x1, −1), (x2, −1), ..., (xTL, −1)} and a set of unlabeled instances {xTL+1, xTL+2, ..., xT}, learn the target function Fs : X → {+1, −1} such that each positive bag Xi contains at least one positive instance.

The definition is a semi-supervised task with a constraint. For any unseen bag X = {x∗,1, x∗,2, ..., x∗,n}, the prediction function F is defined as:

F(X) = \begin{cases} +1, & \text{if there exists } j \in \{1, 2, \dots, n\} \text{ s.t. } F_s(x_{*,j}) = +1, \\ -1, & \text{otherwise.} \end{cases} \qquad (3.1)

After the MIL problem is formulated as a semi-supervised learning problem, semi-supervised learning algorithms can be applied to solve it.

3.5.3 Multiple-Instance Classification Algorithm

MICA (multiple-instance classification algorithm) is an algorithm proposed by Mangasarian[20] that solves the MIL problem by representing each positive bag as a convex combination of the instances in the bag; linear programming is then used to obtain the solution and predict unseen bags.

Given a positive bag, i.e., a set of instances X = {x1, x2, ..., xn}, there is a set of coefficients α = {α1, α2, ..., αn}, with 0 ≤ αi ≤ 1 and α1 + α2 + ... + αn = 1, such that x = α^T X. In the coordinate space, x is a point in the convex hull of the instances, and a decision boundary is found that places x on the positive side. The objective function thereby realizes the MIL assumption.

3.5.4 Support Vector Machine for Multiple Instance Learning

Andrew[3] proposed the algorithm “Support Vector Machines (SVM) for Multiple-Instance Learning (MIL)”, which contains two SVM formulations for the MIL problem: mi-SVM and MI-SVM. Both deal with data in bag form but differ in the optimization target: MI-SVM optimizes at the bag level while mi-SVM optimizes at the instance level.

According to the MIL assumption, all instances in negative bags are definitely negative. For positive bags, the assumption only says that at least one instance is positive; therefore, the labels of the instances in positive bags are unknown. The assumption is stated as Definition 2 by Andrew and formulated as the linear constraints shown in Formula (3.2).

Definition 2. Given a bag set B = {X | X is a bag}, each bag X = {x} is associated with a label Y ∈ {1, −1}, and each instance x ∈ X carries a label y. When YI = −1, all instances in XI are labeled −1. If YI = 1, the instance labels are not determined but are constrained by YI = max_{xi∈XI} yi.

\sum_{i \in I} \frac{y_i + 1}{2} \ge 1 \quad \forall I \ \text{s.t.}\ Y_I = 1, \qquad y_i = -1 \quad \forall i \in I \ \text{s.t.}\ Y_I = -1 \qquad (3.2)

The two SVM formulations for MIL are explained below using standard SVM notation[27].

mi-SVM

mi-SVM aims at maximizing the instance margin. The process learns the best labeling pattern for the positive bags. Without violating the constraints in Formula (3.2), each instance in a positive bag is assigned a label y ∈ {1, −1}, while every instance in a negative bag is assigned y = −1. The support vector machine then learns a model w from the assigned labels, subject to the SVM objective function. The solution of mi-SVM is the model w with the maximal instance margin and is defined in Formula (3.3):

w = \arg\min_{\{y_i\}} \min_{w,b,\xi} \ \frac{1}{2}\|w\|^2 + C \sum_i \xi_i \quad \text{s.t.}\ \forall i:\ y_i(\langle w, x_i \rangle + b) \ge 1 - \xi_i, \ \xi_i \ge 0, \ y_i \in \{-1, 1\} \qquad (3.3)

MI-SVM

MI-SVM aims at maximizing the bag margin. Each positive bag is represented by one instance (the witness) in the bag, and all negative bags are expanded into negative instances. The best witness of each positive bag is selected by optimizing the SVM objective function. The solution of MI-SVM is defined in Formula (3.4), where s is a selector that decides the witness:

w = \arg\min_{s} \min_{w,b,\xi} \ \frac{1}{2}\|w\|^2 + C \sum_I \xi_I \quad \text{s.t.}\ \forall I:\ \big(Y_I = -1 \ \wedge\ -\langle w, x_i \rangle - b \ge 1 - \xi_I \ \forall i \in I\big) \ \text{or}\ \big(Y_I = 1 \ \wedge\ \langle w, x_{s(I)} \rangle + b \ge 1 - \xi_I\big), \ \text{and}\ \xi_I \ge 0. \qquad (3.4)
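These mixed-integer programs are usually solved with a simple alternating heuristic; the sketch below follows that spirit for mi-SVM (scikit-learn's LinearSVC stands in for the SVM solver, and the bag bookkeeping is simplified), so it is an approximation rather than the exact optimization of Formula (3.3).

import numpy as np
from sklearn.svm import LinearSVC

def mi_svm(bags, bag_labels, iters=10, C=1.0):
    # Alternate between imputing instance labels in positive bags and refitting an SVM.
    X = np.vstack([x for bag in bags for x in bag])
    bag_idx = np.concatenate([[i] * len(bag) for i, bag in enumerate(bags)]).astype(int)
    inherited = np.array([bag_labels[i] for i in bag_idx], dtype=float)
    y = inherited.copy()                        # start: every instance inherits its bag label
    clf = None
    for _ in range(iters):
        clf = LinearSVC(C=C).fit(X, y)
        scores = clf.decision_function(X)
        new_y = np.where(scores >= 0, 1.0, -1.0)
        new_y[inherited == -1] = -1.0           # instances of negative bags stay negative
        for i, bag in enumerate(bags):          # each positive bag keeps at least one positive
            mask = bag_idx == i
            if bag_labels[i] == 1 and not (new_y[mask] == 1).any():
                new_y[np.where(mask)[0][np.argmax(scores[mask])]] = 1.0
        if (new_y == y).all():                  # labels stable: stop
            break
        y = new_y
    return clf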

3.5.5 Multiple Instance Learning for Sparse Positive Bags

The family of sparse-positive MIL algorithms deals with data in which positive bags contain only a few positive instances. Bunescu[9] proposed three variants of sparse-positive MIL, explained as follows:

Sparse MIL

Sparse MIL (sMIL) modifies the constraints of SIL. In SIL, every instance in a negative bag is regarded as negative, and likewise every instance in a positive bag as positive. sMIL instead assumes that only a few instances in a positive bag are truly positive, so it favors the situation where positive bags contain few positive instances. sMIL also models large bags: it loosens the constraint when the bag size is large, because it is hard to find a positive instance in a sparse positive bag. sMIL is equivalent to SIL when a positive bag contains only one instance.


Sparse Transductive MIL

The transductive SVM modifies the standard SVM into a constrained version in which the decision boundary is pushed as far from the unlabeled data as possible. In MIL, instances in positive bags can be viewed as unlabeled, since the assumption “at least one instance in the positive bag is positive” means their labels are uncertain. Sparse Transductive MIL (stMIL) replaces the original SVM with a transductive SVM.

Sparse Balanced MIL

Sparse balanced MIL (sbMIL) combines the advantages of SIL and sMIL: the former favors positive bags rich in positive instances, while the latter favors sparse positive bags. sbMIL uses a parameter η to decide the expected fraction of positive instances in positive bags. It first trains a decision function with sMIL on the original bags, and then the instances in positive bags are relabeled according to the distribution decided by η: ranking the instances of a positive bag by the predicted score, the top (η × bag size) instances are regarded as positive and the rest as negative. The relabeled bags are used to train the final decision function with SIL and to predict the final results.

3.6 Evaluation Process

The relation extraction system learns new seeds iteratively. In each iteration, a set of new pairs is extracted from the corpus, and new seeds are selected from these pairs to train the model in the next iteration. To generate new seeds efficiently, we propose a two-step selection mechanism. First, the system ranks the results and automatically selects the most confident candidates. Second, human evaluation judges their reasonableness using common sense: the top N pairs selected by the algorithm are passed to human annotators, who evaluate the candidate pairs and pass the labeled pairs on to the next iteration.
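A small sketch of the two-step selection (the score source and the human_check callback are assumptions standing in for the classifier ranking and the manual annotation step):

def select_new_seeds(scored_pairs, top_n, human_check):
    # Step 1: keep the top-N pairs by classifier score; Step 2: keep those a human accepts.
    ranked = sorted(scored_pairs, key=lambda item: item[1], reverse=True)[:top_n]
    return [pair for pair, score in ranked if human_check(pair)]

# Usage sketch: scores come from the relation predictor; a stub replaces the annotator.
candidates = [(("Seoul", "South Korea"), 0.92), (("太陽", "夏天"), 0.40)]
print(select_new_seeds(candidates, top_n=1, human_check=lambda pair: True))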


Chapter 4

Experiment and Result

This chapter presents the experiments on Chinese relation extraction with the proposed method. First, the data sources and the data generation process are explained. The second section describes the experiment settings, Section 4.3 describes the evaluation method, and the last section discusses the performance of the system.

4.1 Dataset

Distant supervision, as a semi-supervised learning framework, uses a small amount of labeled data and a huge amount of unlabeled data for training. In relation extraction work, the labeled data usually come from an existing dataset, and the unlabeled data come from a huge corpus of sentences, such as a news corpus or Wikipedia. In this thesis, we use Chinese entity pairs from ConceptNet as labeled data and sentences from the Sinica Corpus as unlabeled data.

4.1.1 ConceptNet

ConceptNet[19] was developed by the MIT Media Lab and contained only English data at the beginning. It is a directed graph describing commonsense knowledge, with nodes representing concepts and links representing relations. In ConceptNet, a commonsense fact is called an assertion, which is stored as two nodes (concepts) and one link (relation) in the graph. The knowledge in ConceptNet was originally provided by humans through crowdsourcing.


Relation Name Number of Assertions

AtLocation 32816

CausesDesire 19408

HasProperty 6822

NotDesires 23930

UsedFor 13548

Causes 77336

HasSubevent 40655

PartOf 6159

Desires 21772

IsA 16094

HasFirstSubevent 12046

MadeOf 16357

CapableOf 27444

SymbolOf 4736

MotivatedByGoal 56636

Table 4.1: List of relations in Chinese ConceptNet.

Afterwards, the authors imported knowledge from other datasets such as WordNet, Wikipedia, Wiktionary, DBpedia, and other sources1. Since ConceptNet 3, through collaboration with universities in different countries, ConceptNet has absorbed data in other languages and become a multilingual commonsense dataset. Chinese assertions were collected from an online pet game, in which players feed their pets new commonsense knowledge.

Before ConceptNet 4, the relations in ConceptNet were predefined. When knowledge was collected in Chinese, only 15 relations were considered (shown in Table 4.1). Descriptions and examples of the relations can be found in Appendix B.

In the experiment, each assertion is regarded as labeled data for training a distant supervision model. Since the Chinese assertions in ConceptNet are generated by online users, their reliability is not guaranteed. Taking the relation AtLocation as an example, among all 32816 assertions only about 37% are valid. Table 4.2 shows examples of correct and incorrect pairs that exist in ConceptNet.

1Knowledge sources of ConceptNet: https://github.com/commonsense/conceptnet5/wiki/Knowledge-sources


Relation: AtLocation
label entity (object) entity (location)

O 教授 (professor) 研究所 (graduate school)

O 員工 (staff) 公司 (company)

O 人民 (people) 台灣 (Taiwan)
O 學生 (student) 學校 (school)
O 病人 (patient) 醫院 (hospital)

O 鯉魚 (carp) 池塘 (pond)

X 美國 (USA) 亞洲 (Asia)

X 程式 (program) 電腦 (computer)

X 太陽 (sun) 夏天 (summer)

X 觀眾 (audience) 電視 (television)

Table 4.2: Example of correct (label “O”) and incorrect (label “X”) pairs representing Chinese commonsense in ConceptNet.

4.1.2 Sinica Corpus

The Sinica Corpus2 is a Chinese corpus that has been developed since 1994. It contains articles from 1981 to 2007 spanning different fields: philosophy, science, society, art, life, and literature. The sources of the articles include news, books, textbooks, magazines, and so on; this diversity means the corpus contains articles in different styles.

Each article is stored as sentences after being segmented at commas, periods, semicolons, question marks, and exclamation marks. In Chinese articles, the usage of the comma is slightly different from English: sometimes the clauses before and after a comma are grammatically independent while still dependent in meaning.

The corpus is divided into about 600,000 sentences, and each sentence is segmented into words. Each word is annotated with a part-of-speech tag. Since the articles of the Sinica Corpus span many years, the phrasing may differ slightly from the wording used nowadays. Even so, the wording is comparatively careful because the sources are formal media. Example sentences are shown in Table 4.3.

The experiment uses these sentences as the corpus. Each sentence is regarded as a list of words, and any two words in a sentence may convey one relation or none.

2中研院漢語平衡語料庫 (Academia Sinica Balanced Corpus of Modern Chinese): http://rocling.iis.sinica.edu.tw/CKIP/engversion/20corpus.htm


Sentence Translation

民族學 (Na) 研究所 (Nc) 應 (D) 主持 (VC) 之 (DE) 「(PARENTHESISCATEGORY) 台灣 (Nc) 與 (Caa) 東南亞 (Nc) 土著 (Na) 文化 (Na) 與 (Caa) 血緣 (Na) 關係 (Na) 」(PARENTHESISCATEGORY) 主題 (Na) 研究 (Nv) 計劃 (Na) 之 (DE) 需要 (Na) ,(COMMACATEGORY)

According to the need for hosting the research project “aboriginal culture and blood relationship between Taiwan and Southeast Asia”, the Institute of Ethnology ...

邀請 (VC) 蘇聯 (Nc) 國家 (Na) 科學院 (Nc) 世界 (Nc) 文學 (Na) 研究所 (Nc) 研究員 (Na) Boris (FW) Parnickel (FW) 教授 (Na) 於 (P) 六月 (Nd) 十九日 (Nd) 至 (P) 廿六日 (Nd) 來訪 (VA) ,(COMMACATEGORY)

... invited the researcher, Professor Boris Parnickel, who comes from the Institute of World Literature of the Russian Academy of Sciences, to visit from 19th to 26th June.

Boris (FW) 教授 (Na) 之 (DE) 專長 (Na) 為 (VG) 東南 (Ncd) ((PARENTHESISCATEGORY) 特別 (VH) 是 (SHI) 馬來亞 (Nc) 及 (Caa) 印尼 (Nc) )(PARENTHESISCATEGORY) 的 (DE) 神話 (Na) 傳說 (Na) 及 (Caa) 民俗 (Na) ,(COMMACATEGORY)

Professor Boris specializes in the legend and folklore in Southeast Asia, especially in Malaysia and Indonesia, ...

在 (P) 此 (Nep) 一 (Neu) 領域 (Na) 已 (D) 有 (V_2) 傑出 (VH) 之 (DE) 研究 (Nv) 成果 (Na) 。(PERIODCATEGORY)

... having excellent achievements in this area.

Table 4.3: Example of sentences in Sinica Corpus.

4.2 Experiment Setting

4.2.1 Multiple Relations

There are 15 relations in Chinese ConceptNet, as shown in Appendix B. Among them, we select four relations and compare the results of relation extraction. The properties of the four relations are listed in Table 4.4. The number of assertions per relation depends on the data collection policy of ConceptNet. The seeds are generated from the ConceptNet assertions and the Sinica Corpus: for each assertion, if the two entities occur together in a sentence, the entity pair is considered a seed for the subsequent labeling process. Since our system learns from bags of instances, to ensure sufficient bag size, only entity pairs occurring more than 10 times are used in practice. These seeds are manually labeled as positive, negative, or invalid. The numbers of positive and negative seeds used in the experiment are also shown in Table 4.4. The last column of Table 4.4 shows the candidate pair type, i.e., the part-of-speech tags of the entity pair, which provides a heuristic for selecting unlabeled pairs from sentences.
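The seed selection step can be sketched as follows (illustrative names; the corpus is assumed to be a list of word-segmented sentences):

from collections import Counter

def select_seeds(assertion_pairs, corpus, min_count=10):
    # Keep ConceptNet pairs whose two entities co-occur in more than min_count sentences.
    counts = Counter()
    for e1, e2 in assertion_pairs:
        for sentence in corpus:
            if e1 in sentence and e2 in sentence:
                counts[(e1, e2)] += 1
    return [pair for pair, c in counts.items() if c > min_count]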
