
Department of Computer Science and Information Engineering

College of Electrical Engineering and Computer Science

National Taiwan University

Master Thesis

Leveraging Linguistic Structures for Named Entity Recognition with Bidirectional Recursive Neural Networks

李朋軒 Peng-Hsuan Li

Advisors: Jane Yung-jen Hsu, Ph.D., and Wei-Yun Ma, Ph.D.

November 2017


Acknowledgments

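I thank Professor Jane Yung-jen Hsu for advising my research. During my undergraduate years, Professor Hsu and my senior labmate 王詩翰 led me in exploring various research problems and techniques and guided me through completing a research project. During my master's program, Professor Hsu gave me feedback on each research direction as well as many suggestions on writing this thesis.

I thank Professor Wei-Yun Ma for advising my research. During my master's program, Professor Ma led me in exploring a research area, guided me toward results on a concrete research topic, and taught me how to write and submit conference papers. To this day, Professor Ma continues to give me advice and feedback on further research directions.

I thank my labmates for their collaboration and discussions, my family, and the many people who have had a positive influence on me as I grew up.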


Abstract

Named Entity Recognition (NER) is an important task that locates proper names in text for downstream tasks, e.g. to facilitate natural language understanding. The problem is often cast from structured prediction of text chunks to sequential labeling of tokens. Such sequential approaches have achieved high performance with models like conditional random fields and recurrent neural networks. However, named entities should be linguistic constituents, and sequential token labeling neglects this information.

In this thesis, we propose a constituency-oriented approach that fully utilizes linguistic structures in text. First, to leverage the prior knowledge of hierarchical phrase structures, we generate parses and alter them into constituency graphs that minimize inconsistencies between parses and named entities. Then, we use Bidirectional Recursive Neural Networks (BRNN) to propagate relevant structure information to each constituent. We use a bottom-up pass to capture the local information and a top-down pass to capture the global information. Experiments show that this approach is comparable to sequential token labeling, and significant improvements can be seen on OntoNotes 5.0 NER, with F1 scores over 87%.


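Chinese Abstract (in translation)

Named entity recognition (NER) is an important task that finds the named entities in text; its output can be provided to downstream tasks such as natural language understanding. The problem is often recast from predicting the text spans where named entities occur to sequentially predicting whether each token is part of some named entity. With models such as CRFs and RNNs, these recast approaches have achieved very good results. However, every named entity should be a syntactic unit, and sequential token prediction neglects this information.

In this thesis, we propose a syntax-oriented approach that makes full use of the linguistic structures in text. To exploit hierarchical phrase structures, we first generate parse trees and alter them into constituency graphs that minimize the inconsistencies between parse trees and named entities. We then use Bidirectional Recursive Neural Networks (BRNN) to propagate relevant structural information to every syntactic unit. We use a bottom-up traversal to gather local information and a top-down traversal to gather global information. Experiments show that this approach is comparable to sequential token labeling and achieves significant improvements on the OntoNotes 5.0 NER corpus, with F1 scores over 87%.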


List of Figures

1.1 The NER task and a sequential labeling approach with BIOES sequential labels (Begin, Inside, Outside, End, Single).

1.2 Linguistic structures and NER.

2.1 Recurrent vs. Recursive NN.

3.1 An NER system.

3.2 Comparing classical sequential labeling NER and constituency-oriented NER.

3.3 High-level overview of the system.

4.1 Types of inconsistencies between Parses and NER.

4.2 A parse tree-derived constituency graph for S₁.

4.3 Applying Algorithm 1 for S₁.

4.4 Wrong grouping for S₁.

4.5 Another example of type-2 inconsistency.

4.6 A naïve pyramid.

4.9 A dependency parse tree with bottom-up links.

4.10 The transformed tree for Figure 4.9. Arrows indicate head nodes.

5.1 Binarized tree for S₁ = (senator, Edward, Kennedy).

5.2 The bottom-up hidden layer applied to node#5 in Figure 5.1.

5.3 The top-down hidden layer applied to node#5 in Figure 5.1.

5.4 The bottom-up hidden layer applied to the graph in Figure 5.1.

5.5 The top-down hidden layer applied to the graph in Figure 5.1.

5.6 The top-down deep hidden layer applied to node i with parent p.

5.7 The output layer applied to node i with left sibling j and right sibling k. The fully-connected layer does not contain non-linear transformations like ReLU.

6.1 The parse tree of a sentence containing White House (FACILITY).

6.2 The parse tree of a sentence containing Koran (WORK OF ART).


List of Tables

5.1 An example of word-to-index mapping. φ represents a non-existent token.

5.2 An example of pos-to-index mapping. φ represents a non-existent parse tag. Added pyramid nodes all have the same special tag.

5.3 Three example lexicons of PER, ORG, and LOC.

6.1 Parameters of BRNN-CNN.

6.2 Dataset statistics for CoNLL-2003.

6.3 Dataset statistics for OntoNotes 5.0.

6.4 Trial range and final settings of hyper-parameters.

6.5 Experiment results on CoNLL-2003.

6.6 Experiment results on OntoNotes. *Finkel and Manning used gold parses in training time.

6.7 Experiment results on different data sources of OntoNotes. *Percentage of NEs that correspond to some constituents in binarized auto parses.

6.10 Performance on OntoNotes before and after binarization.

6.11 Performance of unidirectional and bidirectional models on OntoNotes.


Chapter 1 Introduction

In this chapter, we first introduce the background of the thesis. We then present our motivation and goals. Finally, we briefly summarize the remaining chapters.

1.1 Background

Named Entities (NEs) are text chunks that represent names, and they are sometimes simply referred to as names or entities. The types of names most often targeted for recognition include PERSON, ORGANIZATION, and LOCATION. While in specific domains such as biomedicine each molecule can be seen as a category of named entities, many other categories of general named entities have also been proposed, e.g. WORK OF ART and LAW.

Named Entity Recognition (NER), which can be seen as a combined task of locating and classifying named entities, is an important task of information extraction systems. Recent important benchmark datasets of the general domain include the dataset of the CoNLL 2003 shared task [31] and the dataset of the OntoNotes project [13]. CoNLL 2003 is the Reuters corpus with NER annotations, and OntoNotes 5.0 boasts multilevel annotations, e.g. TreeBank, PropBank, and NE, for diverse sources of texts.

NER problems are often cast from structured prediction of text chunks to sequential labeling of tokens (Figure 1.1). This is done by labeling each token as a part of a named entity chunk, e.g. “Begin Person”. Such approaches achieve high performance on the benchmark datasets [25, 22, 3].

Figure 1.1: The NER task and a sequential labeling approach with BIOES sequential labels (Begin, Inside, Outside, End, Single).

Formulated as a sequential labeling problem, NER can be implemented by models that compute hidden states for each token. These hidden state features are then used to predict the sequential label of a token. Such models include conditional random fields and recurrent neural networks. With both forward and backward directions, bidirectional networks learn how to propagate the information of a token sequence to each token. Bi-LSTM-CNN, a variant of such models, has been shown to achieve state-of-the-art results on both CoNLL 2003 and OntoNotes 5.0 NER [3].

1.2 Motivation

According to analyses, most named entity chunks are actually linguistic constituents, e.g. noun phrases, so linguistic information beyond word order should intuitively be useful (Figure 1.2). However, due to the hierarchical chunking nature of phrase structures, it is intrinsically hard for token-based sequential labeling NER models to take advantage of them. Constituent-based NER models, in turn, face another challenge posed by the inconsistencies between constituency parses and named entities: the recall of such models is capped by the proportion of named entities that correspond to some constituent.

Figure 1.2: Linguistic structures and NER.


Provided that both parse trees and NEs are given, they can be made consistent by flattening the trees and then adding new nodes [10]. However, this condition is not practical for NER systems that are dependent on parses. Instead, for any NER training corpora with or without constituency parse annotations, readily available parsers can be used. It is then desirable to have algorithms that can still alter those parser-generated parses to make them more consistent without actually knowing NE locations.

Additionally, as the approach shifts from token-based sequential labeling NER to constituent-based tree-structured NER, recursive neural networks should be considered. Recurrent neural networks have been shown to be powerful on sequential labeling NER, and recursive networks are their generalization that can operate on tree structures. To capture the relevant information for each token, bidirectional recurrent networks have two passes, for the left and right contexts respectively. Analogously, recursive networks could have a bottom-up pass to capture local information and a top-down pass to capture global information.

1.3 Objective

To leverage linguistic structures in texts for NER, we want to

• Mitigate the inconsistencies between parsing and NER by restructuring algorithms, and

• Utilize prior linguistic structure information with constituent-based Bidirectional Recursive Neural Networks (BRNN).


1.4 Outline of the Thesis

Chapter 2 explores previous work on NER, treatments for the consistency problem, and models closely related to the proposed BRNN-CNN.

Chapter 3 first states the NER problem to solve, and then shows the overview of the system proposed by this thesis. Chapter 4 elaborates the first functional block of the system: constituency graph generation. This chapter covers the construction of linguistic structures, inconsistencies between the structures and NER, and algorithms that mitigate these inconsistencies. Finally, after linguistic structures needed are defined, Chapter 5 formulates the proposed model and the features used.

Chapter 6 documents the evaluation setup for tuning, training, and testing the system on different datasets. Experiment results and analyses of different aspects of the approach are given.

Chapter 7 summarizes the contribution of the thesis as well as possible future research directions.


Chapter 2 Related Work

In this chapter, related research on the NER problem and on neural models is presented first. The last two sections then introduce the two approaches that are most closely related to the thesis.

2.1 NER

Studies of named entity recognition date back to the Message Understanding Conference-6 in 1995 [12]. NER systems on the well-studied MUC-7 dataset [2] have achieved near-human performance (93% against 97%) [21]. However, NER remains an active and challenging research topic to date with various complications. These include more classes of general named entities, more fine-grained categories, an indefinite number of domain-specific types, diverse sources of corpora, crowd-based external knowledge, and joint tasks of related problems [31, 13, 17, 7, 9].

The Conference on Computational Natural Language Learning (CoNLL), organized by SIGNLL, includes a shared Natural Language Processing (NLP) task every year. In 2003, CoNLL held a language-independent NER shared task [31]. Since then, its English corpus, the Reuters Corpus Volume 1 (RCV1) [19] annotated with sequential NE labels, has become a widely used benchmark for recent systems.

OntoNotes, a project that creates multilingual, multi-source, and substantially larger corpora with multilevel annotations, was first described in 2006 [13]. In particular, the multilevel annotations make it possible for systems that tackle different tasks in the NER pipeline to share with, compare with, or depend on one another. In addition, joint models that try to solve multiple problems at once are made possible. For NER, OntoNotes release 5.0 [32] annotates more categories of names and numerical quantities, which have wider coverage and are more fine-grained. Baselines for various tasks, as well as a train-validate-test split that later became the standard, have been established for this final release [24]. In 2012, the dataset was used for the multilingual coreference shared task held by CoNLL, and it has since been gaining popularity as a benchmark for NE-related tasks.

Traditionally, NER is modeled as a token-based sequential labeling problem by breaking each NE chunk into chunk labels. (An example is shown in Figure 1.1.) The most widely used chunk labels include Begin, Inside, Outside, End, and Single. In the well-known BIO chunk labeling, every token that is not Outside any chunk is labeled as Inside unless it is the Beginning of a chunk. In the more complicated BIOES chunk labeling, also known as BILOU (Begin, Inside, Last, Outside, Unit), the last token of a multi-token chunk is additionally labeled as End, and the sole token of a unit-length chunk is labeled as Single. Studies that use the latter chunk labeling scheme have been dominating pure NER tasks (as opposed to joint tasks) [25, 22, 3].
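The BIOES scheme described above can be illustrated with a minimal sketch. The function names, the 0-based half-open span convention, and the example sentence are assumptions made for illustration only; they are not taken from the cited systems.

```python
from typing import List, Tuple

# An entity is (start, end, type) over a half-open token span [start, end),
# using 0-based token positions (an assumption made for this sketch).
Entity = Tuple[int, int, str]


def to_bioes(num_tokens: int, entities: List[Entity]) -> List[str]:
    """Convert non-overlapping entity chunks into per-token BIOES labels."""
    labels = ["O"] * num_tokens                   # Outside by default
    for start, end, etype in entities:
        if end - start == 1:
            labels[start] = f"S-{etype}"          # Single-token chunk
        else:
            labels[start] = f"B-{etype}"          # Begin
            for i in range(start + 1, end - 1):
                labels[i] = f"I-{etype}"          # Inside
            labels[end - 1] = f"E-{etype}"        # End
    return labels


def from_bioes(labels: List[str]) -> List[Entity]:
    """Recover entity chunks from a well-formed BIOES label sequence."""
    entities, start = [], None
    for i, label in enumerate(labels):
        tag, _, etype = label.partition("-")
        if tag == "S":
            entities.append((i, i + 1, etype))
        elif tag == "B":
            start = i
        elif tag == "E" and start is not None:
            entities.append((start, i + 1, etype))
            start = None
    return entities


# Hypothetical example sentence.
tokens = ["senator", "Edward", "Kennedy", "visited", "Boston"]
gold = [(1, 3, "PER"), (4, 5, "LOC")]
labels = to_bioes(len(tokens), gold)
print(list(zip(tokens, labels)))
```

Decoding the labels with `from_bioes` returns the original chunks, which is what makes the cast from chunk prediction to token labeling lossless for non-overlapping entities.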

When a large corpus with multiple annotations such as OntoNotes is available, it becomes possible to construct models that are guided by multilevel information. Intuitively, since all the human-labeled linguistic annotations are sound, models that are trained with more than NER labels should not perform worse than pure NER models. However, these additional human-generated labels are costly and can in practice only be obtained at training time, so they cannot be used as features, only as targets to predict. In other words, to utilize multiple annotations, joint models that tackle several tasks at once must be trained. Such models are generally hard to train successfully despite their intrinsic advantages over pure NER models. Joint systems for NER that were successful at their time include one that jointly predicts NEs while parsing [10], and ones that perform named entity typing, linking, and even coreference at once [9, 20].

2.2 Related Neural Models

Neural Networks (NN), the collection of functions composed of linear combinations and nonlinearities, have been proved able to approximate any continuous function on a closed interval [6]. The universality and empirical results of deep neural networks motivate the construction of end-to-end models that use raw sources of information as features for each domain, e.g. pixels for vision. In 2011, the SENNA system, which uses almost raw words as features, achieved near state-of-the-art performance at its time on various NLP tasks, including part-of-speech (POS) tagging, chunking, NER, and semantic role labeling [5].

The actual raw features in text are the sequences of characters, which means that using a pre-trained word segmentation system already introduces noise and loss of information. However, for synthetic languages that have multiple phonemes per word, like English, training a true end-to-end model is intrinsically hard since each character encodes little semantic information. Still, character-level features have recently been shown to be effective in capturing morphological information inside words and combating the word sparsity problem. State-of-the-art NER results have been achieved for Spanish and Portuguese by using both word and character embeddings [8]. Some models use word segmentation only as boundaries for the computation of character-level word embeddings by convolutional networks. Recurrent neural network language models of this kind outperform word-level baselines for several languages with rich morphology (Arabic, Czech, French, German, Spanish, Russian) [15].

While recurrent neural networks repeatedly apply their hidden layers to a sequence of inputs, recursive neural networks repeatedly apply their hidden layers to a Directed Acyclic Graph (DAG) of inputs. In other words, recursive networks are generalized recurrent networks with a relaxed condition on the dependency of inputs (Figure 2.1). When applied to parse trees, the computed hidden states of each node capture the semantic composition of the corresponding constituent. Thus they have been applied to constructing parses, computing sentence embeddings for sentiment analysis and paraphrase detection, and computing additional features (in a top-down fashion) for tokens of a sequential model [27, 30, 28, 29, 26, 14].

Figure 2.1: Recurrent vs. Recursive NN.

2.3 Constituents and NER

A complete sentence consists of phrases that are organized in a hierarchical structure. The constituency parse of a sentence is a tree of constituent nodes, where constituents are functional units in a sentence, including words, phrases, clauses, etc.

Hence, a named entity should correspond to a constituent, probably a Noun Phrase (NP). A sequential NP-based approach with linear-chain Conditional Random Fields (CRF) has been proposed as part of an ensemble for NER [33]. However, the full potential of constituency structures has not been utilized by the system.

On OntoNotes, both NER and constituency parse annotations are available, so it has been proposed to do NER while parsing [10]. However, the parse and NER annotations were found to be inconsistent. NEs might cross constituent boundaries by consisting of multiple sibling constituents, or even cross tree branches by, for instance, consisting of multiple cousin constituents. These inconsistencies were deemed by the authors of that work to be annotation errors of the parse trees and were resolved by modifying the dataset: some subtrees were flattened and smaller constituents were regrouped according to NER annotations. Then the Context Free Grammar (CFG) for parsing was modified so that each nonterminal, e.g. NP, was further augmented with an NE suffix, e.g. NP-PERSON, NP-LOCATION. A CRF-CFG parser of this grammar was trained on the modified dataset. The method outperformed the same parser with the vanilla grammar on parsing and surpassed a token-based linear-chain CRF on NER.

2.4 Recurrent NN and NER

A hidden layer of a feed-forward neural network computes a hidden vector from the output of its previous layer, except that the first layer computes from the raw feature vector extracted from a sample. A hidden layer of a recurrent neural network, on the other hand, has two input vectors, the additional one being the hidden vector output the last time this very layer was applied. Essentially, such networks learn to propagate useful information from previous samples to the current sample and are suited to classifying sequences of dependent samples. Long Short-Term Memory (LSTM) is one variant of recurrent neural networks that is oftentimes more successful.
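As a compact illustration of this contrast, one standard way to write the two kinds of hidden layers is given below. The symbols W, U, b, the layer index l, and the nonlinearity f are notation assumed here for illustration and are not taken from the thesis or from [3].

```latex
% Feed-forward hidden layer l: depends only on the previous layer's output,
% with the raw feature vector x feeding the first layer.
h^{(l)} = f\left(W^{(l)} h^{(l-1)} + b^{(l)}\right), \qquad h^{(0)} = x

% Recurrent hidden layer at step t: additionally depends on its own output
% from the previous application of the same layer.
h_t = f\left(W x_t + U h_{t-1} + b\right)
```

An LSTM replaces the single update for h_t with a gated update, but keeps the same dependency on x_t and h_{t-1}.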

The current state-of-the-art NER system takes a token-based sequential labeling approach with Bi-LSTM-CNNs [3]. At the core of the model, bidirectional LSTM layers learn to propagate the information of the left and the right contexts of a token respectively. The attached CNN learns to compute character-level features to augment other raw features of a token, e.g. the word embedding. Notably, the authors crafted good lexicon features that record whether a token is seen in the NE lexicons extracted from SENNA and DBpedia [18].

[Figure: timeline from 2001 to 2017 of the works discussed in this chapter, grouped into related NER datasets, related NER systems, and related neural models: Tjong Kim Sang and De Meulder [31], Hovy et al. [13], Pradhan et al. [24], Ratinov and Roth [25], Finkel and Manning [10], Durrett and Klein [9], Luo et al. [20], Chiu and Nichols [3], dos Santos and Guimaraes [8], Collobert et al. [5], Socher et al. [28], Socher et al. [29], and Kim et al. [15].]


Chapter 3 Constituency-Oriented NER

The goal of the thesis is to leverage linguistic structures for NER. To this end, we propose a constituency-oriented approach where a constituency graph is generated for each sentence before structure information is utilized to classify each constituent.

3.1 Problem Statement

Let C be the set of named entity categories. Let S = {S_i} be the set of tokenized sentences. A sentence S_i is a sequence of tokens (S_i1, S_i2, S_i3, . . . , S_in), where n is the number of tokens of S_i. A named entity e = (i, (j, k), c) is the chunk of tokens (S_ij, S_i(j+1), . . . , S_i(k−1)) with type c ∈ C. The NER problem is thus:

Given C,

Input S,

Output a set of named entities E_a.


The ground truth E_g is unknown to a system at test time. The quality of the system is determined by the score F(E_a; E_g), where F is an evaluation function.
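To make the notation concrete, the sketch below represents named entities as tuples (i, (j, k), c) and scores a predicted set against the ground truth. The problem statement only requires F to be an evaluation function; choosing exact-match F1 here is an assumption, motivated by the F1 scores reported in the abstract, and the example entities are hypothetical.

```python
from typing import Set, Tuple

# A named entity e = (i, (j, k), c): sentence index i, tokens j up to but not
# including k, and category c, following the problem statement.
Entity = Tuple[int, Tuple[int, int], str]


def f1_score(predicted: Set[Entity], gold: Set[Entity]) -> float:
    """Exact-match F1: an entity counts only if span and type both match."""
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0                                # also covers empty sets
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hypothetical ground truth and system output.
E_g = {(1, (2, 4), "PER"), (2, (1, 2), "LOC")}
E_a = {(1, (2, 4), "PER"), (2, (1, 3), "LOC")}    # second span is wrong
print(f1_score(E_a, E_g))
```

Because matching is exact, a prediction with the right type but a wrong boundary earns no partial credit, which is why chunk boundaries matter as much as categories.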

Abstractly, the problem is to find an NER system that locates and classifies named entities in text. Additionally, the system operates on a per-sentence basis, assuming sentences are already tokenized. When a user gives such a sentence to the system, the chunks of tokens that belong to some predefined named entity categories, as well as the categories they belong to, are identified. An example is shown in Figure 3.1.

Figure 3.1: An NER system.

3.2 Proposed Solution

In this thesis, a constituency-oriented approach for NER is proposed. Figure 3.2 compares constituency-oriented NER with traditional sequential labeling NER. Notably, NER is split into two stages in the proposed approach, and the method underlying each stage can be swapped independently. In the first stage, a hierarchy of constituents is constructed to take into account additional structural linguistic information. Then in the second stage, constituent-based predictions are made. We suggest that the model underlying the second stage classify each constituent not only by the constituent itself but also by the relevant structures provided by the hierarchy.

Figure 3.2: Comparing classical sequential labeling NER and constituency-oriented NER.

Figure 3.3 shows the components proposed for each stage. In constituency graph generation, a constituency parse tree is first constructed but later altered and augmented. These processes make it no longer a tree, hence the name constituency graph. In constituent classification, hidden state features are computed recursively before classification. This is done by BRNN-CNN, a specially designed recursive network. The actual functional blocks of the system are briefly introduced in the following, and more details are elaborated in later chapters.


Figure 3.3: High-level overview of the system.


The functional blocks in the first stage construct a special directed graph, called a constituency graph, from a given tokenized sentence. Several restructuring algorithms are applied to a base parse tree to form the graph. These algorithms increase the consistency between the constituency graph and the (unknown) named entities of a given sentence while preserving linguistic structures hinted by the parse.

The functional blocks in the second stage are tasked with classifying constituents using a constituency graph. To utilize the structural linguistic information, a special recursive network, BRNN-CNN, is proposed. Crucially, two Directed Acyclic Graphs (DAGs) are formed by considering only bottom-up or only top-down links in the hierarchical constituency graph. Then BRNN-CNN computes two hidden state features for each node recursively. The bottom-up pass captures the semantic composition of local information for each constituent. The top-down pass captures the global information of the structures containing each constituent. Together, the two hidden state features contain the relevant information for the identification of named entity constituents.


Chapter 4 Constituency Graph Generation

The first stage of the proposed approach concerns the construction of a special graph, called a constituency graph, for each given sentence. Initially, a constituency parse tree is constructed, either by direct constituency parsing or by transforming a dependency parse. Then the parse tree is binarized and augmented to form a hierarchical graph. The applied algorithms increase the consistency between the constituency graph and the unknown named entities of a given sentence while preserving linguistic structures hinted by the parse.

4.1 Constituency Graph

A constituency graph of hierarchical nodes is constructed for each given sentence. The graph must meet the following conditions.

• Every node in the graph corresponds to a chunk of tokens in the sentence.


• For every pair of nodes linked by some edges, one of the corresponding chunks contains the other.

Throughout the thesis, the following terms are used for simplicity.

• A chunk of tokens in the sentence that corresponds to some node is called a constituent.

• For every pair of nodes linked by some edges, one is designated the parent and the other the child, such that the parent constituent contains the child constituent.

• An edge is said to be a bottom-up link if it points to the parent, and a top-down link otherwise.

• Nodes whose bottom-up links point to the same parent are called siblings.

• Two siblings whose constituents are right next to each other are called the left sibling and the right sibling of each other, depending on which constituent is on the left and which is on the right.

Intuitively, a constituency graph meeting the requirements can be easily constructed from a constituency parse tree of the sentence. The set of nodes of the graph is the same as that of the tree. The set of edges is formed by adding both a bottom-up and a top-down link between every parent-child pair designated by the tree. The constituent of each leaf node is just one single token, or word, given by the parse. For each internal node, its constituent is the concatenation of the constituents of its children.
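The construction just described can be sketched as follows. The nested-tuple tree encoding, the function name `build_graph`, the preorder node numbering, and the parse tags in the example are assumptions made for illustration; spans follow the (j, k) convention of the problem statement, covering tokens j through k−1.

```python
from typing import Dict, List, Tuple


def build_graph(tree: tuple):
    """Derive a constituency graph from a parse tree.

    The tree is written as (tag, child, child, ...) for internal nodes and
    (tag, "token") for leaves. Returns (nodes, bottom_up, top_down), where
    nodes[n] = (tag, (j, k)), bottom_up holds (child, parent) links, and
    top_down holds (parent, child) links.
    """
    nodes: Dict[int, Tuple[str, Tuple[int, int]]] = {}
    bottom_up: List[Tuple[int, int]] = []
    top_down: List[Tuple[int, int]] = []

    def visit(subtree: tuple, start: int) -> Tuple[int, int]:
        """Add a subtree; return its node id and the index after its span."""
        node_id = len(nodes) + 1                  # preorder numbering from 1
        nodes[node_id] = (subtree[0], (start, start))   # span fixed below
        if isinstance(subtree[1], str):           # leaf: exactly one token
            end = start + 1
        else:                                     # concatenate the children
            end = start
            for child in subtree[1:]:
                child_id, end = visit(child, end)
                bottom_up.append((child_id, node_id))
                top_down.append((node_id, child_id))
        nodes[node_id] = (subtree[0], (start, end))
        return node_id, end

    visit(tree, 1)
    return nodes, bottom_up, top_down


# A flat parse of (senator, Edward, Kennedy); the tags are illustrative.
s1 = ("NP", ("NN", "senator"), ("NNP", "Edward"), ("NNP", "Kennedy"))
nodes, bottom_up, top_down = build_graph(s1)
for node_id, (tag, span) in nodes.items():
    print(node_id, tag, span)
```

Keeping the bottom-up and top-down links in separate lists mirrors the definition above, where each parent-child pair contributes one link in each direction.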


The aforementioned parse tree-derived constituency graph is presumably the correct constituency structure of the sentence given by a parser or an annotator. However, as we shall see, such a naïve graph might not be optimal for NER, and changes may be needed.

4.2 Consistency

A named entity does not necessarily correspond to any constituent of a constituency graph.

Definition 4.1. If a named entity does not correspond to any node in a given constituency graph, i.e. no single constituent equals the named entity chunk, it is said to be inconsistent with the graph.

This happens frequently, even for human-annotated constituency parse trees. In practice, a given parse tree might not be optimal for NER because of various kinds of inconsistent NEs.

Definition 4.2. An inconsistent named entity is said to be type-1 inconsistent if it is the concatenation of sibling constituents; type-2, otherwise.

So practically, a parse tree-derived constituency graph could be optimized for NER by resolving these inconsistencies. To achieve this without actually knowing NE locations, several algorithms are introduced in the next sections.
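Definitions 4.1 and 4.2 can be checked mechanically when NE locations are known, e.g. when analyzing an annotated corpus. The sketch below is one possible implementation under assumed conventions: trees are nested tuples, spans are (j, k) pairs covering tokens j through k−1, and the function names and parse tags are hypothetical.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # tokens j..k-1, 1-based, as in e = (i, (j, k), c)


def spans_and_families(tree: tuple, start: int = 1):
    """Return (span, all_spans, families) for a nested-tuple parse tree.

    all_spans lists every constituent span; families lists, for each
    internal node, the spans of its children in left-to-right order.
    """
    if isinstance(tree[1], str):                  # leaf (tag, "token")
        span = (start, start + 1)
        return span, [span], []
    all_spans: List[Span] = []
    families: List[List[Span]] = []
    child_spans: List[Span] = []
    end = start
    for child in tree[1:]:
        child_span, sub_spans, sub_families = spans_and_families(child, end)
        child_spans.append(child_span)
        all_spans.extend(sub_spans)
        families.extend(sub_families)
        end = child_span[1]
    span = (start, end)
    return span, [span] + all_spans, [child_spans] + families


def classify(tree: tuple, entity_span: Span) -> str:
    """Label an NE span as 'consistent', 'type-1', or 'type-2'."""
    _, all_spans, families = spans_and_families(tree)
    if entity_span in all_spans:
        return "consistent"                       # some node equals the chunk
    for child_spans in families:
        # Siblings tile their parent's span without gaps, so the entity is a
        # concatenation of siblings exactly when it starts where one sibling
        # starts and ends where another sibling ends.
        starts = {s for s, _ in child_spans}
        ends = {e for _, e in child_spans}
        if entity_span[0] in starts and entity_span[1] in ends:
            return "type-1"
    return "type-2"


# Three hypothetical parses of (senator, Edward, Kennedy); the NE is (2, 4).
flat = ("NP", ("NN", "senator"), ("NNP", "Edward"), ("NNP", "Kennedy"))
right = ("NP", ("NN", "senator"), ("NP", ("NNP", "Edward"), ("NNP", "Kennedy")))
left = ("NP", ("NP", ("NN", "senator"), ("NNP", "Edward")), ("NNP", "Kennedy"))
for parse in (flat, right, left):
    print(classify(parse, (2, 4)))
```

The three parses yield type-1, consistent, and type-2 respectively, which previews why the way siblings are grouped matters in the sections that follow.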


(a) Type-1

(b) Type-2

Figure 4.1: Types of inconsistencies between Parses and NER.

4.3 Parse Tree Binarization

4.3.1 Observations of Type-1

Many inconsistent named entities are actually the concatenation of multiple sibling constituents. An example is shown in Figure 4.1a.

For parse tree-derived constituency graphs, type-1 inconsistent named entities occur only when some node has more than two child nodes. To see this, suppose every node has at most two children and some named entity is type-1 inconsistent. The named entity must then be the concatenation of two sibling constituents. Because that concatenated chunk equals the parent constituent, the named entity corresponds to the parent node and is not inconsistent, a contradiction.

A further judgement can be made from the above observation: these type-1 inconsistencies might be seen as minor parse errors or merely as the treebank annotation style. Although the parser does not group some siblings correctly for NER, it does not group them wrongly either. It simply does not group them.

4.3.2 Binarizing Parse Trees

Grouping some siblings correctly resolves type-1 inconsistencies. However, NE locations are unknown to the system. Instead, a linguistically compliant binarization process is applied. Algorithm 1 shows the recursive procedure used by the system. Starting from the root node of a parse tree, the process recursively groups the children of every node that has more than two children. By creating a new child that becomes the parent of all the original children except one at either end, it ensures that every node has at most two children.

Essentially, for each node, the head-driven process groups children recursively around the head one, making it the deepest. The heuristic is that a head constituent is usually modified by its siblings in a near to far fashion. Practically, the head child of a node is determined by a rule-based head finder [4]. The finder decides the head for each production, i.e. the parse tags of a node and its children.

4.3.3 An Example

Let a sentence S₁ = (senator, Edward, Kennedy) and a named entity e₁ = (1, (2, 4), PER). Figure 4.2 shows a parse tree-derived constituency graph for S₁. To facilitate discussion, nodes are numbered, and parse tags, head words, and constituents are abbreviated as pos, head, and const respectively. It is clear that e₁ is inconsistent with the graph because no single node corresponds to the chunk (Edward, Kennedy). However, the chunk is actually the constituent concatenation of the siblings node#3 and node#4.

Algorithm 1 Binarization

function BINARIZE(node)
    n ← node.children.length
    if n > 2 then
        if HEAD-FINDER(node) ≠ node.children[n] then
            newChild ← GROUP(node.children[1..n−1])
            node.children ← [newChild, node.children[n]]
        else
            newChild ← GROUP(node.children[2..n])
            node.children ← [node.children[1], newChild]
        newChild.pos ← node.pos
    for child in node.children do
        BINARIZE(child)

Figure 4.3 shows the application of Algorithm 1 to the parse tree of S₁. With the heuristic that node#3 (Edward) modifies the head node#4 (Kennedy) before node#2 (senator) does, the binarization process successfully adds a new node#5 (Edward Kennedy) that corresponds to e₁. In addition, the newly generated child node is given the same parse tag and head word as its parent.
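The example can be reproduced with a short sketch of Algorithm 1. The `Node` class, the parse tags, and in particular `rightmost_head` are assumptions: the thesis uses a rule-based head finder [4], for which a toy stand-in that always picks the last child is substituted here.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Node:
    pos: str                                  # parse tag
    head: str                                 # head word
    children: List["Node"] = field(default_factory=list)

    def const(self) -> List[str]:
        """Constituent: the tokens covered by this node."""
        if not self.children:
            return [self.head]
        return [tok for child in self.children for tok in child.const()]


def binarize(node: Node, head_finder: Callable[[Node], Node]) -> None:
    """In-place head-driven binarization following Algorithm 1."""
    if len(node.children) > 2:
        if head_finder(node) is not node.children[-1]:
            # Head is not the last child: group all but the last child.
            grouped, rest = node.children[:-1], node.children[-1]
            new_child = Node(node.pos, node.head, grouped)
            node.children = [new_child, rest]
        else:
            # Head is the last child: group all but the first child.
            rest, grouped = node.children[0], node.children[1:]
            new_child = Node(node.pos, node.head, grouped)
            node.children = [rest, new_child]
    for child in node.children:
        binarize(child, head_finder)


def rightmost_head(node: Node) -> Node:
    """Toy stand-in for the rule-based head finder: pick the last child."""
    return node.children[-1]


senator = Node("NN", "senator")
edward = Node("NNP", "Edward")
kennedy = Node("NNP", "Kennedy")
root = Node("NP", "Kennedy", [senator, edward, kennedy])
binarize(root, rightmost_head)
print([child.const() for child in root.children])
```

The new node inherits the parse tag and head word of its parent, as described above, and because it keeps the head among its children, repeated grouping pushes the head to the deepest level.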

Figure 4.2: A parse tree-derived constituency graph for S₁.

Figure 4.3: Applying Algorithm 1 for S₁.

Effectively, binarizing parse trees eliminates type-1 inconsistencies while leaving consistent NEs untouched. In other words, the consistency problem is guaranteed either to be mitigated or, in the extremely unlikely worst case, to stay the same. However, type-1 inconsistent NEs might not be completely resolved, as a wrong grouping of siblings merely turns some type-1 inconsistent NEs into type-2 ones. Figure 4.4 illustrates the situation in which node#3 and node#4 are no longer siblings.


Figure 4.4: Wrong grouping for S₁.

4.4 Pyramid Construction

4.4.1 Observations of Type-2

After binarization, all remaining inconsistent NEs are type-2. This type of inconsistent NE crosses different branches of a parse tree. In other words, a type-2 inconsistent NE is the constituent concatenation of nodes deep down different branches such that they are not siblings. An example is shown in Figure 4.1b.

On one hand, type-2 inconsistencies could be seen as major parse errors. In general, every named entity should correspond to a linguistic constituent or at least the combination of some constituents. The parse tree, however, dictates that some needed constituents of an NE should not be combined at all.

On the other hand, type-2 inconsistent NEs could be seen as ungrammatical against the supposedly correct parsing. This happens when a chunk of tokens fits a name well just by chance. Sometimes the writer or the speaker does not intend to

(40)

group the tokens of a name but rather group some of them with others first.

For example, in Figure 4.1b, the speaker might just want to use Taihang as an adjective for Mountain range. However, Taihang Mountain fits a name too well for NER annotators not to tag it. Note that, in modern Chinese, the name of a mountain almost always ends with Mountain (山). Taihang Mountain (太行山) is such a case.

Another example of coincidence is shown in Figure 4.5. The speaker might want to mention the couple by first grouping their first names Bob and Mary before their shared last name Schindler. But the NER annotators might think that Mary Schindler is too good a proper name not to tag.

Figure 4.5: Another example of type-2 inconsistency.

Whichever point of view is taken, the parser making mistakes or NEs being ungrammatical, this inconsistency cannot be resolved solely by the parser. This is in contrast to the cases of type-1. Without NE information, type-1 inconsistencies are mitigated by using head words determined by parses. For type-2 inconsistencies, trusting the parse is the essence of the problem.

4.4.2 Naïve Pyramids

A so-called naïve pyramid has a node for every possible chunking of nodes. Figure 4.6 shows an example. For each node, its constituent is the concatenation of the leaf node tokens in the sub-pyramid rooted at the node. While there might be no simple syntactic way to restructure, and in a sense fix, the parses without knowing NE locations, this extreme alternative ditches parses and makes sure that no inconsistent NEs exist.

Figure 4.6: A naïve pyramid.

The apparent drawback of a naïve pyramid is that no linguistic structures are present. Instead, it might be better if a parse and a pyramid are combined into one single constituency graph. Figure 4.7 illustrates the idea.

Figure 4.7: A combined graph of a parse tree and a naïve pyramid.

Still, too much information might just behave like a lack of valuable information. The nodes of a pyramid that do not already exist in parses lack linguistic information such as parse tags and head words. Moreover, the number of nodes explodes when a pyramid is added to a parse tree. Suppose there are n tokens in a sentence. For a parse tree, there will be only n leaf nodes, n − 1 2-degree nonterminal nodes, and in practice a few 1-degree nonterminal nodes. However, for a pyramid, there are n + (n − 1) + (n − 2) + · · · + 1 = n(n + 1)/2 nodes in total, which is more than half of n². These overwhelmingly uninformative new nodes make it much harder for models to learn to propagate structural linguistic information. As a result, phrase structures are diluted too much and training becomes impractically slow.
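The node counts above can be checked with a few lines of Python. This is only an illustrative sketch: the function names and the `unary_nodes` parameter are assumptions introduced here, not part of the thesis.

```python
def parse_tree_node_count(n: int, unary_nodes: int = 0) -> int:
    """Nodes in a binary parse tree over n tokens: n leaves, n - 1 2-degree
    nonterminals, plus however many 1-degree nonterminals happen to exist."""
    return n + (n - 1) + unary_nodes


def naive_pyramid_node_count(n: int) -> int:
    """One node per contiguous chunk of tokens: n + (n - 1) + ... + 1."""
    return sum(n - length + 1 for length in range(1, n + 1))


# Compare the two counts for a few sentence lengths.
for n in (5, 20, 40):
    print(n, parse_tree_node_count(n), naive_pyramid_node_count(n))
```

For a 20-token sentence, the tree has 39 nodes (ignoring 1-degree nodes), while the naïve pyramid has 210.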

4.4.3 The Pyramid Addition Method

According to the above reasoning, a novel method is proposed to create a pyramid-added constituency graph. An example is illustrated in Figure 4.8.

Figure 4.8: The pyramid-added constituency graph.

First and foremost, the parse tree is preserved by making old links bypass new pyramid nodes. This way, the untouched parse tree presents valuable linguistic information to a model in the same structure as the parse tree-derived constituency graph. Newly added pyramid nodes act only as information consumers: their constituents consume information from the original parse tree, but are fed only to other new nodes. As a result, only bottom-up links are added, whereas every line in the previous illustrations is bidirectional.

Second, the height of the pyramid is limited to a small constant d. Hence the total number of nodes is bounded by d × n, where n is the number of tokens in the sentence. When d is set to 3, all bigrams and trigrams in the sentence correspond to some nodes in the pyramid-added graph.
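The effect of the height limit can be sketched as follows. The sketch assumes that a pyramid of height d holds one node for every contiguous span of 1 to d tokens. The function name and the example sentence are hypothetical, and the links to the parse tree are not modeled.

```python
def pyramid_spans(tokens, d):
    """Return every contiguous span of 1..d tokens as (start, end) pairs,
    with end exclusive; each span stands for one pyramid node."""
    n = len(tokens)
    return [(start, start + length)
            for length in range(1, d + 1)
            for start in range(n - length + 1)]


tokens = ["senator", "Edward", "Kennedy", "spoke", "today"]
spans = pyramid_spans(tokens, d=3)

# The number of spans never exceeds d * n.
print(len(spans), "nodes; bound d * n =", 3 * len(tokens))
# All bigrams of the sentence are covered.
print([" ".join(tokens[s:e]) for s, e in spans if e - s == 2])
```

With n = 5 and d = 3 the sketch yields 5 + 4 + 3 = 12 spans, below the bound of 15.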

In summary, the proposed procedure of adding pyramid nodes significantly increases consistency while preserving linguistic structures. This is done by focusing on predicting additional short named entities, which are actually most NEs, with the aid of untouched parse trees.


4.5 Dependency Transformation

An alternative to constituency parsing for constructing a base tree is dependency parsing. However, a dependency parse is not a hierarchy of phrases. Instead, every node in such a parse corresponds to a distinct token and tagged edges give the modification relationship, or dependency, between tokens. Figure 4.9 shows a simple example. Note that it is still a tree, with every arrow pointing from a child to its parent.

Figure 4.9: A dependency parse tree with bottom-up links.

To use a dependency parse to construct a base tree, a dependency-to-constituency transformation must be applied. Now one strength of a constituency graph is that no strict grammar is required. As long as a consistent hierarchy of nodes is present, underlying models could try to learn from the structures. Algorithm 2 gives the recursive procedure used in the thesis to obtain a hierarchy of token chunks from dependencies.

Algorithm 2 Dependency Transformation

function TRANSFORM(node)
    root ← node
    for child in CHILD-QUEUE(node) do
        childRoot ← TRANSFORM(child)
        if child.token.index < node.token.index then
            root ← GROUP([childRoot, root])
        else
            root ← GROUP([root, childRoot])
        root.pos ← RELATION(node, child)
    return root

The process transforms a dependency parse into a constituency parse by recursively making a new root node out of each dependency relation. Originally, in a dependency parse, every node corresponds to a distinct token. In the process, this notion is generalized so that every token corresponds to a subtree headed by it. For each pair of tokens in an unprocessed relation, their current respective subtrees are grouped together under a new root node. The parent token in the original dependency parse then corresponds to the newly grown subtree because it is the head token. Figure 4.10 shows the transformed tree of Figure 4.9. A transformed tree is naturally binary, and dependency links determine head child nodes and parse tags.

A detail hidden in CHILD-QUEUE is how to decide the processing order of relations that share the same head token. The thesis uses the heuristic that a named entity is often centered around a token which is modified in a left-to-right and near-to-far fashion. For instance, a noun is often modified in the order of adjectives, a determiner, and a clause.
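The following Python sketch shows one possible reading of Algorithm 2. The `DepNode` and `Constituent` containers are hypothetical. The ordering in `child_queue` (left modifiers before right ones, nearest first on each side) is an assumed interpretation of the left-to-right, near-to-far heuristic. The example sentence and its `compound` relation tags are illustrative and not taken from Figure 4.9.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DepNode:
    """A token in a dependency parse (hypothetical container)."""
    index: int                       # position of the token in the sentence
    word: str
    relation: Optional[str] = None   # tag of the edge to this token's parent
    children: List["DepNode"] = field(default_factory=list)


@dataclass
class Constituent:
    """A node of the transformed binary tree."""
    words: List[str]
    head: str
    pos: Optional[str] = None
    children: List["Constituent"] = field(default_factory=list)


def child_queue(node: DepNode) -> List[DepNode]:
    # Assumed ordering: left modifiers first, then right ones, nearest first.
    left = sorted((c for c in node.children if c.index < node.index),
                  key=lambda c: node.index - c.index)
    right = sorted((c for c in node.children if c.index > node.index),
                   key=lambda c: c.index - node.index)
    return left + right


def transform(node: DepNode) -> Constituent:
    root = Constituent(words=[node.word], head=node.word)
    for child in child_queue(node):
        child_root = transform(child)
        # Keep the surface order of the two grouped subtrees.
        pair = [child_root, root] if child.index < node.index else [root, child_root]
        # The new root keeps the parent token as head and takes the relation as tag.
        root = Constituent(words=pair[0].words + pair[1].words,
                           head=node.word, pos=child.relation, children=pair)
    return root


kennedy = DepNode(2, "Kennedy", children=[
    DepNode(0, "senator", "compound"),
    DepNode(1, "Edward", "compound"),
])
tree = transform(kennedy)
print(tree.words)              # whole sentence
print(tree.children[1].words)  # inner constituent grouped first
```

Under these assumptions, Edward is grouped with Kennedy before senator is attached, so the transformed tree contains a node for Edward Kennedy.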


Figure 4.10: The transformed tree for Figure 4.9. Arrows indicate head nodes.

Effectively, a dependency parser plus the proposed transformation algorithm could take on the role of generating parse tree-derived constituency graphs. This is an alternative to the functional block of constituency parsing in Figure 3.3. The alternative could prove to be useful if a good dependency parser is available.


Chapter 5 Constituent Classification

The second stage of the proposed approach is tasked with classifying constituents by a constituency graph. To utilize relevant constituent structures in classifying each constituent, we propose to use a special recursive network, BRNN-CNN, as the model underlying the stage. By following bottom-up links, the model captures the semantic composition of local information for each constituent. By following top-down links, the global information of the structures containing each constituent is captured. Together, the bidirectional passes propagate relevant information on a constituency graph for the identification of named entity constituents.

5.1 Feature Extraction

For every node in a constituency graph, local features are drawn from itself, its left sibling, and its right sibling. Features in use include words, head words, and parse tags. However, features are not always available, for the following reasons.

• Absent siblings,

• Absent words for nonterminal nodes, and

• Absent words and head words for added pyramid nodes.

Should these cases happen, dummy feature values are used.

In addition, the total number of tokens in a sentence is used as a global feature.

5.1.1 Word-Level Features

For word-level features, a function N₁ is first defined such that it maps each distinct token to a distinct index. Suppose there are n distinct tokens in the corpus. Then their indices are from 1 to n. The special index 0 is reserved for non-existent tokens.

With the index mapping defined, the word-level features are extracted as follows. For each node i with left sibling j and right sibling k, its word-level features

x_i = (N₁(word_i), N₁(head_i), N₁(head_j), N₁(head_k))

where word_i denotes the word of i, and head_i, head_j and head_k denote the head words of i, j, and k respectively.

For example, suppose the word-to-index mapping used is shown in Table 5.1.


For node#5 in Figure 5.1, its word-level features

x₅ = (N₁(word₅), N₁(head₅), N₁(head₂), N₁(head_φ))
   = (N₁(φ), N₁(Kennedy), N₁(senator), N₁(φ))
   = (0, 3, 1, 0)

where φ represents non-existent nodes and tokens.

x        φ    senator    Edward    Kennedy    Bob    and    Mary    Schindler
N₁(x)    0    1          2         3          4      5      6       7

Table 5.1: An example of word-to-index mapping. φ represents a non-existent token.
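The word-level example for node#5 can be reproduced with a short Python sketch. The dictionary-based node representation and the function name are assumptions made for illustration. `None` plays the role of φ.

```python
# Word-to-index mapping N1 from Table 5.1; None stands for a non-existent token.
N1 = {None: 0, "senator": 1, "Edward": 2, "Kennedy": 3,
      "Bob": 4, "and": 5, "Mary": 6, "Schindler": 7}


def word_level_features(node, left, right):
    """Each argument is a dict with 'word' and 'head' keys, or None for an
    absent sibling. Returns (N1(word_i), N1(head_i), N1(head_j), N1(head_k))."""
    def head(n):
        return n.get("head") if n else None

    return (N1[node.get("word")], N1[head(node)], N1[head(left)], N1[head(right)])


node5 = {"word": None, "head": "Kennedy"}        # nonterminal: no word of its own
node2 = {"word": "senator", "head": "senator"}   # left sibling of node#5
x5 = word_level_features(node5, node2, None)     # node#5 has no right sibling
print(x5)
```

The printed tuple is (0, 3, 1, 0), matching the worked example.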

Figure 5.1: Binarized tree for S₁ = (senator, Edward, Kennedy).


5.1.2 Character-Level Features

For character-level features, every word is treated as a sequence of characters, and when a token is non-existent, an empty sequence is used as its character sequence. To facilitate batch computing, words are preprocessed so that they are uniform in length. This is achieved with the aid of a special padding character. In addition, special start and end characters are used to provide boundary information for shift-window models. Algorithm 3 shows the steps of the process. Effectively, for words that are too long, their trailing characters are cut off before start is prepended and end is appended. Conversely, for short words, start and end are added before additional paddings are appended. The uniform length is set to 20, with which the completeness of most dictionary words is preserved and the noisy tails of long tokens such as web addresses are truncated.

Algorithm 3 Unification

function UNIFY(word)
    word ← [start] + word[1..18] + [end]
    word ← word + [padding] × (20 − word.length)
    return word

With preprocessed words, a function N₂ is defined such that it maps each distinct character to a distinct index. Suppose there are n distinct characters in the corpus. Then their indices are from 3 to n+2. The indices 0, 1, and 2 are reserved for the characters padding, end, and start respectively. For simplicity, the notation is abused so that N₂ can also represent a procedure that takes a character sequence c and returns the sequence of mapped indices of UNIFY(c).


With the procedure defined, the character-level features are extracted as follows. For each node i with left sibling j and right sibling k, its character-level features

c_i = (N₂(word_i), N₂(head_i), N₂(head_j), N₂(head_k))

where word_i denotes the word of i, and head_i, head_j and head_k denote the head words of i, j, and k respectively.

For example, suppose characters a, . . . , z, A, . . . , Z are mapped to 3, . . . , 28, 29, . . . , 54 and the uniform length is set to 5 for the purpose of demonstration. Then senator, Kennedy, and non-existent tokens are unified to (start,s,e,n,end), (start,K,e,n,end), and (start,end,padding,padding,padding) respectively. For node#5 in Figure 5.1, its character-level features

c₅ = (N₂(word₅), N₂(head₅), N₂(head₂), N₂(head_φ))
   = (N₂(φ), N₂(Kennedy), N₂(senator), N₂(φ))
   = ((2, 1, 0, 0, 0), (2, 39, 7, 16, 1), (2, 21, 7, 16, 1), (2, 1, 0, 0, 0))

where φ represents non-existent nodes and tokens.
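Algorithm 3 and the mapping N₂ can be sketched in Python as below. The marker strings `<s>`, `</s>`, and `<pad>` and the restriction to ASCII letters are assumptions made for the demonstration. A real corpus would need an index for every character it contains.

```python
import string

PADDING, END, START = 0, 1, 2   # reserved indices, as stated in the text

# Demonstration alphabet from the example: a-z -> 3..28, A-Z -> 29..54.
CHAR_INDEX = {ch: i + 3
              for i, ch in enumerate(string.ascii_lowercase + string.ascii_uppercase)}
SPECIAL = {"<s>": START, "</s>": END, "<pad>": PADDING}


def unify(word, length=20):
    """Algorithm 3: truncate, wrap in start/end markers, then pad to `length`."""
    body = list(word)[:length - 2]
    unified = ["<s>"] + body + ["</s>"]
    return unified + ["<pad>"] * (length - len(unified))


def n2(word, length=20):
    """Map a word (empty string for a non-existent token) to character indices."""
    return tuple(SPECIAL[c] if c in SPECIAL else CHAR_INDEX[c]
                 for c in unify(word, length))


# Uniform length 5, as in the worked example.
print(n2("Kennedy", 5))
print(n2("senator", 5))
print(n2("", 5))
```

With the uniform length set to 5, the three calls give (2, 39, 7, 16, 1), (2, 21, 7, 16, 1), and (2, 1, 0, 0, 0), in agreement with c₅.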

5.1.3 Parse Tag Features

For parse tag features, a function N₃ is defined such that it maps each distinct parse tag to a distinct index. Suppose there are n distinct parse tags in the grammar. Then their indices are from 1 to n. Non-existent parse tags are mapped to index 0.

For each node i with left sibling j and right sibling k, its parse tag features

p_i = (N₃(pos_i), N₃(pos_j), N₃(pos_k))

where pos_i, pos_j, and pos_k denote the parse tags of i, j, and k respectively.

For example, suppose the pos-to-index mapping used is shown in Table 5.2. For node#5 in Figure 5.1, its parse tag features

p₅ = (N₃(pos₅), N₃(pos₂), N₃(pos_φ))
   = (N₃(NP), N₃(NNP), N₃(φ))
   = (5, 3, 0)

where φ represents non-existent nodes and parse tags.

p        φ    NN    NNS    NNP    NNPS    NP    PYRAMID
N₃(p)    0    1     2      3      4       5     6

Table 5.2: An example of pos-to-index mapping. φ represents a non-existent parse tag. Added pyramid nodes all have the same special tag.

5.1.4 Lexicon Hit Features

In addition to the aforementioned features derived from parse trees, lexicon hit features are introduced from external lexicon resources. For each node, there is one feature per lexicon. If the constituent of the node can be found in a lexicon, the corresponding feature value is set to 1 (hit); otherwise it is set to 0 (not hit). All phrases are lower-cased before being compared.

For example, suppose there are three lexicons shown in Table 5.3. Then for node#5 in Figure 5.1, its lexicon hit features lex₅ = (1, 0, 0) because its constituent, Edward Kennedy, is found in the first lexicon.

Lexicon    Entries
PER        donald rumsfeld, edward kennedy, finkel, wesley
ORG        european broadcasting union, oxford health, the new york times, us airways group inc
LOC        angelus oaks, california city, london mills, waite park

Table 5.3: Three example lexicons of PER, ORG, and LOC.
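A minimal sketch of the lexicon hit lookup is given below. It uses only a few entries excerpted from Table 5.3. The set-based storage and the function name are assumptions, not the thesis implementation.

```python
# Excerpts of the example lexicons in Table 5.3 (entries are lower-cased).
LEXICONS = [
    ("PER", {"donald rumsfeld", "edward kennedy"}),
    ("ORG", {"oxford health", "the new york times"}),
    ("LOC", {"california city", "waite park"}),
]


def lexicon_hit_features(constituent):
    """One binary feature per lexicon: 1 if the lower-cased constituent
    is an entry of that lexicon, 0 otherwise."""
    phrase = constituent.lower()
    return tuple(1 if phrase in entries else 0 for _, entries in LEXICONS)


lex5 = lexicon_hit_features("Edward Kennedy")
print(lex5)
```

For the constituent of node#5 the result is (1, 0, 0), as in the example above.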

5.2 BRNN-CNN

A special recursive neural network, BRNN-CNN, is proposed to classify each constituent from a constituency graph. To classify a node, BRNN-CNN does not only consider its features. Instead, BRNN-CNN considers the hidden states of the node, which are recursively computed from the features of its relevant linguistic structures.

To recursively compute hidden states, constituents must be structured as a directed acyclic graph. Then, BRNN-CNN could repeatedly apply its hidden layers from the sources to the sinks of the DAG. For each sentence, two DAGs are formed from its hierarchical constituency graph. One is formed by taking all the nodes and bottom-up links, and the other uses top-down links instead.


Because local information is propagated bottom-up according to constituency structures, bottom-up hidden states capture the semantic composition of each constituent. On the other hand, the features of ancestors and their siblings are propagated down to each descendant, hence top-down hidden states capture global information for each node. Together, these hidden states of each node contain the structure information of a sentence relevant to the classification of the corresponding constituent.

5.2.1 Input Layer

The extracted features, described in the previous section, are first processed into real-valued vectors by the input layer of BRNN-CNN. For each node, its feature vector is the concatenation of vectors representing word-level, character-level, parse tag, and lexicon hit features.

Word-Level Vector

Suppose there are n distinct tokens in the corpus and the desired word embedding dimension is d_x. Then BRNN-CNN stores a word embedding look-up table W_x, which is an n-by-d_x real-valued matrix. Effectively, every row of W_x represents a word embedding.

Recall that each word-level feature is simply a word index. BRNN-CNN transforms each word index into an n-dimensional one-hot vector, except that the special index 0 is transformed into a zero vector. The vector is then multiplied by W_x to retrieve the word embedding. Finally, for a node i with word-level features x_i, a vector X_i is formed by concatenating the embeddings of the 4 words in x_i. For example, suppose d_x = 2 and the word embedding look-up table

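\[
W_x = \begin{bmatrix} 11 & 11 \\ 22 & 22 \\ 33 & 33 \end{bmatrix}.
\]

Then the word embedding of word index 3 is computed by

\[
\begin{bmatrix} 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} 11 & 11 \\ 22 & 22 \\ 33 & 33 \end{bmatrix}
= \begin{bmatrix} 33 & 33 \end{bmatrix}.
\]

And for word-level features x_i = (0, 3, 1, 0), X_i = (0, 0, 33, 33, 11, 11, 0, 0).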
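The look-up can be traced in plain Python. The sketch below mirrors the one-hot multiplication literally for clarity. The function names are assumptions, and a practical implementation would index the table rows directly instead of multiplying.

```python
W_x = [[11, 11],
       [22, 22],
       [33, 33]]   # row r holds the embedding of word index r + 1


def one_hot(index, n):
    """n-dimensional one-hot vector for indices 1..n; index 0 gives all zeros."""
    return [1 if position == index else 0 for position in range(1, n + 1)]


def embed(index, table):
    """Multiply the one-hot row vector by the look-up table."""
    vector = one_hot(index, len(table))
    return [sum(vector[r] * table[r][c] for r in range(len(table)))
            for c in range(len(table[0]))]


def word_level_vector(x_i, table):
    """Concatenate the embeddings of the 4 word indices in x_i."""
    return [value for index in x_i for value in embed(index, table)]


X_i = word_level_vector((0, 3, 1, 0), W_x)
print(X_i)
```

The result is the 8-dimensional vector (0, 0, 33, 33, 11, 11, 0, 0) from the example.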

Character-Level Vector

Computing a character-level vector for a node is more complex than computing the other vectors in the input layer. In a nutshell, BRNN-CNN forms a matrix from the character sequence of each token and puts it through a series of convolution, max-pooling, and highway layers. These computations constitute the CNN part of the model, whereas the so-called input layer is actually the input layer of BRNN.

The first step is to form a character-level feature matrix for each token. Recall that the character-level features of each token are a sequence of character indices. BRNN-CNN forms a one-hot vector for each character index except for the index 0, which is transformed into a zero vector. These character vectors are then put together into an m-by-n matrix, where m is the uniform word length and n is the number of distinct characters in the corpus plus end and start.


Then sub-word patterns are captured by putting the character-level feature matrix through multiple convolution kernels in parallel. Kernels might have different heights as their window sizes, but their widths must be n. For example, if a kernel has height h, the convolutional layer will compute an (m−h+1)-by-1 feature map. The feature map is then max-pooled into a scalar, representing the signal strength of a sub-word pattern of length h. Finally, the results of all the kernels are put into a vector u, whose length d_c is the number of kernels.
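The convolution and max-pooling steps can be illustrated with a toy pure-Python sketch. It assumes kernels without bias terms or activation functions, since none are described here. The tiny alphabet (n = 4: end, start, and two ordinary characters) and the hand-written kernels are hypothetical.

```python
def one_hot_rows(indices, n):
    """m-by-n matrix: index 0 (padding) becomes a zero row,
    and index k >= 1 sets column k - 1."""
    return [[1.0 if k == col + 1 else 0.0 for col in range(n)] for k in indices]


def conv_max_pool(matrix, kernel):
    """Slide an h-by-n kernel down an m-by-n matrix, producing a feature map
    with m - h + 1 entries, then max-pool the map into one scalar."""
    m, h, n = len(matrix), len(kernel), len(kernel[0])
    feature_map = [
        sum(kernel[r][c] * matrix[top + r][c] for r in range(h) for c in range(n))
        for top in range(m - h + 1)
    ]
    return max(feature_map)


def char_cnn(indices, kernels, n):
    """Return u, with one max-pooled value per kernel (so d_c = len(kernels))."""
    matrix = one_hot_rows(indices, n)
    return [conv_max_pool(matrix, kernel) for kernel in kernels]


# Toy token of uniform length m = 5: start, char 3, char 4, end, padding.
indices = (2, 3, 4, 1, 0)
kernels = [
    [[0, 0, 1, 0], [0, 0, 0, 1]],   # height 2: fires on "char 3 then char 4"
    [[0, 1, 0, 0]],                 # height 1: fires on the start character
]
u = char_cnn(indices, kernels, n=4)
print(u)
```

Here the bigram kernel responds with strength 2.0 and the unigram kernel with 1.0, so u has length d_c = 2.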

In addition, as suggested by Kim et al. [15], the CNN-computed character-level feature vector u of each token is put through an additional highway layer. The final character-level vector of a token, called v, is computed as follows.

\[
\begin{aligned}
t &= \sigma(W_t u + b_t) \\
v' &= \mathrm{ReLU}(W_v u + b_v) \\
v &= (1 - t) \odot u + t \odot v'
\end{aligned}
\]

Here σ(x) = 1/(1 + e^{−x}) is the sigmoid function, ReLU(x) = max(0, x), and ⊙ denotes element-wise multiplication. W_t and W_v are d_c-by-d_c square weight matrices, and b_t and b_v are d_c-dimensional bias vectors.

Essentially, the final vector v is a weighted sum of its input u and the non-linear transformation v′ of u. By initializing b_t to a negative value, the layer initially sends its input directly to the output, like a highway.
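The highway layer can be sketched in a few lines of Python. The helper names and the numeric values are illustrative assumptions. The example uses a strongly negative b_t to show the pass-through behavior described above.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def affine(W, u, b):
    """Compute W u + b for a square matrix W given as a list of rows."""
    return [sum(w * x for w, x in zip(row, u)) + bias for row, bias in zip(W, b)]


def highway(u, W_t, b_t, W_v, b_v):
    t = [sigmoid(z) for z in affine(W_t, u, b_t)]            # transform gate
    v_prime = [max(0.0, z) for z in affine(W_v, u, b_v)]     # ReLU transformation
    # Element-wise mix of the input u and its transformation.
    return [(1 - ti) * ui + ti * vi for ti, ui, vi in zip(t, u, v_prime)]


u = [2.0, 1.0]
zeros = [[0.0, 0.0], [0.0, 0.0]]
# A very negative gate bias closes the gate, so the output is close to u.
v = highway(u, W_t=zeros, b_t=[-20.0, -20.0], W_v=zeros, b_v=[0.0, 0.0])
print(v)
```

With the gate closed, v stays within rounding distance of u. As training moves b_t and W_t, the gate can open and let the transformed signal through.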

In summary, for each node i, BRNN-CNN computes a character-level vector for each of the 4 character index sequences in c_i. These 4 vectors are then concatenated to form the character-level vector of the node, called C_i, for the input layer.


Parse Tag Vector

A parse tag vector is computed for each node i. Suppose there are d_p distinct parse tags in the corpus. For each parse tag index in p_i, a d_p-dimensional one-hot vector is formed. The exception is that index 0 is transformed into a zero vector. These vectors are concatenated into a long vector, called P_i, for the input layer.

Global Word Feature Vector

Aside from a constituent itself and its ancestors, the whole sentence provides useful additional information for NER. BRNN-CNN averages the word-level vectors of all tokens in a sentence as M_x. Similarly, the character-level vectors of all the tokens in a sentence are averaged to get another mean embedding M_c. These two vectors are used for every node in the constituency parse of the sentence as global knowledge.

Input Layer Vector

Finally, for each node i, BRNN-CNN computes its input layer by Equation 5.1.

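\[
I_i = X_i \,\|\, C_i \,\|\, P_i \,\|\, \mathit{lex}_i \,\|\, M_x \,\|\, M_c \tag{5.1}
\]

where ∥ denotes vector concatenation. The dimension of I_i is given by

\[
d_I = 4 d_x + 4 d_c + 3 d_p + d_{lex} + d_x + d_c
\]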

where d_lex is the number of lexicons.
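As a quick arithmetic check of the dimension formula, the following sketch concatenates dummy vectors of the right sizes. The concrete sizes (d_x = 2, d_c = 2, d_p = 6, d_lex = 3) are hypothetical values chosen only for the illustration.

```python
def input_dimension(d_x, d_c, d_p, d_lex):
    """d_I = 4 d_x + 4 d_c + 3 d_p + d_lex + d_x + d_c."""
    return 4 * d_x + 4 * d_c + 3 * d_p + d_lex + d_x + d_c


def input_vector(X_i, C_i, P_i, lex_i, M_x, M_c):
    """Equation 5.1: concatenate the six parts into one vector."""
    return [*X_i, *C_i, *P_i, *lex_i, *M_x, *M_c]


d_x, d_c, d_p, d_lex = 2, 2, 6, 3
I_i = input_vector([0.0] * (4 * d_x), [0.0] * (4 * d_c), [0.0] * (3 * d_p),
                   [0.0] * d_lex, [0.0] * d_x, [0.0] * d_c)
print(len(I_i), input_dimension(d_x, d_c, d_p, d_lex))
```

Both numbers are 41 for these sizes.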


5.2.2 Hidden Layers

Having computed the input layer for every node on a constituency graph, BRNN-CNN recursively computes two hidden states for every node.

For each node i with the set of child nodes N and the parent p, the bottom-up hidden vector H_bot,i and top-down hidden vector H_top,i are recursively computed by Equation 5.2 and Equation 5.3 respectively.

\[
H_{bot,i} = \mathrm{ReLU}\Big(\big(I_i \,\|\, \sum_{j \in N} H_{bot,j}\big) W_{bot} + b_{bot}\Big) \tag{5.2}
\]

\[
H_{top,i} = \mathrm{ReLU}\big((I_i \,\|\, H_{top,p}) W_{top} + b_{top}\big) \tag{5.3}
\]

The function ReLU(x) represents max(0, x), and ∥ denotes vector concatenation. Suppose d_H is the desired hidden feature dimension. Then W_bot and W_top are (d_I+d_H)-by-d_H weight matrices, and b_bot and b_top are d_H-dimensional bias vectors. If the set of child nodes is empty, a d_H-dimensional zero vector is used instead of Σ_{j∈N} H_bot,j. Similarly, if i has no parent, a zero vector is used instead of H_top,p.

The computations of the bottom-up and top-down hidden states of node#5 in Figure 5.1 are illustrated in Figures 5.2 and 5.3. The full bottom-up and top-down passes on the constituency graph in Figure 5.1 are illustrated in Figure 5.4 and Figure 5.5.
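The two recursive passes of Equations 5.2 and 5.3 can be sketched in pure Python on a small tree. The sketch handles plain trees only (each node has one parent), so it does not cover added pyramid nodes. The node names, the 1-dimensional inputs, and the all-ones weights are toy assumptions chosen so that the numbers are easy to follow.

```python
def layer(inp, W, b):
    """ReLU(inp W + b) for a row vector inp and a len(inp)-by-d_H matrix W."""
    return [max(0.0, sum(inp[r] * W[r][c] for r in range(len(inp))) + b[c])
            for c in range(len(b))]


def bottom_up(node, I, children, W_bot, b_bot, H_bot):
    """Equation 5.2: concatenate I_i with the sum of the children's states."""
    total = [0.0] * len(b_bot)          # zero vector when there are no children
    for child in children.get(node, []):
        bottom_up(child, I, children, W_bot, b_bot, H_bot)
        total = [a + c for a, c in zip(total, H_bot[child])]
    H_bot[node] = layer(I[node] + total, W_bot, b_bot)


def top_down(node, I, children, W_top, b_top, H_top, parent_state=None):
    """Equation 5.3: concatenate I_i with the parent's state."""
    if parent_state is None:            # the root has no parent
        parent_state = [0.0] * len(b_top)
    H_top[node] = layer(I[node] + parent_state, W_top, b_top)
    for child in children.get(node, []):
        top_down(child, I, children, W_top, b_top, H_top, H_top[node])


# Toy binarized tree for S1 with d_I = d_H = 1.
children = {"root": ["senator", "EK"], "EK": ["Edward", "Kennedy"]}
I = {name: [1.0] for name in ("root", "senator", "EK", "Edward", "Kennedy")}
W, b = [[1.0], [1.0]], [0.0]            # (d_I + d_H)-by-d_H weights, zero bias

H_bot, H_top = {}, {}
bottom_up("root", I, children, W, b, H_bot)
top_down("root", I, children, W, b, H_top)
print(H_bot)
print(H_top)
```

In this toy setting the bottom-up state of the root accumulates information from all leaves (5.0), while the top-down state of each leaf accumulates information from its ancestors (3.0 for Edward and Kennedy).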


Figure 5.2: The bottom-up hidden layer applied to node#5 in Figure 5.1.

Figure 5.3: The top-down hidden layer applied to node#5 in Figure 5.1.


Figure 5.4: The bottom-up hidden layer applied to the graph in Figure 5.1.

Figure 5.5: The top-down hidden layer applied to the graph in Figure 5.1.


Deep Hidden Layers

There can be more than one recursive hidden layer in BRNN-CNN. A one-layer recursive neural network is already deep in the sense of recursion: the hidden layer is stacked as many times as the height of the input hierarchy. Having multiple hidden layers, however, makes more powerful transformations between two neighboring nodes possible.

Suppose there are 3 hidden layers with desired dimension d_H1, d_H2, and d_H3. For each node i with parent node p, the top-down hidden state features can be computed by the following.

\[
\begin{aligned}
H_{top,i,1} &= \mathrm{ReLU}\big((I_i \,\|\, H_{top,p,1}) W_{top,1} + b_{top,1}\big) \\
H_{top,i,2} &= \mathrm{ReLU}\big((H_{top,i,1} \,\|\, H_{top,p,2}) W_{top,2} + b_{top,2}\big) \\
H_{top,i,3} &= \mathrm{ReLU}\big((H_{top,i,2} \,\|\, H_{top,p,3}) W_{top,3} + b_{top,3}\big)
\end{aligned}
\]

ReLU(x) = max(0, x). W_top,1, W_top,2, and W_top,3 are (d_I+d_H1)-by-d_H1, (d_H1+d_H2)-by-d_H2, and (d_H2+d_H3)-by-d_H3 weight matrices respectively. b_top,1, b_top,2, and b_top,3 are d_H1-, d_H2-, and d_H3-dimensional bias vectors respectively. Figure 5.6 illustrates the three-layered computation. The bottom-up direction is generalized similarly.

Figure 5.6: The top-down deep hidden layer applied to node i with parent p.
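The stacking of the three top-down layers at a single node can be sketched as follows. The function names, the dimensions (d_I = 2, d_H1 = 3, d_H2 = 2, d_H3 = 1), and the all-ones weights are hypothetical choices that make the shapes and values easy to verify.

```python
def relu_layer(inp, W, b):
    """ReLU(inp W + b) for a row vector inp and a len(inp)-by-len(b) matrix W."""
    return [max(0.0, sum(inp[r] * W[r][c] for r in range(len(inp))) + b[c])
            for c in range(len(b))]


def deep_top_down_states(I_i, parent_states, weights, biases):
    """parent_states[l] holds the parent's state at layer l + 1 (zeros for the
    root). Returns the states of node i at every layer, lowest layer first."""
    states, below = [], I_i
    for H_parent, W, b in zip(parent_states, weights, biases):
        # Each layer sees the layer below at node i and the same layer at p.
        assert len(W) == len(below) + len(H_parent)
        below = relu_layer(below + H_parent, W, b)
        states.append(below)
    return states


def ones(rows, cols):
    return [[1.0] * cols for _ in range(rows)]


d_I, d_H = 2, (3, 2, 1)
weights = [ones(d_I + d_H[0], d_H[0]),
           ones(d_H[0] + d_H[1], d_H[1]),
           ones(d_H[1] + d_H[2], d_H[2])]
biases = [[0.0] * d for d in d_H]
root_parent = [[0.0] * d for d in d_H]   # the root has no parent

states = deep_top_down_states([1.0, 1.0], root_parent, weights, biases)
print(states)
```

The three returned states have dimensions 3, 2, and 1, matching d_H1, d_H2, and d_H3.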
