Constituent Classification - A Study on Named Entity Recognition with Bidirectional Recursive Neural Networks Utilizing Syntactic Structures

The second stage of the proposed approach is tasked with classifying constituents on a constituency graph. To utilize relevant constituent structures in classifying each constituent, we propose to use a special recursive network, BRNN-CNN, as the model underlying this stage. By following bottom-up links, the model captures the semantic composition of local information for each constituent. By following top-down links, it captures the global information of the structures containing each constituent. Together, the bidirectional passes propagate relevant information on a constituency graph for the identification of named entity constituents.

5.1 Feature Extraction

For every node in a constituency graph, local features are drawn from the node itself, its left sibling, and its right sibling. The features in use include words, head words, and parse tags. However, features are not always available, for the following reasons.

• Absent siblings,

• Absent words for nonterminal nodes, and

• Absent words and head words for added pyramid nodes.

Should these cases happen, dummy feature values are used.

In addition, the total number of tokens in a sentence is used as a global feature.

5.1.1 Word-Level Features

For word-level features, a function N₁ is first defined that maps each distinct token to a distinct index. Suppose there are n distinct tokens in the corpus. Then their indices range from 1 to n. The special index 0 is assigned to non-existent tokens.

With the index mapping defined, the word-level features are extracted as follows. For each node i with left sibling j and right sibling k, its word-level features are

x_i = (N₁(word_i), N₁(head_i), N₁(head_j), N₁(head_k))

where word_i denotes the word of i, and head_i, head_j and head_k denote the head words of i, j, and k respectively.

For example, suppose the word-to-index mapping used is shown in Table 5.1.


For node#5 in Figure 5.1, its word-level features

x₅ = (N₁(word₅), N₁(head₅), N₁(head₂), N₁(head_φ))

= (N₁(φ), N₁(Kennedy), N₁(senator), N₁(φ))

= (0, 3, 1, 0)

where φ represents non-existent nodes and tokens.

x       φ   senator   Edward   Kennedy   Bob   and   Mary   Schindler
N₁(x)   0   1         2        3         4     5     6      7

Table 5.1: An example of word-to-index mapping. φ represents a non-existent token.

Figure 5.1: Binarized tree for S₁ = (senator, Edward, Kennedy).
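The word-level extraction above can be sketched in a few lines of Python. This is a minimal illustration, not the thesis implementation: the dictionary `N1` copies Table 5.1, while the node representation (a dict with `word` and `head` keys) and the function name `word_level_features` are assumptions made for the example.

```python
# Word-to-index mapping copied from Table 5.1; None stands for the non-existent token φ.
N1 = {None: 0, "senator": 1, "Edward": 2, "Kennedy": 3,
      "Bob": 4, "and": 5, "Mary": 6, "Schindler": 7}


def word_level_features(node, left, right):
    """Return (N1(word_i), N1(head_i), N1(head_j), N1(head_k)).

    Each argument is a dict with "word" and "head" keys (hypothetical layout),
    or None when the sibling does not exist.
    """
    def get(n, key):
        return None if n is None else n.get(key)

    return (N1[get(node, "word")], N1[get(node, "head")],
            N1[get(left, "head")], N1[get(right, "head")])


# node#5 is a nonterminal (no word of its own) with left sibling node#2.
node5 = {"word": None, "head": "Kennedy"}
node2 = {"word": "senator", "head": "senator"}
x5 = word_level_features(node5, node2, None)
```

With these inputs, `x5` reproduces the tuple (0, 3, 1, 0) derived in the example.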

5.1.2 Character-Level Features

For character-level features, every word is treated as a sequence of characters; when a token is non-existent, an empty sequence is used as its character sequence. To facilitate batch computing, words are preprocessed so that they are uniform in length. This is achieved with the aid of a special padding character. In addition, special start and end characters are used to provide boundary information for shift-window models. Algorithm 3 shows the steps of the process. Effectively, for words that are too long, the trailing characters are cut off before start is prepended and end is appended. Conversely, for short words, start and end are added before additional padding characters are appended. The uniform length is set to 20, which preserves most dictionary words in full and truncates the noisy tails of long tokens such as web addresses.

Algorithm 3 Unification

function UNIFY(word)
    word ← [start] + word[1..18] + [end]
    word ← word + [padding] × (20 − word.length)
    return word
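A runnable counterpart of Algorithm 3 might look as follows. It is a sketch under stated assumptions: the symbol strings `START`, `END`, and `PAD` are placeholder names chosen here, and the uniform length is exposed as a parameter (default 20) so that the shorter demonstration length used later can be reproduced.

```python
START, END, PAD = "<s>", "</s>", "<pad>"  # hypothetical names for the special characters


def unify(word, length=20):
    """Truncate or pad `word` to exactly `length` symbols, including start and end."""
    # Keep at most length - 2 characters so that start and end always fit.
    chars = [START] + list(word)[:length - 2] + [END]
    # Fill the remainder with padding symbols.
    return chars + [PAD] * (length - len(chars))
```

For instance, `unify("senator", 5)` keeps only the first three characters between the boundary symbols, and `unify("", 5)` yields start, end, and three paddings.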

With preprocessed words, a function N₂ is defined that maps each distinct character to a distinct index. Suppose there are n distinct characters in the corpus. Then their indices range from 3 to n+2. The indices 0, 1, and 2 are reserved for the padding, end, and start characters respectively. For simplicity, the notation is overloaded so that N₂ also denotes the procedure that takes a character sequence c and returns the sequence of mapped indices of UNIFY(c).


With the procedure defined, the character-level features are extracted as follows. For each node i with left sibling j and right sibling k, its character-level features are

c_i = (N₂(word_i), N₂(head_i), N₂(head_j), N₂(head_k))

where word_i denotes the word of i, and head_i, head_j and head_k denote the head words of i, j, and k respectively.

For example, suppose characters a, . . . , z, A, . . . , Z are mapped to 3, . . . , 28, 29, . . . , 54 and the uniform length is set to 5 for the purpose of demonstration. Then senator, Kennedy, and non-existent tokens are unified to (start,s,e,n,end), (start,K,e,n,end), and (start,end,padding,padding,padding) respectively. For node#5 in Figure 5.1, its character-level features

c₅ = (N₂(word₅), N₂(head₅), N₂(head₂), N₂(head_φ))

= (N₂(φ), N₂(Kennedy), N₂(senator), N₂(φ))

= ((2, 1, 0, 0, 0), (2, 39, 7, 16, 1), (2, 21, 7, 16, 1), (2, 1, 0, 0, 0))

where φ represents non-existent nodes and tokens.
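The example can be checked with a short Python sketch of the overloaded N₂ procedure. The reserved indices and the letter mapping follow the text; the function name `n2`, the constant names, and the use of an empty string for a non-existent token are assumptions for illustration.

```python
import string

PAD, END, START = 0, 1, 2  # reserved indices as stated in the text
# a..z -> 3..28 and A..Z -> 29..54, matching the demonstration mapping.
CHAR_INDEX = {ch: i + 3 for i, ch in enumerate(string.ascii_letters)}


def n2(word, length=5):
    """Unify `word` to `length` symbols and map every symbol to its index."""
    body = [CHAR_INDEX[ch] for ch in word[:length - 2]]
    seq = [START] + body + [END]
    return tuple(seq + [PAD] * (length - len(seq)))


# node#5: no word, head "Kennedy", left sibling head "senator", no right sibling.
c5 = (n2(""), n2("Kennedy"), n2("senator"), n2(""))
```

Evaluating `c5` gives the same four index sequences as the derivation above.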

5.1.3 Parse Tag Features

For parse tag features, a function N₃ is defined that maps each distinct parse tag to a distinct index. Suppose there are n distinct parse tags in the grammar. Then their indices range from 1 to n. Non-existent parse tags are mapped to index 0.

For each node i with left sibling j and right sibling k, its parse tag features are

p_i = (N₃(pos_i), N₃(pos_j), N₃(pos_k))

where pos_i, pos_j, and pos_k denote the parse tags of i, j, and k respectively.

For example, suppose the pos-to-index mapping used is shown in Table 5.2. For node#5 in Figure 5.1, its parse tag features

p₅ = (N₃(pos₅), N₃(pos₂), N₃(pos_φ))

= (N₃(NP), N₃(NNP), N₃(φ))

= (5, 3, 0)

where φ represents non-existent nodes and parse tags.

p       φ   NN   NNS   NNP   NNPS   NP   PYRAMID
N₃(p)   0   1    2     3     4      5    6

Table 5.2: An example of pos-to-index mapping. φ represents a non-existent parse tag. Added pyramid nodes all have the same special tag.

5.1.4 Lexicon Hit Features

In addition to the aforementioned features derived from parse trees, lexicon hit features are introduced from external lexicon resources. For each node, there is one feature per lexicon. If the constituent of the node is found in a lexicon, the corresponding feature value is set to 1 (hit); otherwise it is set to 0 (no hit). All phrases are lower-cased before being compared.

For example, suppose there are three lexicons, as shown in Table 5.3. Then for node#5 in Figure 5.1, its lexicon hit features are lex₅ = (1, 0, 0) because its constituent, Edward Kennedy, is found in the first lexicon.


Table 5.3: Three example lexicons of PER, ORG, and LOC.
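The lexicon lookup can be sketched as below. The lexicon entries are hypothetical and serve only to make the example runnable; the only properties taken from the text are the lexicon order (PER, ORG, LOC), the lower-casing before comparison, and the 1/0 encoding. The names `LEXICONS`, `LEXICON_ORDER`, and `lexicon_hits` are likewise assumptions.

```python
# Hypothetical lexicon entries, stored lower-cased; not taken from the thesis.
LEXICONS = {
    "PER": {"edward kennedy", "bob schindler"},
    "ORG": {"united nations"},
    "LOC": {"florida"},
}
LEXICON_ORDER = ("PER", "ORG", "LOC")


def lexicon_hits(tokens):
    """Return one 0/1 feature per lexicon for the constituent made of `tokens`."""
    phrase = " ".join(tokens).lower()  # phrases are lower-cased before comparison
    return tuple(int(phrase in LEXICONS[name]) for name in LEXICON_ORDER)


lex5 = lexicon_hits(["Edward", "Kennedy"])
```

Under these assumed entries, `lex5` equals (1, 0, 0), as in the example for node#5.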

5.2 BRNN-CNN

A special recursive neural network, BRNN-CNN, is proposed to classify each constituent from a constituency graph. To classify a node, BRNN-CNN does not only consider its features. Instead, BRNN-CNN considers the hidden states of the node, which are recursively computed from the features of its relevant linguistic structures.

To recursively compute hidden states, constituents must be structured as a directed acyclic graph (DAG). BRNN-CNN can then repeatedly apply its hidden layers from the sources to the sinks of the DAG. For each sentence, two DAGs are formed from its hierarchical constituency graph. One is formed by taking all the nodes and the bottom-up links, and the other uses the top-down links instead.

Because local information is propagated bottom-up according to constituency structures, bottom-up hidden states capture the semantic composition of each constituent. On the other hand, the features of ancestors and their siblings are propagated down to each descendant, hence top-down hidden states capture global information for each node. Together, the two hidden states of each node contain the structural information of a sentence relevant to the classification of the corresponding constituent.

5.2.1 Input Layer

The extracted features, described in the previous section, are first processed into real-valued vectors by the input layer of BRNN-CNN. For each node, its feature vector is the concatenation of vectors representing word-level, character-level, parse tag, and lexicon hit features.

Word-Level Vector

Suppose there are n distinct tokens in the corpus and the desired word embedding dimension is d_x. Then BRNN-CNN stores a word embedding look-up table W_x, which is an n-by-d_x real-valued matrix. Effectively, every row of W_x represents a word embedding.

Recall that each word-level feature is simply a word index. BRNN-CNN transforms each word index, except for 0, into an n-dimensional one-hot vector. The special index 0 is transformed into a zero vector. The vector is then multiplied by W_x to retrieve the word embedding. Finally, for a node i with word-level features x_i, a vector X_i is formed by concatenating the embeddings of the 4 words in x_i.

For example, suppose d_x = 2. Then the word embedding of word index 3 is computed by multiplying the n-dimensional one-hot vector whose third component is 1 by W_x, which selects the third row of W_x.
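The look-up by one-hot multiplication can be illustrated with a small Python sketch. The table values below are made up for the demonstration (n = 4 tokens, d_x = 2) and do not come from the thesis; `one_hot` and `embed` are hypothetical helper names.

```python
def one_hot(index, n):
    """n-dimensional one-hot row vector for index 1..n; index 0 gives the zero vector."""
    return [1.0 if index == pos else 0.0 for pos in range(1, n + 1)]


def embed(index, table):
    """Multiply the one-hot row vector by the n-by-d_x look-up table."""
    vec = one_hot(index, len(table))
    dims = len(table[0])
    return [sum(vec[r] * table[r][c] for r in range(len(table))) for c in range(dims)]


# Hypothetical 4-by-2 look-up table; the values are illustrative only.
W_x = [[0.1, 0.2],
       [0.3, 0.4],
       [0.5, 0.6],
       [0.7, 0.8]]
```

Here `embed(3, W_x)` returns the third row of the table, and `embed(0, W_x)` returns a zero vector. In practice a direct row index is equivalent and cheaper; the multiplication is spelled out only to mirror the description.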

Character-Level Vector

Computing a character-level vector for a node is more complex than computing the other vectors in the input layer. In a nutshell, BRNN-CNN forms a matrix from the character sequence of each token and puts it through a series of convolution, max-pooling, and highway layers. These computations constitute the CNN part of the model, whereas the so-called input layer is the input layer of the BRNN.

The first step is to form a character-level feature matrix for each token. Recall that the character-level features of each token are a sequence of character indices. BRNN-CNN forms a one-hot vector for each character index except for the index 0, which is transformed into a zero vector. These character vectors are then stacked into an m-by-n matrix, where m is the uniform word length and n is the number of distinct characters in the corpus plus the end and start characters.

Sub-word patterns are then captured by putting the character-level feature matrix through multiple convolution kernels in parallel. Kernels may have different heights, which serve as their window sizes, but their widths must all be n. For example, if a kernel has height h, the convolutional layer computes an (m−h+1)-by-1 feature map. The feature map is then max-pooled into a scalar, representing the signal strength of a sub-word pattern of length h. Finally, the results of all the kernels are collected into a vector u, whose length d_c is the number of kernels.
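The convolution and max-pooling step can be sketched in plain Python as follows. The sketch assumes the simplest reading of the description, with no bias term or non-linearity inside the convolution; the function names `conv_max` and `char_cnn` are hypothetical.

```python
def conv_max(matrix, kernel):
    """Slide an h-by-n kernel down an m-by-n matrix and max-pool the feature map."""
    m, h = len(matrix), len(kernel)
    # One value per window position, giving an (m - h + 1)-element feature map.
    feature_map = [
        sum(kernel[r][c] * matrix[top + r][c]
            for r in range(h) for c in range(len(kernel[r])))
        for top in range(m - h + 1)
    ]
    return max(feature_map)


def char_cnn(matrix, kernels):
    """Return the vector u with one max-pooled value per kernel (d_c = len(kernels))."""
    return [conv_max(matrix, k) for k in kernels]
```

As a usage example, a 3-by-2 matrix `[[1, 0], [0, 1], [1, 0]]` passed through one kernel of height 2 and one of height 1 yields a two-dimensional u, one entry per kernel.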

Furthermore, as suggested by Kim et al. [15], the CNN-computed character-level feature vector u of each token is put through an additional highway layer. The final character-level vector of a token, called v, is computed as follows.

$$ t = \sigma(W_t u + b_t) $$

$$ v' = \mathrm{ReLU}(W_v u + b_v) $$

$$ v = (1 - t) \odot u + t \odot v' $$

Here σ(x) = 1 / (1 + e^{−x}) is the sigmoid function and ReLU(x) = max(0, x), both applied element-wise, and ⊙ denotes element-wise multiplication. W_t and W_v are d_c-by-d_c square weight matrices, and b_t and b_v are d_c-dimensional bias vectors.

Essentially, the final vector v is a weighted sum of its input u and the non-linear transformation v′ of u. By initializing b_t to a negative value, the layer initially passes its input almost directly to its output, like a highway.
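A minimal Python sketch of the highway layer is given below, assuming column-vector conventions (W u + b) and element-wise gating; the function name `highway` and its argument layout are choices made for this illustration.

```python
import math


def highway(u, W_t, b_t, W_v, b_v):
    """Gate between the input u and its ReLU transformation, element by element."""
    def affine(W, b):
        # Computes W u + b for a d_c-by-d_c matrix W and a d_c-dimensional bias b.
        return [sum(W[r][c] * u[c] for c in range(len(u))) + b[r] for r in range(len(b))]

    t = [1.0 / (1.0 + math.exp(-z)) for z in affine(W_t, b_t)]   # transform gate
    v_prime = [max(0.0, z) for z in affine(W_v, b_v)]            # non-linear transformation
    return [(1.0 - tg) * ui + tg * vi for tg, ui, vi in zip(t, u, v_prime)]
```

With a strongly negative b_t the gate t is close to 0 and the output is nearly identical to u, which is the "highway" behaviour described above; with a strongly positive b_t the output approaches the transformed vector instead.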

In summary, for each node i, BRNN-CNN computes a character-level vector for each of the 4 character index sequences in c_i. These 4 vectors are then concatenated to form the character-level vector of the node, called C_i, for the input layer.


Parse Tag Vector

A parse tag vector is computed for each node i. Suppose there are d_p distinct parse tags in the corpus. For each parse tag index in p_i, a d_p-dimensional one-hot vector is formed, except that index 0 is transformed into a zero vector. These vectors are concatenated into a long vector, called P_i, for the input layer.

Global Word Feature Vector

Aside from a constituent itself and its ancestors, the whole sentence provides useful additional information for NER. BRNN-CNN averages the word-level vectors of all tokens in a sentence as M_x. Similarly, the character-level vectors of all the tokens in a sentence are averaged to get another mean embedding M_c. These two vectors are used for every node in the constituency parse of the sentence as global knowledge.

Input Layer Vector

Finally, for each node i, BRNN-CNN computes its input layer by Equation 5.1.

$$ I_i = X_i \,\|\, C_i \,\|\, P_i \,\|\, lex_i \,\|\, M_x \,\|\, M_c \tag{5.1} $$

where ‖ denotes vector concatenation. The dimension of I_i is given by

$$ d_I = 4 d_x + 4 d_c + 3 d_p + d_{lex} + d_x + d_c $$

where d_lex is the number of lexicons.

5.2.2 Hidden Layers

Having computed the input layer for every node on a constituency graph, BRNN-CNN recursively computes two hidden states for every node.

For each node i with the set of child nodes N and the parent p, the bottom-up hidden vector H_bot,i and top-down hidden vector H_top,i are recursively computed by Equation 5.2 and Equation 5.3 respectively.

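$$ H_{bot,i} = \mathrm{ReLU}\Big(\big(I_i \,\|\, \sum_{j \in N} H_{bot,j}\big) W_{bot} + b_{bot}\Big) \tag{5.2} $$

$$ H_{top,i} = \mathrm{ReLU}\big((I_i \,\|\, H_{top,p}) W_{top} + b_{top}\big) \tag{5.3} $$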

The function ReLU(x) represents max(0, x). Suppose d_H is the desired hidden feature dimension. Then W_bot and W_top are (d_I+d_H)-by-d_H weight matrices, and b_bot and b_top are d_H-dimensional bias vectors. If the set of child nodes is empty, a d_H-dimensional zero vector is used instead of Σ_{j∈N} H_bot,j. Similarly, if i has no parent, a zero vector is used instead of H_top,p.
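The two recursive passes can be sketched as follows. This is an illustrative simplification: it assumes a plain tree in which every node has at most one parent, represents a node as a dict with keys `I` and `children` (a hypothetical layout), and stores the results under `H_bot` and `H_top`.

```python
def relu_affine(vec, W, b):
    """ReLU(vec W + b) for a row vector vec and a len(vec)-by-len(b) matrix W."""
    return [max(0.0, sum(vec[r] * W[r][c] for r in range(len(vec))) + b[c])
            for c in range(len(b))]


def bottom_up(node, W_bot, b_bot):
    """Set node['H_bot'] for the node and all its descendants (Equation 5.2)."""
    total = [0.0] * len(b_bot)            # zero vector when there are no children
    for child in node.get("children", []):
        bottom_up(child, W_bot, b_bot)
        total = [s + h for s, h in zip(total, child["H_bot"])]
    # List concatenation plays the role of the ‖ operator.
    node["H_bot"] = relu_affine(node["I"] + total, W_bot, b_bot)


def top_down(node, W_top, b_top, parent_state=None):
    """Set node['H_top'] for the node and all its descendants (Equation 5.3)."""
    state = parent_state if parent_state is not None else [0.0] * len(b_top)
    node["H_top"] = relu_affine(node["I"] + state, W_top, b_top)
    for child in node.get("children", []):
        top_down(child, W_top, b_top, node["H_top"])
```

Calling `bottom_up(root, W_bot, b_bot)` followed by `top_down(root, W_top, b_top)` annotates every node with both hidden states. Children are processed before their parent in the first pass and parents before children in the second, which matches the source-to-sink order on the two DAGs.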

The computations of the bottom-up and top-down hidden states of node#5 in Figure 5.1 are illustrated in Figures 5.2 and 5.3. The full bottom-up and top-down passes on the constituency graph in Figure 5.1 are illustrated in Figures 5.4 and 5.5.


Figure 5.2: The bottom-up hidden layer applied to node#5 in Figure 5.1.

Figure 5.3: The top-down hidden layer applied to node#5 in Figure 5.1.

Figure 5.4: The bottom-up hidden layer applied to the graph in Figure 5.1.

Figure 5.5: The top-down hidden layer applied to the graph in Figure 5.1.


Deep Hidden Layers

There can be more than one recursive hidden layer in BRNN-CNN. A one-layer recursive neural network is already deep in the sense of recursion: the hidden layer is stacked as many times as the height of the input hierarchy. Having multiple hidden layers, however, makes more powerful transformations between two neighboring nodes possible.

Suppose there are 3 hidden layers with desired dimensions d_H1, d_H2, and d_H3. For each node i with parent node p, the top-down hidden states are computed as follows.

$$ H_{top,i,1} = \mathrm{ReLU}\big((I_i \,\|\, H_{top,p,1}) W_{top,1} + b_{top,1}\big) $$

$$ H_{top,i,2} = \mathrm{ReLU}\big((H_{top,i,1} \,\|\, H_{top,p,2}) W_{top,2} + b_{top,2}\big) $$

$$ H_{top,i,3} = \mathrm{ReLU}\big((H_{top,i,2} \,\|\, H_{top,p,3}) W_{top,3} + b_{top,3}\big) $$

Here ReLU(x) = max(0, x). W_top,1, W_top,2, and W_top,3 are (d_I+d_H1)-by-d_H1, (d_H1+d_H2)-by-d_H2, and (d_H2+d_H3)-by-d_H3 weight matrices respectively. b_top,1, b_top,2, and b_top,3 are d_H1-, d_H2-, and d_H3-dimensional bias vectors respectively. Figure 5.6 illustrates the three-layered computation. The bottom-up direction is generalized similarly.

Figure 5.6: The top-down deep hidden layer applied to node i with parent p.

5.2.3 Output Layer

The output layer of BRNN-CNN identifies named entity constituents. For each node, the input bottom-up and top-down hidden state vectors contain relevant local and global information of the corresponding constituent. The output is the predicted probability distribution of named entity classes.

Given the set of named entity categories C with size n, a function N₄ is first defined to map each distinct NE category to a distinct integer between 1 and n. The number n + 1 is reserved for the special category NON_NE. The inverse function, mapping integers to categories, is denoted by N₄⁻¹. The dimension of the predicted distributions is therefore d_O = n + 1.

For any node x, let H_x = H_bot,x + H_top,x. Let σ denote the softmax function where, for any vector x,

$$ \sigma(x)_t = \frac{e^{x_t}}{\sum_u e^{x_u}} $$

Then for each node i with left sibling j and right sibling k, its class probability distribution is given by Equation 5.4.

$$ O_i = \sigma\big((H_i \,\|\, H_j \,\|\, H_k) W_{out} + b_{out}\big) \tag{5.4} $$

W_out is a (d_H × 3)-by-d_O weight matrix, and b_out is a d_O-dimensional bias vector. If a sibling does not exist, a zero vector is used as its hidden state. Should deep hidden layers be deployed, the states of the last hidden layer are used. Figure 5.7 illustrates the computation of the output layer.
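The output layer can be sketched in plain Python as below. The sketch takes the combined hidden states H_i, H_j, and H_k as given, uses `None` to signal a missing sibling, and subtracts the maximum logit before exponentiating, which is a standard numerical-stability step not mentioned in the text. The function names are hypothetical.

```python
import math


def softmax(z):
    """Softmax over a list of logits, shifted by the maximum for stability."""
    shift = max(z)
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def output_layer(H_i, H_j, H_k, W_out, b_out):
    """O_i = softmax((H_i ‖ H_j ‖ H_k) W_out + b_out); pass None for a missing sibling."""
    zeros = [0.0] * len(H_i)
    vec = (H_i
           + (H_j if H_j is not None else zeros)
           + (H_k if H_k is not None else zeros))
    logits = [sum(vec[r] * W_out[r][c] for r in range(len(vec))) + b_out[c]
              for c in range(len(b_out))]
    return softmax(logits)
```

The returned list has d_O entries that sum to 1, one per NE category plus the NON_NE class.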


Figure 5.7: The output layer applied to node i with left sibling j and right sibling k.

The fully-connected layer does not contain non-linear transformations like ReLU.

5.3 Prediction Collection

After BRNN-CNN produces a probability distribution for each constituent, the set of predicted named entities is collected from the constituency graph.

For each constituent, the system classifies it as the category with the highest predicted probability. Formally, for each node i, the system predicts its label by Equation 5.5.

$$ L_i = N_4^{-1}\big(\arg\max_j O_{ij}\big) \tag{5.5} $$

O_ij represents the j-th element of O_i.

Intuitively, the predictions for a sentence are the labels L_i of every node i, except those with L_i = NON_NE. Given the set of sentences S and the set of named entity categories C, the set of predicted named entities is

$$ E_a = \{(i, (j, k), L_n) \mid S_i \in S,\ n \in GN_{S_i},\ (S_{ij}, \ldots, S_{i(k-1)}) = const_n,\ L_n \in C\} $$

where GN_{S_i} denotes the set of nodes in the constituency graph of the sentence S_i, and const_n denotes the corresponding constituent of the node n.

In many applications, overlapping named entities, including nested named entities, are not practically useful and should not be predicted by NER systems. Such overlaps can arise with the collection method above. For instance, two NEs are nested if their corresponding nodes are an ancestor and its descendant.

To form non-overlapping predictions, the system applies a special NE collection scheme. The scheme traverses every node of a constituency graph in two passes. The first pass performs a depth-first walk from the root node of the original constituency parse tree in the graph. The system stops descending a branch as soon as it encounters a node i in that branch with L_i ≠ NON_NE. The second pass then walks through the additional pyramid nodes in a breadth-first fashion, collecting an additional NE only if it does not overlap with previously collected NEs.

In short, the system forms non-overlapping predictions from the output of BRNN-CNN by preferring tree nodes over additional pyramid nodes and larger NEs over smaller ones. The heuristic behind this scheme is to trust the parse and to avoid nested NEs, such as the first name and the last name within a full name.
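The two-pass collection scheme can be sketched as follows. Several details are assumptions made for the illustration: nodes are dicts with `span`, `label`, and `children` keys, spans are half-open token ranges (j, k) as in the definition of E_a, and the pyramid nodes are passed in as a list that is already in breadth-first order.

```python
NON_NE = "NON_NE"


def overlaps(a, b):
    """Half-open token spans (j, k) overlap when they share at least one token."""
    return a[0] < b[1] and b[0] < a[1]


def collect(root, pyramid_nodes):
    """Return non-overlapping (span, label) predictions for one sentence."""
    collected = []
    stack = [root]
    while stack:                       # pass 1: depth-first walk over the parse tree
        node = stack.pop()
        if node["label"] != NON_NE:
            collected.append((node["span"], node["label"]))
            continue                   # do not descend below a predicted NE
        stack.extend(reversed(node.get("children", [])))
    for node in pyramid_nodes:         # pass 2: pyramid nodes, assumed breadth-first order
        if node["label"] != NON_NE and not any(
                overlaps(node["span"], span) for span, _ in collected):
            collected.append((node["span"], node["label"]))
    return collected
```

Because the first pass stops at the highest labeled node of each branch, nested predictions such as a first name inside a full name are never collected, and the overlap check in the second pass keeps pyramid nodes from conflicting with tree nodes.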

Chapter 6 Evaluation

The constituency-oriented approach is evaluated on CoNLL 2003 NER and OntoNotes 5.0 NER. The detailed setup of the experiments is given in the first section. Major results against state-of-the-art systems and related work are shown in the second section. Finally, we analyze different aspects of the approach with ablation studies and discuss their abstractive meaning with case studies.

6.1 Experimental Setup

6.1.1 Parameter Initialization

Parameters of BRNN-CNN are contained in the CNN, the highway layer, the BRNN hidden layers, and the output layer. In addition, the word embedding look-up table is also trainable. Table 6.1 summarizes the weights and biases whose values need to be decided.

Except for W_x, all parameters are initialized with the Xavier initializer [11]. The initializer tries to keep the scale of output values stable across the layers of deep networks. This is desirable because BRNN-CNN may recurse arbitrarily deep.

The initialization of the word embedding look-up table, on the other hand, has two cases. A pre-trained table is used where possible. For example, unsupervised word embeddings trained by GloVe [23] on an 840-billion-token web corpus are available. If a pre-trained GloVe vector can be found for a word, the corresponding row of W_x is initialized with that vector. Otherwise, the row is sampled from a Gaussian distribution with zero mean and a standard deviation of 0.1.

Layer                          Weights and Biases
Word Embedding Look-Up Table   W_x
CNN                            Kernels
Highway Layer                  W_t, W_v, b_t, b_v
Hidden Layers                  W_bot, W_top, b_bot, b_top
Output Layer                   W_out, b_out

Table 6.1: Parameters of BRNN-CNN.

6.1.2 Parameter Optimization

Once initialized, the model parameters can be optimized with a training corpus.

In a training corpus, all ground truth named entities are known to BRNN-CNN.


For each node n in the constituency graph of a sentence, its ground truth NE label is denoted by L_n,g. If n corresponds to a named entity e = (i, (j, k), c), then L_n,g = c. Otherwise, L_n,g = NON_NE. Equation 6.1 gives the loss function of a node.

$$ loss_n = -\log O_{n,\,N_4(L_{n,g})} \tag{6.1} $$

Here O_{n,N₄(L_n,g)} denotes the N₄(L_n,g)-th element of O_n. The objective is to minimize the average loss over all nodes in a corpus. To achieve this, the Adam optimizer [16] is used for parameter updates.
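The per-node loss and the corpus objective amount to an average negative log-likelihood, which can be sketched as below. The function names are hypothetical, and the gold class is passed as a 1-based index to mirror N₄.

```python
import math


def node_loss(O_n, gold_index):
    """Negative log-probability of the gold class; gold_index is 1-based like N₄."""
    return -math.log(O_n[gold_index - 1])


def average_loss(distributions, gold_indices):
    """Average the node losses over all nodes in a corpus."""
    losses = [node_loss(o, g) for o, g in zip(distributions, gold_indices)]
    return sum(losses) / len(losses)
```

A node whose predicted distribution puts probability 1 on the gold class contributes zero loss, and the loss grows without bound as that probability approaches 0.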

To avoid overfitting with gradient descent optimization algorithms such as Adam, it is desirable to stop iterating parameter updates with the aid of a validation corpus. Let F denote an evaluation function, E_a the predictions of BRNN-CNN for the validation corpus, and E_g the set of ground truth NEs of the corpus. After each round, or epoch, of parameter updates on the training corpus, the validation score F(E_a; E_g) is checked, and the best score since initialization is kept. If the record has not been broken for 20 epochs, training stops and the parameter values that achieved the best score are restored.
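The stopping rule can be sketched as a small training loop. The callables `train_epoch` and `validate` are placeholders for one epoch of parameter updates and for the validation score F(E_a; E_g) respectively; both names, and the representation of parameters as a mutable Python object, are assumptions for this illustration.

```python
import copy


def train_with_early_stopping(params, train_epoch, validate, patience=20):
    """Train until the validation score has not improved for `patience` epochs."""
    best_score = float("-inf")
    best_params = copy.deepcopy(params)
    stale = 0                                   # epochs since the last record
    while stale < patience:
        train_epoch(params)                     # one epoch of in-place parameter updates
        score = validate(params)                # validation score after this epoch
        if score > best_score:
            best_score, best_params, stale = score, copy.deepcopy(params), 0
        else:
            stale += 1
    return best_params, best_score              # parameters restored to the best epoch
```

The function returns a snapshot of the parameters from the best-scoring epoch rather than the final ones, which corresponds to restoring the values that achieved the best score.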

Dropout

Zeroing some of the output of some network layers is oftentimes beneficial to training. BRNN-CNN is no exception and dropout layers are added for the input
