Figure 4.2: Architecture of MONPA: multi-objective named entity & POS annotator.

4.2 Sequence Labeling

This experiment uses an encoder-decoder [65] structure with the attention mechanism [46] to perform sequence labeling with multi-task objectives. In particular, we conduct Chinese word segmentation, part-of-speech (POS) tagging, and named entity (NE) labeling simultaneously. The input is a sequence of Chinese characters that may contain named entities, and the output is a sequence of POS tags and possibly NEs in the form of ‘BIES’ tags.
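To make the output format concrete, the following sketch decodes a character-level BIES tag sequence back into words; the sentence and tag strings are made-up examples, not taken from the corpus:

```python
def segment_from_bies(chars, tags):
    """Recover words from character-level BIES tags.
    B = begin, I = inside, E = end, S = single-character word."""
    words, current = [], []
    for ch, tag in zip(chars, tags):
        pos = tag.split("-")[0]          # BIES position prefix
        current.append(ch)
        if pos in ("E", "S"):            # a word boundary closes here
            words.append("".join(current))
            current = []
    if current:                          # tolerate a dangling B/I run
        words.append("".join(current))
    return words

chars = list("王小明在台北")
tags = ["B-PER", "I-PER", "E-PER", "S-P", "B-LOC", "E-LOC"]
print(segment_from_bies(chars, tags))    # ['王小明', '在', '台北']
```

Because each tag also carries a POS or NE category after the BIES prefix, a single output sequence encodes segmentation, POS, and NER decisions at once.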

The model used in this task mainly consists of an embedding layer, recurrent encoder layers, an attention layer, and decoder layers. The embedding layer converts characters into embeddings [50], which are dense, low-dimensional, real-valued vectors. They capture the syntactic and semantic information provided by neighboring characters. In this work, we utilize embeddings pre-trained with word2vec on over 1 million online news articles.

The recurrent encoder layers use LSTM cells, which have been shown to capture long-term dependencies in the input sequence.

We employ a straightforward extension named Bidirectional RNN [26], which encodes sequential information in both directions (forward and backward) and concatenates the final outputs. In this way, the output of one time step contains information from both its left and right neighbors. For tasks such as POS tagging and NER, where the label of one character can be determined by its context, bidirectional learning is beneficial. The attention layer was proposed by Luong et al. [46] to tackle the problem of finding corresponding words in the source and target languages when conducting machine translation. Finally, the recurrent decoder layers take the sequence of outputs from the attention layer and project them onto a V-dimensional vector, where V equals the number of possible POS and NE tags. An overview of the complete model is shown in Figure 4.2. The loss of the model is defined as the averaged cross-entropy between an output sequence and the true label sequence.
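As a rough sketch, the pipeline described above could be assembled as follows in PyTorch (the framework used for the implementation). The dimensions follow the text (300-d embeddings and RNN states) and the attention is a Luong-style "general" variant; the class name, layer wiring, and all identifiers are illustrative assumptions, not the thesis code:

```python
import torch
import torch.nn as nn

class Tagger(nn.Module):
    def __init__(self, vocab_size, num_tags, dim=300):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim, padding_idx=0)
        # Bidirectional encoder: forward/backward outputs are concatenated.
        self.encoder = nn.LSTM(dim, dim, batch_first=True, bidirectional=True)
        self.decoder = nn.LSTM(2 * dim, dim, batch_first=True)
        self.attn = nn.Linear(dim, 2 * dim)             # scores decoder vs. encoder states
        self.proj = nn.Linear(2 * dim + dim, num_tags)  # context + decoder state -> V tags

    def forward(self, char_ids):
        enc, _ = self.encoder(self.embed(char_ids))               # (B, T, 2*dim)
        dec, _ = self.decoder(enc)                                # (B, T, dim)
        scores = torch.bmm(self.attn(dec), enc.transpose(1, 2))   # (B, T, T)
        context = torch.bmm(scores.softmax(dim=-1), enc)          # (B, T, 2*dim)
        return self.proj(torch.cat([context, dec], dim=-1))       # (B, T, V)

model = Tagger(vocab_size=100, num_tags=20)
x = torch.randint(1, 100, (2, 6))       # a toy batch of two 6-character inputs
logits = model(x)                       # shape (2, 6, 20)
# Averaged cross-entropy against the true tag sequence, as in the text:
loss = nn.functional.cross_entropy(logits.reshape(-1, 20),
                                   torch.randint(0, 20, (2 * 6,)))
print(logits.shape, loss.item() > 0)
```

In practice the pre-trained word2vec vectors would be loaded into the embedding layer, and the encoder and decoder would each be stacked three layers deep, as described in Section 4.2.1.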

Test corpora from five previous SIGHAN shared tasks, which have been widely adopted for Traditional Chinese word segmentation and NER, were used to evaluate the proposed system.

Besides the participating systems in the above shared tasks, we also compare with the existing word segmentation toolkits Jieba and CKIP [32]. The word segmentation datasets were taken from the SIGHAN shared tasks of years 2003–2008, and the NER dataset is from 2006. We follow the standard train/test split of the provided data, where 10,000 sentences of the training set are used as the validation set. Details of the word segmentation and NER datasets are shown in Tables 4.4 and 4.5, respectively. Three metrics are used for evaluation: precision, recall, and F1-score. For word segmentation, a token is considered correct if both its left and right boundaries match those of a word in the gold standard. For the NER task, both the boundaries and the NE type must be correctly identified.
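The boundary-matching criterion above can be sketched as follows: each word sequence is converted to character-offset spans, and a predicted token counts as correct only if both its boundaries match a gold word. The example data is hypothetical:

```python
def to_spans(words):
    """Convert a word sequence to a set of (start, end) character offsets."""
    spans, i = set(), 0
    for w in words:
        spans.add((i, i + len(w)))
        i += len(w)
    return spans

def prf(gold_words, pred_words):
    """Precision, recall, and F1 over exactly-matching word spans."""
    gold, pred = to_spans(gold_words), to_spans(pred_words)
    correct = len(gold & pred)
    p = correct / len(pred) if pred else 0.0
    r = correct / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

gold = ["台北", "是", "一個", "城市"]
pred = ["台北", "是", "一", "個", "城市"]
print(prf(gold, pred))   # (0.6, 0.75, 0.666...)
```

For NER the same idea applies, except each span would also carry its NE type, so a boundary match with the wrong type does not count.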

4.2.1 Experimental Setup

In order to obtain multi-objective labels of the training data, we first merge datasets from the 2006 SIGHAN word segmentation and NER shared tasks. Since rich context information is

Table 4.4: Statistics of the number of words in two word segmentation datasets.

              AS                  CityU
Year     Train    Test       Train    Test
2003     5.8M     12K        240K     35K
2005     5.45M    122K       1.46M    41K
2006     5.5M     91K        1.6M     220K
2008     1.5M     91K        -        -

Table 4.5: Statistics of the number of words in the 2006 NER dataset.

                 #Train/Test Words
Person           36K / 8K
Location         48K / 7K
Organization     28K / 4K

able to benefit deep learning-based approaches, we augment the training set by collecting online news articles2. There are three steps for annotating the newly created dataset. We first collect a list of NEs from Wikipedia and use it to search for NEs in the corpus, where longer NEs have higher priority. Then, an NER tool [75] is utilized to label NEs. Finally, CKIP is utilized to segment and label the remaining words with POS tags. Three variants of the proposed model are tested, labeled RNNCU06, RNNYA, and RNNCU06+YA. RNNCU06 is trained using only the word segmentation and NER datasets from the 2006 City University (CU) corpus; RNNYA is trained using only the online news corpus; and RNNCU06+YA is trained on a combination of the above corpora.
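The first annotation step, greedy longest-match lookup against the Wikipedia-derived NE list, can be sketched as below; the entity list and sentence are made-up examples:

```python
def longest_match(text, entities):
    """Return (start, end, entity) spans, preferring longer entries
    so that e.g. a full organization name wins over its prefix."""
    by_len = sorted(entities, key=len, reverse=True)
    spans, i = [], 0
    while i < len(text):
        for e in by_len:
            if text.startswith(e, i):
                spans.append((i, i + len(e), e))
                i += len(e)              # skip past the matched entity
                break
        else:
            i += 1                       # no entity starts here; advance
    return spans

entities = {"台北", "台北市政府"}
print(longest_match("台北市政府位於台北", entities))
# [(0, 5, '台北市政府'), (7, 9, '台北')]
```

Sorting candidates by length implements the "longer NEs have higher priority" rule; the remaining unmatched characters would then be passed to the NER tool and CKIP in steps two and three.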

We implemented the RNN model using pytorch3. The maximum sentence length is set to 80; longer sentences are truncated and shorter sentences are padded with zeros. The forward and backward RNNs each have a dimension of 300, identical to that of the word embeddings.

2News articles are collected from the Yahoo News website and contain about 3M words.

3https://github.com/pytorch/pytorch
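The truncation and padding step above amounts to the following small helper, assuming index 0 is reserved for padding:

```python
MAX_LEN = 80  # maximum sentence length reported in the text

def pad_or_truncate(ids, max_len=MAX_LEN, pad_id=0):
    """Clip a character-id sequence to max_len and zero-pad short ones."""
    ids = ids[:max_len]                           # truncate long sentences
    return ids + [pad_id] * (max_len - len(ids))  # pad short sentences

short = pad_or_truncate([5, 9, 3])
long = pad_or_truncate(list(range(1, 101)))
print(len(short), short[:5])   # 80 [5, 9, 3, 0, 0]
print(len(long), long[-1])     # 80 80
```

Reserving index 0 for padding matches the common PyTorch convention of passing `padding_idx=0` to the embedding layer so padded positions contribute zero vectors.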

There are three layers in both the encoder and the decoder. Dropout layers are inserted between the recurrent layers. Training lasts for at most 100 epochs, or stops early when the accuracy on the validation set starts to drop.
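The training schedule above can be sketched as a simple early-stopping loop; `train_one_epoch` and `evaluate` are hypothetical stand-ins for the real training and validation code:

```python
def train(train_one_epoch, evaluate, max_epochs=100):
    """Run up to max_epochs, stopping once validation accuracy drops."""
    best_acc, epochs_run = float("-inf"), 0
    for _ in range(max_epochs):
        train_one_epoch()
        acc = evaluate()
        epochs_run += 1
        if acc < best_acc:      # validation accuracy started to drop
            break
        best_acc = acc
    return epochs_run, best_acc

# Toy validation-accuracy curve that rises and then falls:
curve = iter([0.5, 0.7, 0.8, 0.75])
print(train(lambda: None, lambda: next(curve)))   # (4, 0.8)
```

A production loop would typically also checkpoint the best-scoring model and allow a patience window rather than stopping on the first drop; the sketch mirrors only the criterion stated in the text.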

4.2.2 Results and Discussion

Note that since we combined external resources, performances of the compared methods are from the open track of the shared tasks. Table 4.6a lists the results of the RNN-based models and top-performing systems for the word segmentation subtask on the Academia Sinica (AS) dataset. First of all, RNNs exhibit consistent capabilities in handling data from different years and are comparable to the best systems in the competition. In addition, it is not surprising that the RNNYA model performs better than RNNCU06. Nevertheless, our method can be further improved by integrating the CU06 corpus, as demonstrated by the results of the RNNCU06+YA model. This indicates that RNN can easily adapt to different domains with data augmentation, which is an outstanding feature of end-to-end models. As for the CU dataset listed in Table 4.6b, all of the RNN models show a considerable decrease in F-score. We postulate that it may be due to the training data, which was processed using an external tool focused on texts from a different linguistic context. It is also reported by [75] that segmentation criteria in the AS and CU datasets are not very consistent. However, by fusing the two corpora, RNNCU06+YA can even surpass the performance of CKIP. Finally, the comparison with Jieba validates that the RNN model can serve as a very effective toolkit for NLP researchers as well as the general public.

Table 4.7 lists the performances of the proposed models and the only system that participated in the open track of the 2006 SIGHAN NER shared task. We can see that RNNCU06 outperforms the model from Yu et al. [78], confirming RNN's capability of jointly learning to segment and recognize NEs. Interestingly, RNNYA obtains a much lower F-score for all NE types.

Table 4.6: Word segmentation performance (% F-score) of various systems on different years of SIGHAN shared tasks, split into the Academia Sinica (AS) and City University (CU) datasets. The best performance in a column is marked bold.

(a) AS dataset, open track

System                   F-score
Jacobs and Wong [36]     95.7
Wang et al. [71]         95.3
Chan and Chong [9]       95.6
Mao et al. [49]          93.6

                 2003    2005    2006    2008
Jieba            83.0    80.9    81.3    81.8
CKIP             96.6    94.2    94.6    94.9
RNNCU06          88.4    86.8    87.1    87.4
RNNYA            94.4    92.8    93.0    93.3
RNNCU06+YA       94.6    93.2    93.6    93.8

(b) CU dataset, open track

System                   F-score
Jacobs and Wong [36]     97.4

                 2003    2005    2006
Jieba            80.3    81.2    82.4
CKIP             89.7    89.0    89.8
RNNCU06          87.6    85.8    87.8
RNNYA            88.0    87.2    88.5
RNNCU06+YA       91.5    90.1    91.7

RNNCU06+YA can obtain only a slightly better F-score for person recognition, but does not reach the overall performance of RNNCU06, even with the combined corpus. We believe that boundary mismatch may be a major contributing factor here. We also observe a large number of one-character NEs, such as abbreviated country names, which cannot be easily identified using character features alone.

Table 4.7: NER performance (% F-score) of different systems on the 2006 SIGHAN NER shared task (open track). The best performance in a column is marked bold.

System            PER      LOC      ORG      Overall
Yu et al. [78]    80.98    86.04    68.01    80.51
RNNCU06           81.13    86.92    68.77    80.68
RNNYA             70.54    67.80    31.35    52.62
RNNCU06+YA        83.01    82.46    54.57    75.28
