of applications. Note that the pre-training losses of these models may differ, but as long as the loss is designed to incorporate helpful auxiliary information such as linguistic knowledge, the resulting model can be stronger when learning a new task.

2.5 Evaluation Metrics

Throughout this dissertation, we use standard metrics in NLP to evaluate the performance of our models. For binary classification, accuracy is typically used, defined as:

$$\mathrm{Acc} = \frac{TP + TN}{TP + FP + TN + FN} \tag{2.5.1}$$

where TP, TN, FP, and FN are the numbers of true positives, true negatives, false positives, and false negatives, respectively. For multi-class classification, where each input sample is classified as one of many classes, the F1-score is typically used. It is defined as:

$$\text{F1-score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{2.5.2}$$

where Precision and Recall are defined as:

$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN} \tag{2.5.3}$$
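To make the computation concrete, the short Python sketch below evaluates Equations (2.5.1)–(2.5.3) directly from confusion-matrix counts; the counts in the example are hypothetical and serve illustration only.

```python
def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Accuracy, precision, recall, and F1 from confusion-matrix counts,
    following Equations (2.5.1)-(2.5.3)."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical counts, for illustration only.
print(classification_metrics(tp=80, tn=90, fp=10, fn=20))
```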

For machine translation, it is typical to use the BLEU (bilingual evaluation understudy) score to evaluate the quality of the translated text [55]. It is designed to measure the correspondence between the output of a machine translation model and a reference (gold-standard) translation.

BLEU was one of the first metrics to achieve a high correlation with human judgments of quality, and it remains one of the most popular and cost-effective automatic metrics. Computing the score involves measuring the similarity between individual segments of the translation and the reference text; the overall score is then the mean over the entire corpus. It should be noted that grammaticality, intelligibility, and semantic quality cannot be evaluated in this manner. The BLEU score lies between 0 and 1, where 1 indicates that the automatic translation is identical to a segment in the reference translations.
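As an illustration only, the snippet below computes a corpus-level BLEU score with NLTK on a single hypothetical sentence pair; the whitespace tokenization and the smoothing choice are assumptions, not the exact setup used in our experiments.

```python
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# One hypothetical sentence pair; tokens are whitespace-split for simplicity.
references = [[["the", "cat", "sat", "on", "the", "mat"]]]   # list of reference sets
hypotheses = [["the", "cat", "is", "on", "the", "mat"]]      # model outputs

# Smoothing avoids a zero score when a higher-order n-gram has no overlap.
score = corpus_bleu(references, hypotheses,
                    smoothing_function=SmoothingFunction().method1)
print(f"BLEU = {score:.3f}")   # value in [0, 1]
```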

For automatic summarization, we adopt the commonly used Recall-Oriented Understudy for Gisting Evaluation (ROUGE) scores [44]. ROUGE calculates the ratio of overlapping units between the generated results and the gold summary, where a unit can be an N-gram or a character sequence. Specifically, we use three ROUGE variants:

ROUGE-1 (unigram), ROUGE-2 (bigram), and ROUGE-L (longest common subsequence). To improve legibility, we abbreviate them as R-1, R-2, and R-L henceforth.

Intuitively, R-1 can be thought of as measuring the amount of information captured by the automatic summaries, whereas R-2 evaluates their overall fluency. Finally, R-L can be regarded as the coverage of the summary over the original article.
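For illustration, the sketch below computes R-1, R-2, and R-L with the open-source rouge-score package on a hypothetical sentence pair; the stemming option and the example texts are assumptions rather than our exact configuration.

```python
from rouge_score import rouge_scorer

reference = "the quick brown fox jumps over the lazy dog"
candidate = "a quick brown fox jumped over a lazy dog"

# R-1, R-2, and R-L as defined above; stemming is an optional assumption here.
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(reference, candidate)   # reference first, prediction second

for name, s in scores.items():
    print(f"{name}: P={s.precision:.3f} R={s.recall:.3f} F={s.fmeasure:.3f}")
```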


Chapter 3 Methods

This chapter provides descriptions of various neural architectures and how to adapt them for the downstream tasks, which include sequence classification, part-of-speech tagging, named entity recognition, sentiment analysis, entailment, translation, and summarization.

3.1 Neural Networks

Artificial neural networks (ANNs) have become a prominent tool for natural language processing in recent years. As a result, a wide variety of network structures are used as the basis of the models for the experiments in this work. Before going into the details of each task, the model architectures are briefly described in the following sections.

3.1.1 Recurrent Neural Networks

In essence, natural language inputs consist of a sequence of words or sub-word tokens whose order cannot be freely altered. Therefore, recurrent neural networks (RNNs) emerge as an intuitive choice.

We propose an approach for identifying protein-protein interactions (PPI) in biomedical literature using an RNN with LSTM cells. We employ a straightforward extension named bidirectional RNN, which encodes sequential information in both directions (forward and backward) and concatenates the final outputs. In this way, the output at one time step contains information from both its left and right neighbors.

For classification tasks, including sentiment analysis and entailment detection, we use a bidirectional LSTM [31] with an attention [4] layer as the sentence encoder, and a fully connected layer for classification. Similarly, for tasks such as POS tagging and NER, where the label of a character can be determined by its context, bidirectional learning can be beneficial.
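A minimal PyTorch sketch of such an encoder is given below; the vocabulary size, embedding dimension, hidden size, and the simple additive attention layer are illustrative assumptions, not the exact hyper-parameters of our models.

```python
import torch
import torch.nn as nn

class BiLSTMAttentionClassifier(nn.Module):
    """Sketch: a Bi-LSTM encoder with an attention layer followed by a fully
    connected classification layer. All hyper-parameters are illustrative."""

    def __init__(self, vocab_size=30000, embed_dim=300, hidden_dim=256, num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.bilstm = nn.LSTM(embed_dim, hidden_dim,
                              batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden_dim, 1)          # one score per time step
        self.classifier = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, token_ids):                          # (batch, seq_len)
        h, _ = self.bilstm(self.embedding(token_ids))      # (batch, seq_len, 2*hidden)
        weights = torch.softmax(self.attn(h), dim=1)       # attention over time steps
        sentence = (weights * h).sum(dim=1)                # weighted sum -> sentence vector
        return self.classifier(sentence)                   # (batch, num_classes)

logits = BiLSTMAttentionClassifier()(torch.randint(0, 30000, (4, 20)))
print(logits.shape)  # torch.Size([4, 2])
```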

For machine translation, we employ a common seq2seq model [65], in which both the encoder and the decoder are 2-layer stacked Bi-LSTMs with 512 hidden units.
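As a sketch only, the encoder side of this configuration can be expressed in PyTorch as follows; the input (embedding) dimension is an assumed value.

```python
import torch
import torch.nn as nn

# Encoder side only: a 2-layer stacked bidirectional LSTM with 512 hidden
# units per direction. The input size of 512 is an assumption.
encoder = nn.LSTM(input_size=512, hidden_size=512,
                  num_layers=2, bidirectional=True, batch_first=True)

src = torch.randn(8, 30, 512)        # (batch, src_len, embed_dim)
outputs, (h_n, c_n) = encoder(src)   # outputs: (8, 30, 1024)
```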

For abstractive summarization, we use an LSTM layer with an attention mechanism, and compare uni-directional and bi-directional networks, as well as the impact of the LSTM cell dimension, the word-vector dimension, and other hyper-parameters.

3.1.2 Self-Attentive Models

Self-attentive models, including the Transformer [70] and Bidirectional Encoder Representations from Transformers (BERT) [18], rely on the attention mechanism [46] to learn a context-dependent representation, or encoding. As such, self-attention has been successfully applied to several tasks. Similar to a bidirectional LSTM, this type of encoder takes $x_0, x_1, \cdots, x_n$ as input and produces context-aware word representations $r_i$ for all positions $0 \leq i \leq n$. We employ a stack of N identical self-attention layers, each having independent parameters.
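The PyTorch sketch below illustrates such a stack of identical (but independently parameterized) self-attention layers; the model dimension, number of heads, and N are assumed values, not the exact configuration used here.

```python
import torch
import torch.nn as nn

# A stack of N identical self-attention layers; nn.TransformerEncoder
# deep-copies the layer, so each copy has independent parameters.
d_model, nhead, N = 512, 8, 6
layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=N)

x = torch.randn(2, 10, d_model)   # token embeddings (batch, seq_len, d_model)
r = encoder(x)                    # context-aware representations r_i, same shape
print(r.shape)                    # torch.Size([2, 10, 512])
```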

The classification problems adopt the BERT model with a setup identical to the original paper, in which BERT is used as an encoder that represents a sentence as a vector. This vector is then fed to a fully connected neural network for classification. Note that models are tuned separately for each task.
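As a hypothetical illustration of this setup using the HuggingFace transformers library, a pre-trained BERT encoder can be paired with a classification head as follows; the checkpoint name and the number of labels are placeholders, not the ones used in our experiments.

```python
import torch
from transformers import BertTokenizerFast, BertForSequenceClassification

# BERT as the encoder with a fully connected classification head on top.
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased",
                                                      num_labels=2)

inputs = tokenizer("The movie was surprisingly good.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits          # shape: (1, num_labels)
print(logits.softmax(dim=-1))
```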

[Figure 3.1: Classification of sentence and sentence pair using BERT; (b) sentence pair classification]

[Figure 3.2: Named entity recognition using BERT]

Figure 3.1 shows how the BERT model is used for classification tasks, Figure 3.2 shows how to perform sequence labeling, such as NER, with BERT, and Figure 3.3 illustrates the approach to building a question answering system with BERT.
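Analogously, the sequence-labeling setup of Figure 3.2 can be sketched with a token-classification head; the checkpoint and the label inventory below are placeholders, not those used in this work.

```python
import torch
from transformers import BertTokenizerFast, BertForTokenClassification

# Hypothetical sequence labeling (e.g. NER) with BERT.
tokenizer = BertTokenizerFast.from_pretrained("bert-base-cased")
model = BertForTokenClassification.from_pretrained("bert-base-cased", num_labels=9)

inputs = tokenizer("John lives in Taipei", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits                      # (1, seq_len, num_labels)
pred_ids = logits.argmax(dim=-1)[0].tolist()             # one label id per word piece
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
print(list(zip(tokens, pred_ids)))
```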

In addition, we tried to determine the effect of pre-training by testing a compact version of BERT, named BERTnopt, which comprises three self-attention layers instead of 12.
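One possible way to instantiate such a compact, randomly initialized model with the transformers library is sketched below; apart from the reduced depth, the remaining sizes are assumed to follow bert-base, and no pre-trained weights are loaded.

```python
from transformers import BertConfig, BertForSequenceClassification

# Compact BERT with 3 self-attention layers instead of 12; other sizes keep
# the bert-base defaults (assumption). No pre-trained weights are loaded.
config = BertConfig(num_hidden_layers=3, num_labels=2)
bert_nopt = BertForSequenceClassification(config)        # random initialization
print(sum(p.numel() for p in bert_nopt.parameters()), "parameters")
```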

To the best of our knowledge, machine translation models do not typically employ BERT. Therefore, for our MT experiments, a Transformer encoder-decoder model is utilized.
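For reference, a minimal PyTorch sketch of such an encoder-decoder follows; the layer counts and model dimension are assumptions rather than our exact configuration.

```python
import torch
import torch.nn as nn

# A minimal Transformer encoder-decoder; dimensions and depths are assumed.
mt_model = nn.Transformer(d_model=512, nhead=8,
                          num_encoder_layers=6, num_decoder_layers=6,
                          batch_first=True)

src = torch.randn(4, 25, 512)   # source token embeddings (batch, src_len, d_model)
tgt = torch.randn(4, 20, 512)   # shifted target embeddings (batch, tgt_len, d_model)
out = mt_model(src, tgt)        # (4, 20, 512), then projected onto the vocabulary
```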
