National Chengchi University (國立政治大學)
社群網路與人智計算國際研究生博士學位學程 (International Doctoral Program in Social Networks and Human-Centered Computing)
Doctoral Dissertation

遞歸及自注意力類神經網路之強健性分析
Analysis of the Robustness of Recurrent and Self-attentive Neural Networks

Doctoral student: 謝育倫 (Yu-Lun Hsieh)
Advisors: Dr. 許聞廉 (Wen-Lian Hsu), Dr. 劉昭麟 (Chao-Lin Liu)
June 2020

DOI: 10.6814/NCCU202001426

致謝 (Acknowledgments)

The completion of this dissertation would not have been possible without the many teachers, colleagues, and family members who helped me along the way. First, I thank my advisors, Dr. 許聞廉 (Wen-Lian Hsu) and Dr. 劉昭麟 (Chao-Lin Liu), who always gave me room to explore possible research directions and generously provided abundant resources; without their help, this work would not exist. I am also grateful to Professor 謝卓叡 (Cho-Jui Hsieh) of the University of California for the discussions and guidance during my stay as a visiting scholar in the United States, which enabled our research to be published at a top international conference. My doctoral journey began when I joined the Intelligent Agent Systems Lab at the Institute of Information Science, Academia Sinica. Over nearly ten years there, I was fortunate to be mentored by many senior researchers, including 李政緯, 施政緯, 姜天戩, 劉士弘, and 張詠淳, who not only helped me enormously in academic matters but also taught me many essentials of software engineering. I also thank the other members of the lab: 翠玲, 秋珍, 照玲, 永瑜, 庭豪, 千嘉, 柏廷, 堃育, 欣裕, 乃文, 國豪, 明祥, 俊翰, 書豪, 友珊, 新蓉, 瑩棻, 佳蓉, 瑞祥, 俊宏, 茂彰, 建勤, and the many other partners I have worked with; it is everyone's seamless cooperation that produced our research team's achievements. I am also deeply grateful to 大成, my classmate and friend of more than twenty years, to whom I often turned for advice during my studies and who helped me greatly during my visit to the United States; I will always remember his kindness. Finally, I thank 雨芬, who, in the not-so-long time we have known each other, gave me encouragement and trust and spared no effort in traveling back and forth; thanks to this support, I was able to complete my studies within the limited time.

Most important of all, I thank my father and mother, who have cared for me and given unconditionally ever since I came into this world, letting me learn without worry. Such profound kindness can never be repaid; I dedicate this dissertation to them.

中文摘要 (Abstract in Chinese)

This dissertation examines the effectiveness of the deep learning methods that are now widely applied, i.e., machine learning models built from neural networks, in the field of natural language processing. At the same time, we carry out a series of robustness analyses on various models, chiefly by observing their resistance to adversarial input perturbations. More specifically, the experiments cover the Transformer model, a neural network built on the self-attention mechanism that has recently attracted much attention, as well as the commonly used recurrent neural networks built from long short-term memory (LSTM) cells, and we compare their results and differences when applied to natural language processing. The experiments encompass many of the most common tasks in the field, such as text classification, word segmentation and part-of-speech tagging, sentiment classification, entailment analysis, document summarization, and machine translation. The results show that the self-attention-based Transformer architecture performs better on the vast majority of tasks. Besides evaluating different network architectures, we also apply adversarial perturbations to the input data to test how the models differ in reliability, and we propose several novel methods for generating effective adversarial input perturbations. Most importantly, based on these experimental results we offer a theoretical analysis and explanation that explores the possible sources of the differences in robustness between neural network architectures.

關鍵字 (Keywords): self-attention mechanism, adversarial input, recurrent neural network, long short-term memory, robustness analysis

Abstract

In this work, we investigate the effectiveness of current deep learning methods, i.e., neural network-based models, in the field of natural language processing. In addition, we conduct a robustness analysis of various neural model architectures, evaluating their resistance to adversarial input perturbations, which in essence replace input words so that the model produces incorrect results or predictions. We compare various network architectures, including the Transformer network based on the self-attention mechanism and the commonly employed recurrent neural networks built with long short-term memory (LSTM) cells. We conduct extensive experiments covering the most common tasks in natural language processing: sentence classification, word segmentation and part-of-speech tagging, sentiment classification, entailment analysis, abstractive document summarization, and machine translation. In the process, we evaluate their effectiveness against other state-of-the-art approaches. We then estimate the robustness of different models against adversarial examples using five attack methods. Most importantly, we propose a series of novel methods to generate adversarial input perturbations and devise a theoretical analysis from our observations.

Finally, we attempt to interpret the differences in robustness between neural network models.

Keywords: Robustness, Adversarial Input, RNN, LSTM, Self-Attention

Contents

致謝 (Acknowledgments)
中文摘要 (Abstract in Chinese)
Abstract
Contents
List of Tables
List of Figures

1 Introduction
  1.1 Motivation
  1.2 Research Objectives
  1.3 Outline
  1.4 Publications

2 Background and Related Work
  2.1 Natural Language Processing
  2.2 Neural Networks

    2.2.1 Activation Functions
    2.2.2 Recurrent Neural Networks
    2.2.3 Long Short-term Memory
    2.2.4 Training
  2.3 Attention Mechanisms
    2.3.1 Self-Attention
  2.4 Adversarial Attack
    2.4.1 Pre-training and Multi-task Learning
  2.5 Evaluation Metrics

3 Methods
  3.1 Neural Networks
    3.1.1 Recurrent Neural Networks
    3.1.2 Self-Attentive Models
  3.2 Adversarial Attack Methods
    3.2.1 Random Attack
    3.2.2 List-based Attack
    3.2.3 Greedy Select & Greedy Replace
    3.2.4 Greedy Select with Embedding Constraint
    3.2.5 Attention-based Select

4 Experiments
  4.1 Text Sequence Classification in Biomedical Literature
    4.1.1 Experimental Setup
    4.1.2 Results

  4.2 Sequence Labeling
    4.2.1 Experimental Setup
    4.2.2 Results and Discussion
  4.3 Sentiment Analysis
    4.3.1 Results
    4.3.2 Quality of Adversarial Examples
  4.4 Textual Entailment
    4.4.1 Results
    4.4.2 Quality of Adversarial Examples
  4.5 Abstractive Summarization
    4.5.1 Experimental Setup
    4.5.2 Results
  4.6 Machine Translation
    4.6.1 Results
  4.7 Summary

5 Discussions
  5.1 Theoretical Analysis
    5.1.1 Sensitivity of Self-attention Layer
    5.1.2 Illustration of the Proposed Theory

6 Conclusions
  6.1 Theoretical Implications
  6.2 Unsolved Problems

Bibliography

List of Tables

3.1  Illustrative examples of semantically similar words.
4.1  Descriptive statistics of AIMed and BioInfer, the two largest PPI corpora.
4.2  Ten-fold cross-validation of the performance of various PPI classification models on corpora AIMed and BioInfer. Metrics are precision, recall, and F-score (in %) and standard deviation in parentheses. The bold numbers highlight the best performance of a column.
4.3  Cross-corpus evaluation of the F-score (in %) of various PPI classification models on corpora AIMed and BioInfer. The bold numbers highlight the best performance of a column.
4.4  Statistics of the number of words in two word segmentation datasets.
4.5  Statistics of the number of words in the 2006 NER dataset.
4.6  Word segmentation performance (% F-score) of various systems on different years of SIGHAN shared tasks, split into the Academia Sinica (AS) and City University (CU) datasets. The best performance in a column is marked bold.
4.7  NER performance (% F-score) of different systems on the 2006 SIGHAN NER shared task (open track). The best performance in a column is marked bold.

4.8  Effectiveness (% success) of different attacks on sentiment analysis models. The highest attack rate in a column is marked bold.
4.9  Example of adversarial attacks on BERT sentiment analysis models, as generated by GS-GR and GS-EC approaches. These attacks can change the output of the model such that the opposite sentiment is predicted. Notably, attacks by GS-EC utilize words that are locally coherent and fluent, possibly due to the constraint on embedding similarity. On the other hand, GS-GR attacks are more incoherent.
4.10 Human evaluations of the quality of attacks on LSTM and BERT models using GS-EC attack.
4.11 Human evaluations of GS-GR and GS-EC attacks on BERT model for sentiment analysis.
4.12 Rate of success (%) of different attacks on LSTM and BERT for the MultiNLI models (dev set). The highest attack rate in a column is marked bold.
4.13 Adversarial examples generated by GS-GR and GS-EC attacks for BERT entailment classifier.
4.14 Statistics of the LCSTS corpus.
4.15 Effects of directionality and dimension on ROUGE scores of abstractive summaries using the LCSTS corpus.
4.16 Compare ROUGE on the LCSTS corpus using different methods.
4.17 Examples of higher-quality abstractive summaries generated by our method. Some differences between the generated and golden summary are marked by underlines.

4.18 Examples of abstractive summaries with lower ROUGE scores generated by our method.
4.19 Success rate of targeted attack on translation models using the GS-EC method.
4.20 Comparison of BLEU scores using typo-based attack on translation models built with LSTM and Transformer models.
4.21 Targeted adversarial examples for machine translation models based on LSTM and Transformer (denoted by TF) with the target keyword "Art." in the output.

List of Figures

2.1  The multi-head attention module in a Transformer block.
3.1  Classification of sentence and sentence pair using BERT.
3.2  Named entity recognition using BERT.
3.3  Question answering using BERT.
4.1  Recurrent neural network-based PPI classification model.
4.2  Architecture of MONPA: multi-objective named entity & POS annotator.
4.3  Comparison of the distribution of attention scores in a model when the input is (a) the original input, (b) ASMIN-EC, and (c) ASMAX-EC attacks. The word that is selected by the attack is indicated by red boxes. Note how the selection of the target word is based on the lowest or highest attention score, as defined by ASMIN-EC and ASMAX-EC. Both attacks successfully changed the prediction of the model from positive to negative.
4.4  Shift of attention scores under GS-EC attack on (a) LSTM and (b) BERT models.
4.5  Heatmap of attention scores in LSTM and Transformer models for machine translation when the input is original and adversarial.

5.1  L2 norm of embedding variations within (a) LSTM and (b) BERT when one of the input words is swapped, as indicated by the red box.

Chapter 1  Introduction

1.1 Motivation

Research on Artificial Intelligence (AI) has become increasingly influential in recent years. In particular, Natural Language Processing (NLP), a field that combines linguistic theories with techniques from computer science, is poised to become the center of AI applications. This is not surprising, since the ability to use language is a crucial part of human life as well as intelligence; without NLP, integrating machines with human intelligence would be infeasible. Meanwhile, machine learning (ML) methods have been widely applied to a broad spectrum of problems in this field. Currently, prominent ML models rely on neural network-based architectures to obtain state-of-the-art results on many NLP tasks, e.g., document classification, sentiment analysis, and machine translation (MT). Notably, self-attention-based models have received a surge of recognition in the past few years. Models of this type, including the Transformer [70] and "Bidirectional Encoder Representations from Transformers" (BERT) [18], rely on the attention mechanism [46] to learn a context-dependent word representation.

In particular, BERT was proposed more recently in an attempt to encode even richer contextual information into the vector representation of words. Compared with recurrent neural networks (RNN), these Transformer-based models encode input more efficiently while maintaining the capacity to incorporate broader contextual information. A common pre-training method for such models involves a unidirectional language model objective, whereas BERT exploits a new procedure that randomly drops some of the input words and an alternative objective that tries to recover the missing words using only their neighboring tokens. In other words, BERT is pre-trained with multiple goals that strengthen its encoding capability. This design prompts the model to learn a combined representation that fuses both the left and right context, creating a bidirectional feature extractor. In addition, a "next sentence prediction" objective is included, where the model has to classify whether a pair of input sentences is extracted from two consecutive locations in the training corpus. Subsequently, this pre-trained model can be easily fine-tuned to perform a wide variety of downstream tasks. The BERT model obtains state-of-the-art results on numerous NLP problems, e.g., classification and question answering, often surpassing carefully engineered task-specific models. Thus, it is fast becoming a core element in solving a wide variety of NLP tasks.

Nevertheless, the structure underlying BERT and the Transformer, i.e., self-attention, requires further investigation. Given their success, the robustness of these models against adversarial attacks, compared with other neural networks, is yet to be studied. An adversarial attack applies tiny perturbations to the input of the model, thereby creating a so-called adversarial example in which the change is humanly imperceptible. Such an example can trick the model into making an error in prediction [24]. Unlike in computer vision research, it is challenging to come up with a textual adversarial example that is not easily detected by humans and still misguides the machine [3, 39, 54, 82].

However, recent work reveals that these models are vulnerable to adversarial examples of acceptable quality [5, 37]. Such adversarial inputs are still interpreted correctly by human evaluators, yet trick ML models into producing incorrect results. Consider, for example, a review that says, "We had a great experience 6 months ago, but last night was strikingly different." A human reader infers that the author does not hold a positive sentiment, whereas a machine can easily be fooled by the majority of positive-sounding words in the sentence and classify it as positive. In this respect, the fragility of statistical machine learning models resembles that of traditional rule-based ones: they can be unaware of the meaning of novel data that was not present in the training set. In other words, the generality of these models is not guaranteed. These problems urge us to investigate the robustness of current deep learning models when applied to NLP applications.

1.2 Research Objectives

The main focus of this dissertation is evaluating the capabilities of deep neural networks on a number of fundamental tasks in natural language processing. In particular, we investigate neural networks with different structures, including recurrent and self-attentive models. The goal of this work is manifold. Most importantly, we attempt to answer the following research questions:

1. When compared with traditional recurrent networks, are self-attentive (Transformer) models more robust to adversarial examples?

2. Why does one network structure outperform the other with regard to robustness?

3. Can we use the attention weights in self-attentive neural networks to exploit vulnerabilities in these models?

In short, this work verifies the robustness of recurrent and self-attentive models. This is accomplished by performing adversarial attacks and analyzing their effects on model predictions. In addition, we evaluate the possibility of employing context-dependent word representations to devise metrics for measuring the semantic similarity between adversarial and actual input sentences. The experiments in this dissertation include two common self-attentive neural networks: (a) the Transformer for neural machine translation, and (b) BERT for sentiment and entailment classification. The compared methods are mainly recurrent neural networks.

This work is unique in a number of aspects. First, we examine the robustness of uni- and bi-directional self-attentive models as compared to RNNs. Second, we provide detailed observations of the internal variations of models under attack. Third, we devise novel attack methods that take advantage of the embedding distance to maximize semantic similarity between real and adversarial examples.

To the best of our knowledge, this work brings forth the following contributions.

1. We conduct comprehensive experiments to examine the robustness of LSTM, Transformer, and BERT. Our results show that both self-attentive models, whether pre-trained or not, are more robust than LSTM models.

2. We propose novel algorithms to generate more natural adversarial examples that both preserve the semantics and mislead the classifiers.

3. We provide theoretical explanations to support the statement that self-attentive structures are more robust to small adversarial perturbations.

1.3 Outline

The remaining chapters are organized as follows. Chapter 2 introduces a spectrum of essential background that helps the reader grasp the main concepts of this dissertation, including prior work on machine learning, especially neural networks, as well as methods and tasks that are prominent in natural language processing. Next, Chapter 3 describes our approaches for adapting pre-trained models to various NLP-related tasks, including classification, sequence labeling, sentiment analysis, entailment, and machine translation. The experiments on these tasks are presented in Chapter 4. Chapter 5 provides a discussion of the theoretical aspects of the experiments on the robustness of self-attentive models as compared with recurrent neural networks. Finally, we conclude this work in Chapter 6, where we summarize the results from the previous chapters and propose directions for future work.

1.4 Publications

The current dissertation is based upon previous work by the author and other collaborators, listed below.

1. Yu-Lun Hsieh, Minhao Cheng, Da-Cheng Juan, Wei Wei, Wen-Lian Hsu, Cho-Jui Hsieh, "On the Robustness of Self-Attentive Models," in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL 2019).

2. Yu-Lun Hsieh, Yung-Chun Chang, Nai-Wen Chang, Wen-Lian Hsu, "Identifying Protein-protein Interactions in Biomedical Literature using Recurrent Neural Networks with Long Short-Term Memory," in Proceedings of the Eighth International Joint Conference on Natural Language Processing (IJCNLP 2017).

3. Yu-Lun Hsieh, Yung-Chun Chang, Yi-Jie Huang, Shu-Hao Yeh, Chun-Hung Chen, Wen-Lian Hsu, "MONPA: Multi-objective Named-entity and Part-of-speech Annotator for Chinese using Recurrent Neural Network," in Proceedings of the Eighth International Joint Conference on Natural Language Processing (IJCNLP 2017).

4. Yu-Lun Hsieh, Shih-Hung Liu, Kuan-Yu Chen, Hsin-Min Wang, Wen-Lian Hsu, Berlin Chen, "Exploiting Sequence-to-Sequence Generation Framework for Automatic Abstractive Summarization," in Proceedings of the 28th International Conference on Computational Linguistics and Speech Processing (ROCLING 2016).

Other work related to this topic includes:

5. Yu-Lun Hsieh, Yung-Chun Chang, Chun-Han Chu, Wen-Lian Hsu, "How Do I Look? Publicity Mining From Distributed Keyword Representation of Socially Infused News Articles," in Proceedings of The Fourth International Workshop on Natural Language Processing for Social Media (collocated with EMNLP 2016).

6. Yu-Lun Hsieh, Shih-Hung Liu, Yung-Chun Chang, Wen-Lian Hsu, "Neural Network-Based Vector Representation of Documents for Reader-Emotion Categorization," in Proceedings of the 2015 IEEE International Conference on Information Reuse and Integration (IRI), pp. 569–573, San Francisco, CA, USA, 2015.

7. Yu-Lun Hsieh, Shih-Hung Liu, Yung-Chun Chang, Wen-Lian Hsu, "Distributed Keyword Vector Representation for Document Categorization," in Proceedings of the 2015 Conference on Technologies and Applications of Artificial Intelligence (TAAI), pp. 245–251, 2015.

The following publications were also completed during the course of the Ph.D.

8. Zheng-Wen Lin, Yung-Chun Chang, Chen-Ann Wang, Yu-Lun Hsieh, Wen-Lian Hsu, "CIAL at IJCNLP-2017 Task 2: An Ensemble Valence-Arousal Analysis System for Chinese Words and Phrases," in Proceedings of the IJCNLP 2017, Shared Tasks.

9. Shih-Hung Liu, Kuan-Yu Chen, Yu-Lun Hsieh, Berlin Chen, Hsin-Min Wang, Hsu-Chun Yen, Wen-Lian Hsu, "Exploiting Graph Regularized Nonnegative Matrix Factorization for Extractive Speech Summarization," in Proceedings of APSIPA 2016.

10. Shih-Hung Liu, Kuan-Yu Chen, Yu-Lun Hsieh, Berlin Chen, Hsin-Min Wang, Hsu-Chun Yen, Wen-Lian Hsu, "Exploring Word Mover's Distance and Semantic-Aware Embedding Techniques for Extractive Broadcast News Summarization," in Proceedings of INTERSPEECH 2016.

11. Ting-Hao Yang, Yu-Lun Hsieh, You-Shan Chung, Cheng-Wei Shih, Shih-Hung Liu, Yung-Chun Chang, Wen-Lian Hsu, "Principle-Based Approach for Semi-Automatic Construction of a Restaurant Question Answering System from Limited Datasets," in Proceedings of the 2016 IEEE International Conference on Information Reuse and Integration (IRI), pp. 520–524, Pittsburgh, PA, 2016.

12. Nai-Wen Chang, Hong-Jie Dai, Yu-Lun Hsieh, Wen-Lian Hsu, "Statistical Principle-Based Approach for Detecting miRNA-Target Gene Interaction Articles," in Proceedings of the 2016 IEEE 16th International Conference on Bioinformatics and Bioengineering (BIBE), 2016.

Chapter 2  Background and Related Work

In this chapter, we briefly review the foundations of natural language processing research, which is the main subject of this work. We then describe the fundamental technologies that are utilized in the rest of this dissertation, and introduce in more detail the essential model under investigation, namely neural networks.

2.1 Natural Language Processing

Natural language processing (NLP) has been an essential part of the development of artificial intelligence since as early as the 1950s [69]. The main aim of this field is to design models and methods for a computer, or any machinery, to store, process, and eventually understand human languages. There are many levels of processing when dealing with language. Depending on the language, one may need to perform lemmatization or stemming first; these steps break words up into smaller, meaningful parts. A related task is word segmentation, in which one has to find word boundaries within a sentence where they are not explicitly marked. Chinese, Japanese, and Thai are examples of languages in this category.

Part-of-speech (POS) labeling, or tagging, refers to assigning to each word a label that represents its part of speech. POS tags are classes of words used for grammatical description; these classes include the verb, the noun, the pronoun, the adjective, the adverb, the preposition, the conjunction, the article, and the interjection [29]. Other major components of NLP include:

Syntactic (constituency) parsing involves creating a structured representation of the syntactic relationships among the words.

Dependency parsing aims at identifying the subject, object, and predicates of a sentence. It is done by labeling the relationship between one word and another.

Named entity recognition mainly focuses on finding the entities in a sentence, including persons, places, organizations, etc.

Sentiment analysis, or opinion mining, refers to identifying the affective content in text. It is commonly employed to analyze product reviews, survey responses, social media, etc., for use in applications such as marketing or customer service.

Entailment detection aims to determine the directional relationship between statements. Given a piece of text T and a hypothesis H, we say T entails H if, upon reading T, one would infer that H is very likely to be true. The relationship is directional: the reverse does not necessarily hold, i.e., H does not necessarily entail T.

Machine translation models generate the translation from one language to another based on the training data in a bilingual corpus. The field can be traced back to the idea proposed by Weaver [72] that a machine can be utilized to handle this task. Traditionally, statistical machine translation was the common approach; in recent years, the application of neural networks has boosted the performance to a new peak.

Summarization has, in the past, received more attention in its extractive form, while abstractive summarization was rather rare. In view of the recent success of deep learning, research on abstractive summarization has been growing, and recent literature has preliminarily verified the effectiveness of RNNs on abstractive (rewritten) summarization of documents. Moreover, the contribution of the attention mechanism has been noticed by many: it can increase the weight of key segments while generating text, thereby composing a better summary.

2.2 Neural Networks

The NLP community is widely adopting Artificial Neural Networks (ANN) for various topics in this field. Thus, we begin this section by supplying an overview of neural networks and their elementary ingredients.

An ANN can be regarded as a series of functions strung together, the majority of which are non-linear. It is important to note that an ANN can learn to replicate linear or logistic regression, as well as other fundamental statistical machine learning models [40]. To illustrate, we consider the logistic regression problem as an example. A multi-class logistic regression can be represented as:

    f(x) = Wx + b
    g(y) = \mathrm{softmax}(y)                            (2.2.1)

where W \in R^{C \times d} denotes the weights of the ANN in matrix form, x \in R^d the input vector, b \in R^C the bias, and y \in R^C the output; C is the number of classes and d is the dimensionality of the input. We subsequently use \theta as a shorthand for \{W, b\}, the set of parameters of this ANN. As such, g(f(x)) represents logistic regression in the form of a composition of f and g. When we model this with an ANN, f is a fully connected layer and g an activation function, in this case the softmax. Note that the activation functions of ANNs are non-linear.

Typically, an ANN contains more than one "layer" of the above computation and is therefore called a "deep neural network" (DNN). These layers are connected, or stacked, together, with the output of one layer being the input of the next. The "hidden" layers, i.e., those between the input and output, typically use activation functions other than the softmax, e.g., ReLU or tanh. For the output layer, the softmax and sigmoid functions are commonly used, based on the assumption that the output of an ANN can be regarded as a categorical or Bernoulli distribution. Thus, both linear and logistic regression can be approximated by an ANN with just one layer, in which the former uses the identity function and the latter a non-linear activation function. A fully connected feed-forward ANN is sometimes referred to as a multilayer perceptron (MLP) [64]:

    h = \sigma(W_1 x + b_1)
    y = \mathrm{softmax}(W_2 h + b_2)                     (2.2.2)

where \sigma denotes the activation function. Note that the layers typically have independent weight matrices; here, the first hidden layer has weight matrix W_1 and bias b_1. The process of obtaining h, the output of a layer, and feeding it to the following layer as its input is called "forward propagation." The output of the final layer, y, can be thought of as the output of the whole ANN.
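As a concrete illustration, the following is a minimal NumPy sketch of the forward propagation in Eq. 2.2.2; the layer sizes, random weights, and toy input are invented for this example rather than taken from any model in this work.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mlp_forward(x, W1, b1, W2, b2):
    """Forward propagation of the one-hidden-layer MLP in Eq. 2.2.2."""
    h = np.tanh(W1 @ x + b1)          # hidden layer with a non-linear activation
    y = softmax(W2 @ h + b2)          # output layer as a categorical distribution
    return y

# Toy dimensions: d = 4 input features, 8 hidden units, C = 3 classes.
rng = np.random.default_rng(0)
d, hidden, C = 4, 8, 3
W1, b1 = rng.normal(size=(hidden, d)), np.zeros(hidden)
W2, b2 = rng.normal(size=(C, hidden)), np.zeros(C)
x = rng.normal(size=d)

probs = mlp_forward(x, W1, b1, W2, b2)
print(probs, probs.sum())             # class probabilities summing to 1
```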

Recall that, mathematically, the composition of two or more linear functions is itself a linear function; a model built purely in this manner therefore has limited expressive power. Thus, the use of non-linear activation functions is at the heart of the success of deep neural networks. Some more advanced ANNs are designed to share weights across layers, sometimes referred to as "weight tying." It is often used as a way of reducing the number of parameters in a model, and it has the added benefit of creating an inductive bias. Such a bias may increase the ability of the model to generalize, as examined in previous work [62]. For example, a general pre-trained model can be transferred to various downstream tasks thanks to this generality, a technique widely used in current deep learning models.

2.2.1 Activation Functions

Activation functions are mathematical equations that determine the output of a cell in a neural network. Much like in a biological neural network, this function is imposed upon each neuron so that the output value represents its activated state. In addition, this function can sometimes act as a regularization of the output. For current ANNs, we require one extra property of the activation function: it must be differentiable. One of the most common functions is the sigmoid:

    \sigma(x) = \frac{1}{1 + e^{-x}}                      (2.2.3)

Another function, the softmax, takes as input a vector of K real numbers and normalizes it into a probability distribution of K probabilities proportional to the exponentials of the input.

Specifically,

    \mathrm{softmax}(x)_i = \frac{e^{x_i}}{\sum_{j=1}^{K} e^{x_j}}  for i = 1, \dots, K and x = (x_1, \dots, x_K) \in R^K    (2.2.4)

The sigmoid and softmax are generally adopted at the last, i.e., output, layer of the ANN. For the hidden layers in between, we mostly employ the rectified linear unit (ReLU):

    \mathrm{ReLU}(x) = \max(0, x)                         (2.2.5)

Yet another activation function, the hyperbolic tangent or tanh, outputs values in the range (-1, 1).

Recently, other functions such as the Exponential Linear Unit (ELU) and the Gaussian Error Linear Unit (GELU) have been proposed [19, 30]. ELU aims to speed up the learning process of deep neural networks while retaining high classification accuracy. Part of the ELU is similar to ReLU: the identity function handles the positive section of the input values, in order to tackle the vanishing gradient problem. On the other hand, ELU has the unique property of allowing negative values. This trait serves as a normalization factor very similar to batch normalization, shifting the average activation of the units towards zero, but unlike batch normalization it requires no extra computation overhead.
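The activation functions above can be written down directly; the snippet below is a small NumPy sketch of Eqs. 2.2.3 to 2.2.5 together with tanh, applied to an arbitrary toy input.

```python
import numpy as np

def sigmoid(x):
    """Eq. 2.2.3: squashes inputs into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    """Eq. 2.2.4: normalizes a vector into a probability distribution."""
    e = np.exp(x - np.max(x))          # subtract max for numerical stability
    return e / e.sum()

def relu(x):
    """Eq. 2.2.5: identity for positive inputs, zero otherwise."""
    return np.maximum(0.0, x)

x = np.array([-2.0, 0.0, 1.0, 3.0])
print(sigmoid(x))        # values in (0, 1)
print(softmax(x).sum())  # 1.0
print(relu(x))           # [0. 0. 1. 3.]
print(np.tanh(x))        # values in (-1, 1)
```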

2.2.2 Recurrent Neural Networks

Among the different types of ANNs, the recurrent neural network (RNN) is one that is suitable for learning from sequential input [21]. Recurrent neural networks have become a popular solution for single-sequence as well as sequence-to-sequence tasks. In particular, prominent network structures such as 'seq2seq' [4, 15, 65] are increasingly being applied to a wide variety of problems. Moreover, some tasks that were once considered difficult, including machine translation (MT) and language modeling (LM), are seeing explosive advances when deep neural networks are incorporated [38, 46]. Typically, an encoder-decoder scheme is adopted for these tasks, where the input sequence is encoded by an encoder, and a subsequent decoder generates a (sequential) output.

Alternatively, we can view an RNN as a feed-forward neural network in which all layers share the same set of parameters. Note, however, that rather than having a fixed number of layers, the 'depth' of this type of network depends on the length of the input sequence, and each element of the input sequence can be treated as the input of one layer. More formally, an RNN maintains a vector h_t, a hidden state or memory stored at each time step t. Upon receiving the input at time step t, the network updates its state and produces an output as follows:

    h_t = \sigma_h(W_h x_t + U_h h_{t-1} + b_h)
    y_t = \sigma_y(W_y h_t + b_y)                         (2.2.6)

where \sigma_h and \sigma_y are activation functions. In this formulation, the weights U_h transform the previous hidden state h_{t-1}, and W_h the current input x_t; a bias term b_h can also be added. These calculations update the state vector h_t, from which the RNN produces an output y_t.

However, as we can infer from the above formulation, an RNN can be less effective when modeling a sequential input whose number of time steps exceeds a certain amount. Moreover, an RNN depends on the previous calculation results to produce the next one, so an increasing amount of effort has been devoted to finding a mechanism that can replace recurrence. One possible direction is to use attention [4], motivated by the goal of combining the efficiency of attention computation with the ability to learn positional information. It has been shown [70] to achieve outstanding performance on a multitude of language pairs in MT. We will introduce attention-based models later in this chapter.
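The recurrence in Eq. 2.2.6 can be unrolled explicitly, as in the following NumPy sketch; the dimensions and random parameters are placeholders for illustration only, and the output activation is left as the identity (logits).

```python
import numpy as np

def rnn_forward(xs, W_h, U_h, b_h, W_y, b_y):
    """Unrolls the simple RNN of Eq. 2.2.6 over a sequence of input vectors."""
    h = np.zeros(U_h.shape[0])            # initial hidden state h_0
    outputs = []
    for x_t in xs:                        # one "layer" per time step
        h = np.tanh(W_h @ x_t + U_h @ h + b_h)
        y_t = W_y @ h + b_y               # logits; apply softmax/sigmoid as needed
        outputs.append(y_t)
    return np.stack(outputs), h

# Toy setup: sequence of 5 steps, 4-dimensional inputs, 8 hidden units, 3 outputs.
rng = np.random.default_rng(1)
d_in, d_h, d_out, T = 4, 8, 3, 5
params = (rng.normal(size=(d_h, d_in)), rng.normal(size=(d_h, d_h)), np.zeros(d_h),
          rng.normal(size=(d_out, d_h)), np.zeros(d_out))
xs = rng.normal(size=(T, d_in))
ys, h_T = rnn_forward(xs, *params)
print(ys.shape, h_T.shape)                # (5, 3) (8,)
```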

2.2.3 Long Short-term Memory

The Long Short-term Memory (LSTM) cell was proposed to address the deficiency of simple RNNs in learning longer sequences [27, 31]. Experimental results show that an LSTM can remember a longer span of the sequential input than a traditional RNN, a trait that is especially important for NLP applications. In essence, an LSTM is an augmented RNN with extra weights that determine how much information to "remember" and how much to "forget." This is done through a forget gate f_t, an input gate i_t, and an output gate o_t, whose values all depend on the current input x_t and the previous memory. In this way, the cell learns which portion of its state to keep or discard. Formally, given the current input x_t, the previous output h_{t-1}, and the cell state c_t, the following formulas let the model learn what to forget from the past and what to remember in the moment:

    i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)
    f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)
    o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)
    \tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c)
    c_t = f_t \circ c_{t-1} + i_t \circ \tilde{c}_t
    h_t = o_t \circ \tanh(c_t)                            (2.2.7)

where \sigma designates the sigmoid function and "\circ" the element-wise product between vectors.
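For illustration, the following NumPy sketch implements one step of the LSTM update in Eq. 2.2.7; the way the gate weights are packed into dictionaries, and the toy dimensions, are choices made for this example and do not reflect how any particular library stores its parameters.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step following Eq. 2.2.7.

    W, U, b are dicts keyed by gate name ('i', 'f', 'o', 'c'); this packing is
    only for illustration."""
    i_t = sigmoid(W['i'] @ x_t + U['i'] @ h_prev + b['i'])    # input gate
    f_t = sigmoid(W['f'] @ x_t + U['f'] @ h_prev + b['f'])    # forget gate
    o_t = sigmoid(W['o'] @ x_t + U['o'] @ h_prev + b['o'])    # output gate
    c_hat = np.tanh(W['c'] @ x_t + U['c'] @ h_prev + b['c'])  # candidate cell state
    c_t = f_t * c_prev + i_t * c_hat                          # element-wise products
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t

rng = np.random.default_rng(2)
d_in, d_h = 4, 8
W = {k: rng.normal(size=(d_h, d_in)) for k in 'ifoc'}
U = {k: rng.normal(size=(d_h, d_h)) for k in 'ifoc'}
b = {k: np.zeros(d_h) for k in 'ifoc'}
h, c = np.zeros(d_h), np.zeros(d_h)
for x_t in rng.normal(size=(5, d_in)):    # run over a toy 5-step sequence
    h, c = lstm_step(x_t, h, c, W, U, b)
print(h.shape, c.shape)                   # (8,) (8,)
```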

Graves et al. [26] proposed an intuitive extension to the LSTM, the Bidirectional LSTM (BiLSTM). In essence, it involves creating two separate LSTMs, one receiving the original input sequence and the other a reversed copy. The two LSTMs model the sequential input separately and independently. This design is now widely used in virtually all NLP models [79].

2.2.4 Training

The training (learning) phase of a neural network currently relies on stochastic gradient descent (SGD). However, a typical network consists of more than one layer, so the gradient of the loss function with respect to each parameter cannot be written down directly. Thus, a procedure named "backpropagation," or "backprop" (BP), is commonly adopted [63].

BP applies the chain rule of calculus to determine the gradients. Let x \in R^m be the input to a neural network and y \in R^n the output of the penultimate layer; we can formulate the network as:

    y = g(x)
    z = f(y) = f(g(x))                                    (2.2.8)

where z is a scalar output of the network. The gradient of z with respect to every element x_i of x can be written as:

    \frac{\partial z}{\partial x_i} = \sum_j \frac{\partial z}{\partial y_j} \frac{\partial y_j}{\partial x_i}    (2.2.9)

Since the inputs are vectors, the gradient \nabla_x z can be computed by the following multiplication:

    \nabla_x z = \left( \frac{\partial y}{\partial x} \right)^{\top} \nabla_y z    (2.2.10)

where \frac{\partial y}{\partial x} \in R^{n \times m} denotes the Jacobian matrix of g, which contains all partial derivatives. As described in [25], for every operation in the forward pass of the NN, BP derives the corresponding Jacobian-gradient product. Let

    J = L(\hat{y}, y)                                     (2.2.11)

be the loss function, to be minimized, of a certain task performed by a neural network. This NN has K layers with weights W_k and biases b_k, where k \in \{1, \dots, K\}. First, we perform the forward calculation on the input x, starting from the first layer and ending at the last (output) layer, yielding the output vector \hat{y}, from which we obtain the loss J. BP then acts in the reverse order: it starts by calculating the gradient \nabla_{\hat{y}} J with respect to the output \hat{y}, and subsequently obtains the partial derivatives of the parameters W_k and b_k for each layer until the first one is reached. In such a process, the derivatives of the deeper layers (those close to the output) must be obtained before the shallower layers can be considered, because the values of the deep layers depend on those of the shallow layers. Finally, the SGD algorithm applies the gradients to the parameters and completes the optimization step. Typically, this procedure is repeated for a certain number of "epochs," or traversals of the entire training dataset.
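To make the forward and backward procedure concrete, the following is a minimal NumPy sketch of one training step for the one-hidden-layer MLP of Eq. 2.2.2 under a cross-entropy loss, applying the chain rule of Eqs. 2.2.9 and 2.2.10 by hand; the dimensions, data, and learning rate are invented for illustration and this is a teaching sketch, not an optimized trainer.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def sgd_step(x, target, W1, b1, W2, b2, lr=0.1):
    """One forward + backward pass for the MLP of Eq. 2.2.2 with cross-entropy
    loss, followed by a plain in-place SGD update."""
    # Forward pass
    h = np.tanh(W1 @ x + b1)
    p = softmax(W2 @ h + b2)
    loss = -np.log(p[target])

    # Backward pass (chain rule, Eqs. 2.2.9 and 2.2.10)
    d_logits = p.copy()
    d_logits[target] -= 1.0              # gradient of cross-entropy w.r.t. logits
    dW2 = np.outer(d_logits, h)
    db2 = d_logits
    dh = W2.T @ d_logits                 # Jacobian-transpose times upstream gradient
    d_pre = dh * (1.0 - h ** 2)          # derivative of tanh
    dW1 = np.outer(d_pre, x)
    db1 = d_pre

    # SGD update
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2
    return loss

rng = np.random.default_rng(3)
W1, b1 = 0.1 * rng.normal(size=(8, 4)), np.zeros(8)
W2, b2 = 0.1 * rng.normal(size=(3, 8)), np.zeros(3)
x, target = rng.normal(size=4), 2
for epoch in range(20):                  # loss should decrease on this single example
    loss = sgd_step(x, target, W1, b1, W2, b2)
print(round(loss, 4))
```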

When training RNNs, the gradients should be passed along the time-step axis rather than through the depth-wise procedure of other types of NNs. To do so, we must employ a technique known as backpropagation through time (BPTT) [73]. Practical hardware limitations prevent us from running BP indefinitely, so we normally set a window on the time axis within which BP operates. One might expect that a wider window would help the RNN see and model a wider context; in practice, however, we often find that the network is then unable to learn anything. Upon careful inspection, the problem of exploding or vanishing gradients is found in these situations. Consider the forward operation of an RNN, in which the state vector is repeatedly updated by multiplication with the weights. When BP is in action, the gradients undergo the same process multiple times, so they may become exceedingly large (explode) or small (vanish). As a result, the network cannot be optimized. The LSTM (Section 2.2.3) was proposed to alleviate this exploding or vanishing gradient problem.

2.3 Attention Mechanisms

The "attention" in neural networks can be thought of as a type of weighting. There are multiple functions for obtaining attention scores, of which the additive method [4] and the multiplicative method [46] are among the most widely used. The multiplicative method is sometimes scaled by a factor of 1/\sqrt{d_k}, which is the form used in BERT-related models and is called "Scaled Dot-Product Attention." In an attention block, the input is a series of vectors named "queries" and "keys" with dimension d_k, and "values" of dimension d_v. The dot product between a query and all keys is calculated and normalized by dividing by \sqrt{d_k}; the softmax function is then applied, and the resulting weights on the values are output. We can parallelize the attention calculation by computing it on a batch of inputs at the same time. More specifically, the query, key, and value vectors are collated into matrices Q, K, and V, respectively, and we perform matrix operations to obtain the outputs as:

    \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right) V    (2.3.1)
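A direct NumPy sketch of Eq. 2.3.1 is shown below; the sequence length and dimensions are arbitrary, and in the self-attention case the queries, keys, and values all come from the same sequence.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Eq. 2.3.1: softmax(Q K^T / sqrt(d_k)) V for a whole sequence at once.

    Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (n_q, n_k) compatibility scores
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V, weights                      # weighted sum of values

rng = np.random.default_rng(4)
n, d_k = 6, 16
Q = K = V = rng.normal(size=(n, d_k))                # self-attention: all three from the same sequence
out, attn = scaled_dot_product_attention(Q, K, V)
print(out.shape, attn.shape, attn.sum(axis=-1))      # (6, 16) (6, 6) rows sum to 1
```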

Conceptually, an attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key. Additive attention, on the other hand, employs a fully connected NN to determine the compatibility, or similarity, between two vectors. Its score is calculated as follows:

    \mathrm{Attention}_{add}(s_t, h_i) = v_a^{\top} \tanh(W_a [s_t; h_i])    (2.3.2)

where s, h are state vectors and v_a, W_a are parameters that the model needs to learn. Comparing the two definitions, the matrix multiplication in dot-product attention can take advantage of modern GPU hardware to perform fast calculations, which is one of the main reasons for recent improvements in using attention for language modeling.

2.3.1 Self-Attention

Recently, Vaswani et al. [70] proposed the "Transformer" model, a novel method that depends solely on self-attention to learn vector representations of a sequence. The heart of the Transformer is a multi-head self-attention unit, which transforms the input vectors into a representation formed by multiple mixtures learned by the model. For each head, the input is first linearly projected by a set of three weight matrices, as in the previous section, into three vectors (Q, K, V); an attention weight is then calculated using the dot-product attention in Eq. 2.3.1. It is called "self"-attention because the attended elements are the input sequence itself. Note also that the number of heads effectively indicates the number of weight matrix sets. Figure 2.1 shows a schematic of the Transformer model.
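The multi-head construction can be sketched as follows, reusing the scaled dot-product attention of Eq. 2.3.1 per head; the parameter shapes and their packing into arrays are assumptions made for this illustration and are not the layout of any specific implementation.

```python
import numpy as np

def softmax_rows(s):
    s = s - s.max(axis=-1, keepdims=True)
    e = np.exp(s)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_self_attention(X, W_q, W_k, W_v, W_o):
    """Multi-head self-attention over a sequence X of shape (n, d_model).

    W_q, W_k, W_v have shape (heads, d_model, d_head); W_o has shape
    (heads * d_head, d_model). This packing is only for illustration."""
    heads = []
    for Wq_h, Wk_h, Wv_h in zip(W_q, W_k, W_v):
        Q, K, V = X @ Wq_h, X @ Wk_h, X @ Wv_h            # per-head projections
        A = softmax_rows(Q @ K.T / np.sqrt(Q.shape[-1]))  # Eq. 2.3.1
        heads.append(A @ V)
    return np.concatenate(heads, axis=-1) @ W_o           # concat heads, project back

rng = np.random.default_rng(5)
n, d_model, n_heads = 6, 32, 4
d_head = d_model // n_heads
X = rng.normal(size=(n, d_model))
W_q, W_k, W_v = (rng.normal(size=(n_heads, d_model, d_head)) * 0.1 for _ in range(3))
W_o = rng.normal(size=(n_heads * d_head, d_model)) * 0.1
print(multi_head_self_attention(X, W_q, W_k, W_v, W_o).shape)   # (6, 32)
```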

[Figure 2.1: The multi-head attention module in a Transformer block.]

A distinguishing trait of this type of model is that it frees us from the recurrent part of previous neural networks, and even from convolution calculations. The Transformer relies entirely on attention weights to denote global correspondence between input and output symbols (words). Thus, the degree of parallelization can be much higher than with RNNs; as an indication, a machine translation system trained in this way achieves state-of-the-art outcomes in just 12 hours.

However, this type of model does not come without weaknesses. First, it cannot directly learn the order of the input, because there is no recurrent state or convolution. Therefore, another type of embedding, the position embedding, is incorporated to represent the relative order of the elements in the input sequence. It is defined as follows:

    x_i = (emb_{word_i} \oplus emb_{tag_i}) + emb_{pos_i}    (2.3.3)

where emb_{pos_i} is the position embedding of the i-th position.

We use sine and cosine functions of different frequencies to compose the position embeddings, as stated in [70]. The position embedding is added to the word embedding, which represents the linguistic information of the word, and its dimensionality is identical to that of the other embeddings so that they are compatible. Specifically, these functions are used to encode the position information [70]:

    emb_{pos, 2i}   = \sin(pos / 10000^{2i/d_{model}})
    emb_{pos, 2i+1} = \cos(pos / 10000^{2i/d_{model}})    (2.3.4)

where pos is the position and i is the dimension; each dimension of the position embedding corresponds to a sinusoid. The wavelengths of these functions form a geometric sequence (a sequence of numbers where each one is obtained by multiplying the previous one by a fixed, non-zero value) from 2\pi to 10000 \cdot 2\pi. It is hypothesized that this formulation can incorporate the relative position of the input elements into the embeddings: for any fixed offset k, the position embedding emb_{pos+k} can be represented as a linear function of emb_{pos}. It is worth noting that there are various means of learning positional information [23]. Learned positional embeddings [23] were also examined, and the results showed virtually no difference [70]. The sine and cosine functions were eventually selected because they can model sequences longer than any seen in the training data; the characteristics of these functions help the network extrapolate to unseen lengths.
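A small NumPy sketch of the sinusoidal table in Eq. 2.3.4 is given below, assuming an even model dimension; the maximum length and dimension are arbitrary choices for this example.

```python
import numpy as np

def sinusoidal_position_embeddings(max_len, d_model):
    """Builds the (max_len, d_model) table of Eq. 2.3.4: even dimensions use sine,
    odd dimensions use cosine, with geometrically increasing wavelengths."""
    pos = np.arange(max_len)[:, None]                     # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]                  # (1, d_model/2)
    angles = pos / np.power(10000.0, 2 * i / d_model)     # pos / 10000^(2i/d_model)
    emb = np.zeros((max_len, d_model))
    emb[:, 0::2] = np.sin(angles)                         # emb_{pos, 2i}
    emb[:, 1::2] = np.cos(angles)                         # emb_{pos, 2i+1}
    return emb

pe = sinusoidal_position_embeddings(max_len=50, d_model=16)
print(pe.shape)          # (50, 16)
print(pe[0])             # position 0: sines are 0, cosines are 1
```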

2.4 Adversarial Attack

The robustness of neural network models has been a prominent research topic since Szegedy et al. [66] discovered that CNN-based image classification models are vulnerable to adversarial examples. An abundance of work has been dedicated to investigating the robustness of CNN models against adversarial attacks [7, 12, 24, 54]. However, attempts to examine the robustness of NLP models are relatively few and far between. Previous work on attacking neural NLP models includes using the Fast Gradient Sign Method [24] to perturb the embeddings of RNN-based classifiers [43, 53], but such approaches have difficulty mapping from the continuous embedding space back to the discrete input space. Ebrahimi et al. [20] propose the 'HotFlip' method, which replaces the word or character with the largest difference in the Jacobian matrix. Li et al. [41] employ reinforcement learning to find the optimal words to delete in order to fool the classifier. More recently, Yang et al. [77] propose a greedy method that constructs adversarial examples by solving a discrete optimization problem. They show superior performance over previous work in terms of attack success rate, but the greedy edits usually degrade readability or significantly change the semantics. Alzantot et al. [3] propose using a pre-compiled list of semantically similar words to alleviate this issue, but this leads to a lower success rate, as shown in our experiments. We thus include the latest greedy and list-based approaches in our comparisons.

In addition, the concept of adversarial attacks has also been explored in more complex NLP tasks. For example, Jia and Liang [37] craft adversarial input to a question answering system by inserting irrelevant sentences at the end of a paragraph. Cheng et al. [14] develop an algorithm for attacking seq2seq models with specific constraints on the content of the adversarial examples. Belinkov and Bisk [5] compare typos and artificial noise as adversarial input to machine translation models. Iyyer et al. [35] propose a paraphrase generator model to produce legitimate paraphrases of a sentence; however, semantic similarity is not guaranteed.
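To illustrate the general shape of such greedy attacks, the sketch below shows a hypothetical word-replacement loop; `model_confidence` and `candidate_words` are placeholder callables standing in for a real classifier and a real substitution list, and this is not the exact procedure of any cited work or of the methods in Chapter 3.

```python
# A minimal, hypothetical sketch of a greedy word-replacement attack on a text
# classifier. `model_confidence(tokens, label)` and `candidate_words(word)` are
# placeholders for a real classifier and a real list of similar words.
def greedy_attack(tokens, true_label, model_confidence, candidate_words, max_edits=3):
    """Greedily replaces words to lower the classifier's confidence in true_label."""
    tokens = list(tokens)
    for _ in range(max_edits):
        best_drop, best_edit = 0.0, None
        base = model_confidence(tokens, true_label)
        for i, _ in enumerate(tokens):
            for cand in candidate_words(tokens[i]):       # semantically similar words
                trial = tokens[:i] + [cand] + tokens[i + 1:]
                drop = base - model_confidence(trial, true_label)
                if drop > best_drop:
                    best_drop, best_edit = drop, (i, cand)
        if best_edit is None:
            break                                         # no replacement helps further
        i, cand = best_edit
        tokens[i] = cand
        if model_confidence(tokens, true_label) < 0.5:    # prediction has flipped
            break
    return tokens
```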

In terms of comparisons between LSTMs and Transformers, Tang et al. [67] show that multi-headed attention is a critical factor in the Transformer when learning long-distance linguistic relations.

2.4.1 Pre-training and Multi-task Learning

The widely adopted workflow for BERT-related models involves two steps: pre-training (self-supervised) and fine-tuning (supervised). The pre-training procedure involves two objectives, masked language modeling (MLM) and next sentence prediction (NSP). MLM concerns the prediction of randomly masked input words, and NSP aims at predicting the relationship between two input sentences, namely whether the latter follows the former in the original corpus. Subsequently, the model is fine-tuned for different downstream tasks, where fully connected networks are added after the final encoding layer according to the requirements of the end task.

Such a scheme can be regarded as a type of multi-task learning (MTL). In recent ML research, this approach has proven successful for a wide spectrum of applications beyond NLP [16]. Traditionally, an ML model focuses on a single task during training; in doing so, we strip it of information potentially useful for other tasks as well. If the supervised goal of another task contains knowledge that can help the model learn quicker and better, it is helpful to train both tasks together. Therefore, recent approaches attempt to construct a common, general representation scheme for all tasks before specifying the supervised goals. As experiments have shown, MTL can boost the model's ability to generalize across different tasks [8]. More specifically for NLP, pre-trained word representation models such as ELMo or GPT-2 [60, 61] have been verified to greatly boost effectiveness in a broad variety of applications.

Note that the pre-training losses of these models may differ; but as long as the loss is designed to incorporate helpful auxiliary information such as linguistic knowledge, the resulting model can be stronger when learning a new task.

2.5 Evaluation Metrics

Throughout this dissertation, we use standard NLP metrics to evaluate the performance of our models. Binary classification typically adopts the accuracy measurement, defined as:

    \mathrm{Acc} = \frac{TP + TN}{TP + FP + TN + FN}      (2.5.1)

where TP, TN, FP, and FN are the numbers of true positives, true negatives, false positives, and false negatives, respectively. For multi-class classification, where each input sample is classified as one of many classes, the F1-score is typically used. It is defined as:

    F_1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}    (2.5.2)

where Precision and Recall are defined as:

    \mathrm{Precision} = \frac{TP}{TP + FP}, \quad \mathrm{Recall} = \frac{TP}{TP + FN}    (2.5.3)
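For reference, the metrics of Eqs. 2.5.1 to 2.5.3 can be computed as in the following sketch; the toy labels are made up for illustration.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, and F1 for a binary task (Eqs. 2.5.1-2.5.3)."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    acc = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
    return {"accuracy": acc, "precision": precision, "recall": recall, "f1": f1}

# Toy example: 6 gold labels vs. 6 predictions.
print(classification_metrics([1, 0, 1, 1, 0, 0], [1, 0, 0, 1, 1, 0]))
```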

For machine translation, it is typical to use the BLEU (bilingual evaluation understudy) score to evaluate the quality of the translated text [55]. It is designed to measure the correspondence between the output of a machine learning model and the golden translation. BLEU was one of the first metrics to obtain a high correlation with human judgments of quality, and it remains one of the most popular automatic and cost-effective metrics. Computing the score involves calculating the similarity between individual pieces of the translation and the reference text; the overall score is then the mean over the entire corpus. It should be noted that grammaticality, intelligibility, and semantic quality cannot be evaluated in this manner. The value of the BLEU score lies between 0 and 1, where 1 indicates that the automatic translation is identical to a segment in the golden translations.

For automatic summarization, we adopt the commonly used "Recall-Oriented Understudy for Gisting Evaluation" (ROUGE) scores [44]. The ROUGE method calculates the ratio of unit overlap between the generated results and the golden summary, where the units can be N-grams or character sequences. Specifically, we use three ROUGE calculation methods: ROUGE-1 (unigram), ROUGE-2 (bigram), and ROUGE-L (longest common subsequence), abbreviated henceforth as R-1, R-2, and R-L to improve legibility. Intuitively, R-1 can be thought of as representing the amount of information in the automatic summaries, whereas R-2 evaluates their overall fluency; R-L can be regarded as the coverage rate of the summary over the original article.
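As an illustration of the ROUGE idea, the following is a simplified ROUGE-N recall (clipped n-gram overlap divided by the number of reference n-grams); real ROUGE implementations additionally handle stemming, multiple references, and precision/F-measure variants, and R-L uses the longest common subsequence instead of n-grams.

```python
from collections import Counter

def rouge_n_recall(candidate_tokens, reference_tokens, n=1):
    """Simplified ROUGE-N: clipped n-gram overlap divided by reference n-gram count."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    cand, ref = ngrams(candidate_tokens), ngrams(reference_tokens)
    overlap = sum(min(cnt, cand[g]) for g, cnt in ref.items())
    return overlap / max(sum(ref.values()), 1)

reference = "the cat sat on the mat".split()
candidate = "the cat lay on the mat".split()
print(rouge_n_recall(candidate, reference, n=1))   # R-1 recall
print(rouge_n_recall(candidate, reference, n=2))   # R-2 recall
```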

Chapter 3 Methods

This chapter describes various neural architectures and how to adapt them for the downstream tasks, which include sequence classification, part-of-speech tagging, named entity recognition, sentiment analysis, entailment, translation, and summarization.

3.1 Neural Networks

Artificial neural networks (ANN) have become a prominent tool for natural language processing in recent years. Accordingly, a wide variety of network structures serve as the basis of the models used in the experiments of this work. Before going into the details of each task, the model architectures are briefly described in the following sections.

3.1.1 Recurrent Neural Networks

In essence, natural language inputs consist of a sequence of words or sub-word tokens whose order cannot be freely altered. Therefore, RNNs emerge as an intuitive choice.
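To make the recurrent encoders used in this chapter concrete, the following is a minimal PyTorch sketch of a bidirectional LSTM sentence encoder with an attention layer and a fully connected classifier. The dimensions and the attention parameterization are illustrative assumptions, not the exact configuration used in the experiments.

```python
import torch
import torch.nn as nn

class BiLSTMAttentionClassifier(nn.Module):
    """Bidirectional LSTM encoder with an attention layer and a
    fully connected classification layer."""

    def __init__(self, vocab_size, embed_dim=300, hidden_dim=512, num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim,
                            batch_first=True, bidirectional=True)
        self.attention = nn.Linear(2 * hidden_dim, 1)    # one score per time step
        self.classifier = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, token_ids):                        # (batch, seq_len)
        states, _ = self.lstm(self.embedding(token_ids)) # (batch, seq_len, 2*hidden)
        weights = torch.softmax(self.attention(states), dim=1)
        sentence_vec = (weights * states).sum(dim=1)     # attention-weighted sum
        return self.classifier(sentence_vec)             # class logits
```

Sequence labeling tasks such as POS tagging and NER can reuse the same bidirectional encoder, applying the classification layer at every time step instead of to the pooled sentence vector.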

We propose an approach for identifying protein-protein interaction (PPI) in biomedical literature using an RNN with LSTM cells. We employ a straightforward extension named the bidirectional RNN, which encodes sequential information in both directions (forward and backward) and concatenates the final outputs. In this way, the output of one time step contains information from both its left and right neighbors. For classification tasks, including sentiment analysis and entailment detection, we use a bidirectional LSTM [31] with an attention [4] layer as the sentence encoder, and a fully connected layer for the classification task. Similarly, for tasks such as POS tagging and NER, where the label of one character can be determined by its context, bidirectional learning can be beneficial.

For machine translation, we employ a common seq2seq model [65], in which both the encoder and decoder are 2-layer stacked Bi-LSTMs with 512 hidden units.

For abstractive summarization, we use a single-layer LSTM network with an attention mechanism, and compare uni-directional against bi-directional networks, as well as the impact of the LSTM cell dimension, the word vector dimension, and other parameters.

3.1.2 Self-Attentive Models

Self-attentive models, including the Transformer [70] and Bidirectional Encoder Representations from Transformers (BERT) [18], rely on the attention mechanism [46] to learn a context-dependent representation, or encoding. As such, self-attention has been successfully applied in several tasks. Similar to a bidirectional LSTM, this type of encoder takes $x_0, x_1, \ldots, x_n$ as the input and produces context-aware word representations $r_i$ for all positions $0 \leq i \leq n$. We employ a stack of $N$ identical self-attention layers, each having independent parameters.

The classification problems adopt the BERT model with a setup identical to the original paper, in which BERT is used as an encoder that represents a sentence as a vector. This vector is then used by a fully connected neural network for classification. Note that models are tuned separately for each task.
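A minimal sketch of this BERT-as-encoder classification setup, assuming the Hugging Face transformers library is available; the checkpoint name, label count, and example sentences are placeholders, not the exact configuration used in our experiments.

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)  # fully connected head on the [CLS] vector

# Single-sentence classification (cf. Figure 3.1a).
inputs = tokenizer("the president of the board resigned", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
predicted_class = logits.argmax(dim=-1).item()

# Sentence-pair classification (cf. Figure 3.1b): the tokenizer inserts
# the [SEP] separator between the two segments automatically.
pair = tokenizer("A soccer game is going on.",
                 "Some people are playing a sport.", return_tensors="pt")
```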

Figure 3.1: Classification of sentence and sentence pair using BERT. (a) Single sentence classification; (b) sentence pair classification.

Figure 3.2: Named entity recognition using BERT.

Figure 3.1 shows how the BERT model is used for classification tasks, Figure 3.2 shows how sequence labeling such as NER is performed with BERT, and Figure 3.3 illustrates the approach for building a question answering system with BERT. In addition, we try to determine the effect of pre-training by testing a compact version of BERT, named BERTnopt, which comprises three self-attention layers instead of 12.

To the best of our knowledge, machine translation models do not typically employ BERT. Therefore, for our MT experiments, a Transformer encoder-decoder model is utilized.

3.2 Adversarial Attack Methods

To test the robustness of various neural models, we include five methods for generating adversarial examples (attacks). These methods share a common goal: find and replace a single element of an input sequence such that the model's prediction becomes incorrect. They are introduced in this section.

Figure 3.3: Question answering using BERT.

The first method is based on random word replacement and serves as the baseline. The second (list-based) and third (greedy) methods are adapted from previous work [3, 14]. The fourth (constrained greedy) and fifth (attention-based) methods are proposed in the current work.

3.2.1 Random Attack

This basic attack method randomly selects a word in the original sequence and replaces it with another word drawn at random from the vocabulary. To fairly estimate the effect of randomness, the attack is repeated for $10^5$ trials and the results are averaged to obtain the overall score. This method is denoted as RANDOM.
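A minimal Python sketch of the RANDOM baseline; `model_predict` is a placeholder for any function that maps a token list to a predicted label, and the trial count follows the description above.

```python
import random

def random_attack(model_predict, tokens, vocabulary, true_label, trials=10**5):
    """RANDOM baseline: replace one randomly chosen word with a random
    vocabulary word and measure how often the prediction flips."""
    successes = 0
    for _ in range(trials):
        candidate = list(tokens)
        position = random.randrange(len(candidate))       # random position
        candidate[position] = random.choice(vocabulary)   # random replacement
        if model_predict(candidate) != true_label:
            successes += 1
    return successes / trials   # averaged attack success rate
```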

3.2.2 List-based Attack

The second method was recently proposed by Alzantot et al. [3] and is denoted as LIST. LIST employs a list of semantically similar words (i.e., synonyms) and replaces a word in the input sentence with another from the list to construct adversarial examples. In other words, the list is used to replace a word with one of its synonyms; this process is repeated for every word in the input sentence until the target model makes an incorrect prediction. That is, for every sentence, we start by replacing the first word with each of its synonyms, every substitution forming a new adversarial example. If none of these successfully misleads the model, we move to the next word (and the first word remains unchanged), and repeat this process until either the attack succeeds or all words have been tried.

A list of semantically similar words can be found in Table 3.1. We can see that this list enables us to find very closely related words (synonyms) to perform attacks.

Table 3.1: Illustrative examples of semantically similar words.

Word     | Similar Words
abandon  | forgo, renounce, relinquish, forego, forswear, forsake, abdicate, waive, abandons, abandoning, renounces
abate    | lessen, downsize, reduce, shortening, mitigate, mitigating, reducing, mitigation, curtail, lighten, alleviate, minimize, shorten
...      | ...
zucchini | spinach, broccoli, eggplant, celery, leeks, onion, artichokes, cauliflower, tomatoes, chard, eggplants, sauteed, tomato, artichoke, courgettes, radishes, shallots, okra, arugula, beets

3.2.3 Greedy Select & Greedy Replace

The third method (denoted as GS-GR) greedily searches for the weak spot of the input sentence [77] by replacing each word, one at a time, with a "padding" (a zero-valued vector) and examining the change in the output probability. After determining the weak spot, GS-GR replaces that word with a randomly selected word from the vocabulary to form an attack. This process is repeated until the attack succeeds or all words in the vocabulary are exhausted.
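A minimal sketch of the GS-GR procedure for a binary classifier; `model_prob(tokens, label)` is a placeholder returning the probability the model assigns to `label`, and the 0.5 threshold for a flipped prediction is a simplification for the two-class case.

```python
def gs_gr_attack(model_prob, tokens, vocabulary, true_label, pad_token="<pad>"):
    """Greedy Select and Greedy Replace (GS-GR) sketch."""
    original = model_prob(tokens, true_label)

    # Greedy select: pad out each word in turn and keep the position whose
    # removal causes the largest drop in the true-label probability.
    def probability_drop(i):
        padded = tokens[:i] + [pad_token] + tokens[i + 1:]
        return original - model_prob(padded, true_label)
    weak_spot = max(range(len(tokens)), key=probability_drop)

    # Greedy replace: try vocabulary words at the weak spot until the
    # model's prediction flips.
    for word in vocabulary:
        candidate = tokens[:weak_spot] + [word] + tokens[weak_spot + 1:]
        if model_prob(candidate, true_label) < 0.5:
            return candidate        # successful adversarial example
    return None                     # attack failed
```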

3.2.4 Greedy Select with Embedding Constraint

Although the GS-GR method potentially achieves a high success rate, the adversarial examples it forms are often unnatural; sometimes GS-GR completely changes the semantics of the original sentence by replacing the most important word with its antonym, for example, changing "this is a good restaurant" into "this is a bad restaurant." This cannot be treated as a successful attack, since humans will notice the change and agree with the model's output. The problem arises because GS-GR only considers the classification loss when finding the replacement word and largely ignores the actual semantics of the input sentence.

To resolve this issue, we propose to add a constraint on the sentence-level (not word-level) embedding: the attack must choose as the replacement the word with the minimum $L_1$ distance between the two sentence embeddings (before and after the word change). This distance constraint requires that a replacement word not alter the sentence-level semantics too much. The method is denoted as GS-EC. In the experimental results, we show that GS-EC achieves a success rate similar to GS-GR in misleading the model, while generating more natural and semantically consistent adversarial sentences.
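A minimal sketch of GS-EC under the same simplifying assumptions as the GS-GR sketch above; `sent_embed` is a placeholder for any function mapping a token list to a sentence-level embedding vector.

```python
import numpy as np

def gs_ec_attack(model_prob, sent_embed, tokens, vocabulary, true_label,
                 pad_token="<pad>"):
    """Greedy Select with Embedding Constraint (GS-EC) sketch."""
    original_prob = model_prob(tokens, true_label)
    original_emb = sent_embed(tokens)

    # Greedy select (as in GS-GR): locate the weak spot.
    def probability_drop(i):
        padded = tokens[:i] + [pad_token] + tokens[i + 1:]
        return original_prob - model_prob(padded, true_label)
    weak_spot = max(range(len(tokens)), key=probability_drop)

    # Constrained replace: among all candidates that mislead the model,
    # keep the one whose sentence embedding stays closest in L1 distance.
    best, best_dist = None, float("inf")
    for word in vocabulary:
        candidate = tokens[:weak_spot] + [word] + tokens[weak_spot + 1:]
        if model_prob(candidate, true_label) < 0.5:
            dist = np.abs(sent_embed(candidate) - original_emb).sum()
            if dist < best_dist:
                best, best_dist = candidate, dist
    return best    # None if the attack failed
```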

3.2.5 Attention-based Select

We conjecture that self-attentive models rely heavily on attention scores, and that changing the word with the highest or lowest attention score could substantially undermine the model's prediction. This attack method therefore exploits, and at the same time investigates, the attention scores as a potential source of vulnerability. The method first obtains the attention scores and then identifies a target word with the highest or lowest score. The target word is then replaced by a random word from the vocabulary, and this process is repeated until the model is misled by the generated adversarial example. These methods are denoted as ASMIN-GR, which replaces the word with the lowest score, and ASMAX-GR, which replaces the word with the highest score. Furthermore, the constraint on the embedding distance can also be imposed here to find semantically similar adversarial examples; these variants are referred to as ASMIN-EC and ASMAX-EC, respectively. As a pilot study, we examine the attention scores of the first and last layers of the BERT model to understand the model's behavior under attacks.
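A minimal sketch of the attention-based selection step, again under the simplified binary-classification assumptions used above; `attention_scores` is a placeholder returning one aggregated attention score per input token. The EC variants simply add the same $L_1$ embedding constraint as GS-EC when choosing among successful replacements.

```python
def attention_attack(model_prob, attention_scores, tokens, vocabulary,
                     true_label, use_max=True):
    """ASMAX-GR (use_max=True) or ASMIN-GR (use_max=False) sketch."""
    scores = attention_scores(tokens)
    select = max if use_max else min
    target = select(range(len(tokens)), key=lambda i: scores[i])

    # Replace the selected word until the model is misled.
    for word in vocabulary:
        candidate = tokens[:target] + [word] + tokens[target + 1:]
        if model_prob(candidate, true_label) < 0.5:
            return candidate
    return None
```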

Chapter 4 Experiments

4.1 Text Sequence Classification in Biomedical Literature

Figure 4.1: Recurrent neural network-based PPI classification model (word embeddings, bidirectional LSTM, fully connected layer, output in {Positive, Negative}).

This experiment evaluates an approach to classifying textual descriptions of protein-protein interaction (PPI) in biomedical literature. It is one of the essential tasks in this field, especially because it can serve as the basis for building a knowledge base and/or ontology for the entities, such as molecules and cells, within a sentence [52].

Table 4.1: Descriptive statistics of AIMed and BioInfer, the two largest PPI corpora.

Corpus   | Number of Sentences | Positive / Negative Protein Pairs
AIMed    | 1,955               | 1,000 / 4,834
BioInfer | 1,100               | 2,534 / 7,132

The rapid growth in the number of research papers strengthens the need for this task, and newer methods are much in demand. Here we rely on RNNs to capture the long-term relationships among words in order to identify PPIs. The proposed model is evaluated on the two largest PPI corpora, AIMed [6] and BioInfer [58], under cross-validation (CV) and cross-corpus (CC) settings. Figure 4.1 illustrates the structure of this neural network, and the descriptive statistics of the datasets used in this experiment are listed in Table 4.1.

We adopt 10-fold cross-validation (CV) and cross-corpus (CC) testing schemes for evaluation. The evaluation metrics are precision, recall, and F1-score for both schemes.

The compared methods include the shortest dependency path-directed constituent parse tree (SDP-CPT) method [59], in which the tree representation generated by a syntactic parser is refined using the shortest dependency path between two entity mentions derived from a dependency parser; the knowledge-based approach PIPE [10], which extracts linguistic interaction patterns learned by a convolution tree kernel; a composite kernel approach (CK) [51], which combines several different layers of information from a sentence and its syntactic structure by using several parsers; and a graph kernel method (GK) [2], which integrates a parse structure sub-graph and a linear order sub-graph. We further compare with recent NN-based approaches: sdpCNN [34], which combines a CNN with shortest dependency path features; McDepCNN [57], which uses positional embeddings along with word embeddings as the input; and a tree kernel using
