國立政治大學資訊科學系
Department of Computer Science, National Chengchi University

碩士論文
Master's Thesis

深度學習於中文句子之表示法學習
Deep Learning Techniques for Chinese Sentence Representation Learning

研究生:管芸辰
指導教授:蔡銘峰

中華民國一百零七年二月
February 2018


深度學習於中文句子之表示法學習
Deep Learning Techniques for Chinese Sentence Representation Learning

研究生:管芸辰 Student: Yun Chen Kuan
指導教授:蔡銘峰 Advisor: Ming-Feng Tsai

國立政治大學資訊科學系
A Thesis submitted to the Department of Computer Science, National Chengchi University, in partial fulfillment of the requirements for the degree of Master in Computer Science

中華民國一百零七年二月
February 2018

深度學習於中文句子之表示法學習

中文摘要

This thesis investigates how recently developed deep learning techniques can be applied to learning distributed representations of Chinese sentences. Deep learning has attracted great attention recently, and the related techniques have developed rapidly. However, most distributed representation methods take English and other Indo-European languages as their main evaluation targets and are designed around the properties of those languages. Besides the Indo-European family, the Sino-Tibetan and Altaic families also have large numbers of speakers, and there are independent languages such as Japanese and Korean, each with its own characteristics. Chinese belongs to the Sino-Tibetan family and has quite distinct properties, such as being an isolating language and having tones and measure words. Many recent papers use multilingual datasets for evaluation, but they rarely discuss the performance differences among languages.

This thesis uses sentence sentiment classification experiments to compare recently developed deep learning techniques with traditional word vector representations. Taking TF-IDF as the baseline, we compare the performance of PVDM, Siamese-CBOW, and FastText, and examine in depth how these models perform on Chinese sentence sentiment classification.

Deep Learning Techniques for Chinese Sentence Representation Learning

Abstract

This thesis demonstrates how deep learning methods published in recent years can be applied to Chinese sentence representation learning. Recently, deep learning techniques have attracted great attention and the related areas have grown enormously. However, most techniques use Indo-European languages as their main evaluation target and were developed around the properties of those languages. Besides the Indo-European languages, the Sino-Tibetan and Altaic families are also widely spoken, and there are independent languages such as Japanese and Korean with their own properties. Chinese belongs to the Sino-Tibetan family and has characteristics such as being an isolating language and having tones and measure words. Many recent publications use multilingual datasets to evaluate their performance, but few of them discuss the differences among languages. In this thesis we perform sentiment analysis on a Chinese Weibo dataset to quantify the effectiveness of different deep learning techniques. We compare a traditional TF-IDF model with PVDM, Siamese-CBOW, and FastText, and evaluate the models they produce.

Content

中文摘要
Abstract
1 Introduction
  1.1 Background
  1.2 Purpose
2 Related Work
  2.1 Traditional Approach
  2.2 Chinese Related Sentiment Analysis
  2.3 Advanced Approach
3 Methodology
  3.1 TF-IDF + SVM
  3.2 Fasttext
  3.3 Paragraph Vector
  3.4 Siamese-CBOW
4 Experiments
  4.1 Experimental Settings
  4.2 Preprocess
  4.3 PVDM
  4.4 FastText
  4.5 Siamese-CBOW
  4.6 Experimental Results
5 Discussions
  5.1 Discussion
  5.2 Baseline
  5.3 Siamese-CBOW
  5.4 PVDM
  5.5 FastText
    5.5.1 N-grams Evaluation
    5.5.2 Subword Information
6 Conclusion

Figure Content

Figure 3.1 Product quantization: it uses predefined centroids to approximate the distance between two points; there are two types of PQ, symmetric and asymmetric.
Figure 3.2 The architecture of FastText (image cited from [8]).
Figure 3.3 Paragraph vector: the images show the difference between the two models (images from [12]).
Figure 3.4 The architecture of Siamese-CBOW (image from the original paper [9]).
Figure 4.1 The emoticons in WeiBo
Figure 4.2 The confusion matrix for TF-IDF + SVM
Figure 4.3 Confusion matrix of Siamese-CBOW
Figure 4.4 Confusion matrix of FastText
Figure 5.1 Visualization of the vector space for two datasets

Table Content

Table 4.1 Tag Category
Table 4.2 Number of posts per category
Table 4.3 FastText Dataset
Table 4.4 Results: the best accuracy of different models
Table 4.5 FastText accuracy
Table 4.6 Results: the accuracy of Siamese-CBOW
Table 4.7 Result of PVDM
Table 5.1 The features extracted from TF-IDF
Table 5.2 The 5 most similar words to I (我) in the three PVDM models
Table 5.3 Similar words to "nǐ" (you) in the Pinyin dataset
Table 5.4 Words with low norm in the three datasets in FastText
Table 5.5 N-gram constitution of "我們的" (ours)
Table 5.6 N-gram constitution of "wǒménde" (ours) in pinyin
Table 5.7 Querying words that do not exist in the dataset

Chapter 1 Introduction

1.1 Background

How to construct sentence embeddings that capture semantics more precisely is a topic of interest, since it benefits several NLP tasks such as machine translation and sentiment analysis. The volume of internet text grows enormously and rapidly, and new derivatives and new words keep appearing as well. Chinese forums, blogs, and microblogs expand especially quickly. It becomes critical for many applications that information can be extracted efficiently and precisely.

Recently, word2vec [14] has been considered to work well for evaluating word semantics in general cases. Additionally, word2vec is largely invariant to the language it is trained on. Nevertheless, embeddings at the sentence or phrase level are more complicated, since they depend on sentence structure, intention, and context. Traditional techniques such as n-grams, bag-of-words, and part-of-speech (POS) tagging cannot overcome difficulties such as high dimensionality and limited generalization. Recent studies have tried to vectorize sentences with deep learning approaches in a more general way and to make the representation invariant to language properties.

Most datasets for NLP work are still in Indo-European languages, including English and Spanish. Wikipedia suggests that 46% of people speak an Indo-European language as their first language. Although Indo-European languages are spread widely, there are also other languages such as Chinese and Japanese, which have their own special properties. However, these language properties receive little consideration in most publications.

1.2 Purpose

So far, most studies on distributed representations of sentences have been conducted mainly in English or in multilingual environments, using data from forums and review platforms contributed by users worldwide. Most techniques also aim to be invariant to language properties or applicable to multilingual environments. However, few of them evaluate how effective these techniques are for other languages, nor do they evaluate multilingual datasets while considering the characteristics of other languages. We are interested in whether those models also work for Chinese or other languages, and whether the algorithms are invariant to grammar and other language properties. In this thesis, we demonstrate modern methods on practical data and compare them with traditional methods.

Chapter 2 Related Work

2.1 Traditional Approach

K. Dashtipour et al. [4] summarized both corpus-based and lexicon-based techniques proposed recently and listed the languages those techniques target; there are also some innovative methods combining both approaches. The advantage of the corpus-based approach is that it is dictionary-free, but it requires a relatively large corpus to build the model, while the lexicon-based approach depends mainly on existing resources to detect sentiment. When the lexicon-based approach is applied to informal articles contributed by netizens, it may suffer from problems such as misspellings, abbreviated words, or metaphors, and it cannot take the sequence of words into consideration.

A basic corpus-based approach like TF-IDF is considered able to achieve relatively good precision. However, both approaches may rely on a few keywords in the sentences rather than the sentiment of the sentence itself. In the real world, we often use negation or irony to express our feelings rather than just keywords. To handle rapidly evolving language, both semi-supervised and unsupervised approaches have been introduced as well.

Another problem with these approaches is the sparse matrix and high-dimensional vectors caused by the complexity of natural language. A sparse matrix makes classifiers inefficient and hard to scale up, and the relationship of synonymy cannot be modeled. A classic method for generating dense vectors is singular value decomposition (SVD). SVD is part of a family of methods that can approximate an N-dimensional dataset using fewer dimensions, including Principal Component Analysis (PCA), Factor Analysis, and so on. Nonetheless, computing the SVD of a large co-occurrence matrix has a significant computational cost, and performance is not always better than using the full sparse vectors.

2.2 Chinese Related Sentiment Analysis

In recent years, most models have been evaluated on English benchmarks. In a multilingual environment, the preprocessing approach may differ across languages. Traditional ways to handle word variation, such as stemming or lemmatization, are applicable to most Latin languages; in Chinese and Japanese, however, segmentation is also involved. In the FastText example [8], the authors also demonstrated converting characters into pinyin, which makes subword information available.

Though most approaches are tested and verified on English datasets, there are some publications evaluating Chinese datasets as well. Zhao et al. [18] performed basic classification of articles from WeiBo with Naive Bayes and Laplace smoothing. In that work, the authors also used emoticons as the ground truth to verify the approach, and they applied some incremental learning.

2.3 Advanced Approach

Besides the traditional approaches, researchers try to extract effective features to represent the similarity between words or phrases. Deep learning approaches have demonstrated success at handling complex problems, and they are also applied to construct more complicated and general models with continuous representations. Additionally, sentiment analysis with typical deep learning models has been conducted. Multiple tasks have been performed, such as parsing (Socher et al., 2013a), language modeling (Bengio et al., 2003; Mnih and Hinton, 2009), and NER (Turian et al., 2010). For sentiment analysis, well-known models such as CNNs [10] and RNNs [1] have also been tested, but most of them are applied to English datasets only.

Recently, word2vec (Mikolov et al., 2013) [14] has been considered an efficient way to vectorize the meaning of single words. It is a log-bilinear model that learns continuous representations of words on very large corpora efficiently. A similar concept can be adapted to the phrase or sentence level as well. Mikolov also proposed a sentence-level version [12], called PVDM, and claimed it can be applied to both short texts and long articles. These approaches are unsupervised, but they can be used for sentiment analysis with a proper transformation.

Another way to model semantics is to use an encoder-decoder model, which comes from statistical machine translation. Skip-thoughts [11] employs a GRU encoder-decoder model. It also adopts "vocabulary expansion" from [13], mapping word embeddings into the RNN encoder space. This makes it possible to build the model with a smaller vocabulary and to reuse pre-trained models.

Detecting the sentiment polarity of short texts has attracted research interest as well; the most used datasets are informal tweets, contributed by netizens from different backgrounds. For the task of sentiment classification, an effective feature learning method is to compose the representation of a sentence (or document) from the representations of the words or phrases it contains (Socher et al., 2013b; Yessenalina and Cardie, 2011).

Pang et al. [15] already used bag-of-words representations, presenting words as one-hot vectors, and obtained better classification results. However, this is still not enough to represent complex meanings or linguistic characteristics. Later works, such as [16], proposed using deep learning for sentiment analysis directly.

There is also a work [17] that evaluates multilingual and monolingual approaches. However, it used Spanish and English as targets, both of which belong to the Indo-European family. It also addressed cultural differences: "dragon" suggests something harmful in English, but the opposite in Chinese.

Chapter 3 Methodology

3.1 TF-IDF + SVM

TF-IDF stands for "term frequency–inverse document frequency". This conventional approach evaluates semantics based on the occurrence of words and terms, and it also takes the occurrence of a word in the global context into consideration: the more documents a word appears in, the less meaningful it is.

It is simple and effective, but it still suffers from disadvantages such as data sparsity and high dimensionality, which may slow down classifiers, and an inability to model synonyms.

The SVM we used is LinearSVC in sklearn, which uses a linear kernel. It uses a one-vs-rest strategy to handle multiclass cases; in other words, it generates fewer models than SVC does. We used the default parameters: the penalty is l2 and the loss function is squared hinge.

3.2 Fasttext

This approach was proposed in [8]. The structure of FastText can be considered an extension of word2vec, and it uses hierarchical softmax to compute the probabilities of predefined classes. The key difference between fasttext and word2vec, however, is that it employs bag-of-N-grams features. For example, the word vector of "apple" is a sum of the vectors of the N-grams "<ap", "app", "appl", "apple", "apple>", "ppl", "pple", "pple>", "ple", "ple>", "le>". This approach tries to take local character order into consideration and exploits partial information from the spelling.
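As a minimal sketch of this bag-of-character-n-grams idea (an illustration only, not FastText's internal code; the n-gram length bounds are assumptions), the following Python function enumerates the n-grams of a word with boundary markers:

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Enumerate the character n-grams of a word, with boundary markers,
    in the spirit of FastText's subword features."""
    token = "<" + word + ">"
    ngrams = {token}                      # the full word itself is also kept
    for n in range(min_n, max_n + 1):
        for i in range(len(token) - n + 1):
            ngrams.add(token[i:i + n])
    return sorted(ngrams)

print(char_ngrams("apple"))                     # e.g. '<ap', 'app', 'appl', ..., 'le>'
print(char_ngrams("我們的", min_n=2, max_n=3))  # short Chinese words yield very few n-grams
```

The second call hints at why the n-gram length bounds matter for Chinese, a point revisited in Section 5.5.1.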

Such subword features may also help vectorize rare words better, since rare words have fewer neighboring words to model from. It is also a countermeasure to the problem of bag-of-words models that word order is not considered. For example, the terms "我見你" (I see you) and "你見我" (you see me) contribute the same result in a bag-of-words model, whereas in the n-gram model the vector is averaged from its subsets, including "我見" (I see), "我" (I), and so on, which makes it possible to distinguish the two. While computation and space complexity increase, FastText employs the hashing trick as word2vec does. Word representations are looked up through a table and finally averaged into the text representation. For words absent from the word embeddings, it uses subword information [2] to guess the meaning of the word. Another trick applied is pruning some of the vocabulary elements; feature selection among N-grams is very inefficient and complicated, so an online parallelizable greedy approach is used: check whether the document is already covered, and if not, add the feature with the highest norm.

Finally, a linear classifier is used to classify the data. To speed things up, "product quantization" is applied, which utilizes compressed-domain approximate nearest neighbor search (Jégou et al., 2011) [7]. The compression technique approximates a real-valued vector by finding the closest vector in a predefined structured set of centroids, as in Figure 3.1. The original PQ has been concurrently improved by Ge et al. [6] and Norouzi & Fleet, who learn an orthogonal transform minimizing the overall quantization loss. The technique may sacrifice some accuracy to gain much more speed and memory efficiency. Their experiments show that, with normalization, both PQ and OPQ are almost lossless with 4 subquantizers. We used the released build from the Facebook GitHub repository.

Figure 3.1: Product quantization: it uses predefined centroids to approximate the distance between two points. There are two types of PQ, symmetric and asymmetric.
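To make the idea of asymmetric product quantization concrete, here is a small NumPy sketch under toy assumptions (the dimensions, number of sub-quantizers, and random codebooks are placeholders; real codebooks are learned by k-means):

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_sub = 8, 4                 # toy setup: 8-dim vectors, 4 subquantizers of 2 dims each
sub_dim = dim // n_sub
k = 16                            # centroids per subquantizer (assumed, normally learned)
codebooks = rng.normal(size=(n_sub, k, sub_dim))

def encode(x):
    """Map each sub-vector of x to the index of its nearest centroid."""
    codes = []
    for s in range(n_sub):
        sub = x[s * sub_dim:(s + 1) * sub_dim]
        codes.append(np.argmin(np.linalg.norm(codebooks[s] - sub, axis=1)))
    return np.array(codes)

def asymmetric_distance(query, codes):
    """Approximate ||query - x||^2 using only the stored codes of x."""
    d = 0.0
    for s in range(n_sub):
        sub = query[s * sub_dim:(s + 1) * sub_dim]
        d += np.sum((sub - codebooks[s, codes[s]]) ** 2)
    return d

x, q = rng.normal(size=dim), rng.normal(size=dim)
print(asymmetric_distance(q, encode(x)), np.sum((q - x) ** 2))  # approximation vs. exact
```

Only the codes of the stored vectors and the small codebooks need to be kept in memory, which is where the compression comes from.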

Figure 3.2: The architecture of FastText; the image is cited from [8].

3.3 Paragraph Vector

This method was proposed in [12]. The idea is to obtain a summary of paragraphs, sentences, or documents. Two different algorithms are proposed: distributed memory (DM) and distributed bag of words (DBOW). The DM model in Figure 3.3a is quite similar to word2vec; the difference between PVDM and word2vec is that the former contains a paragraph matrix, and every paragraph is mapped to a unique vector. Figure 3.3b shows the DBOW architecture, which is conceptually simpler than the DM model and stores less data.

The paragraph vectors are asked to contribute to the prediction task of the next word given many contexts sampled from the paragraph. The contexts are fixed-length and sampled from a sliding window over the paragraph, so DM takes the sequence into consideration. The DBOW model, in contrast, ignores the context words in the input and forces the model to predict words randomly sampled from the paragraph in the output; it is quite similar to Skip-gram in word2vec. The authors suggest that DM is consistently better in general cases, while DBOW takes fewer resources, and they claim the method is applicable to both short sentences and long paragraphs. We use the implementation from Gensim, which supports both DM and DBOW models, and we use an SVM with a linear kernel as the classifier.
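A minimal sketch of this setup with Gensim's Doc2Vec and scikit-learn follows; the toy corpus, label names, and hyperparameter values are illustrative assumptions rather than the exact experimental configuration:

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.svm import LinearSVC

# `segmented_posts` is assumed: (jieba tokens, emotion label) pairs
segmented_posts = [(["弊", "喇", "好似", "有", "喉嚨痛"], "SAD"),
                   (["太妙了"], "JOY")]

docs = [TaggedDocument(words=toks, tags=[i]) for i, (toks, _) in enumerate(segmented_posts)]

# dm=1 -> distributed memory (DM); dm=0 -> distributed bag of words (DBOW)
model = Doc2Vec(docs, vector_size=100, window=5, negative=5, dm=0, epochs=20, min_count=1)

X = [model.dv[i] for i in range(len(docs))]   # model.dv in Gensim 4.x (model.docvecs in 3.x)
y = [label for _, label in segmented_posts]
clf = LinearSVC().fit(X, y)                   # linear-kernel SVM on the paragraph vectors
```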

(a) distributed memory (b) distributed bag of words

Figure 3.3: Paragraph vector: the images show the difference between the two models; the images are from [12].

3.4 Siamese-CBOW

Siamese-CBOW [9] computes a sentence embedding by averaging the embeddings of its constituent words, instead of using pre-trained word embeddings. It applies the bag-of-words concept from word2vec: the average of the embeddings of the words composing a sentence is used to estimate the probability of predicting the surrounding sentences.

The architecture is shown in Figure 3.4. As it indicates, the word embeddings are optimized directly for averaging, with a supervised training criterion of predicting sentences occurring next to each other in the training data. Cosine similarities are used to compute the proximity of sentences, and a softmax is applied in the last layer to produce the final probability distribution.

The authors also evaluated the effect of the hyperparameters: increasing the number of negative samples yields limited gain, and a higher dimension is preferred to generate better results. We used the implementation from the authors (https://bitbucket.org/TomKenter/siamese-cbow/overview) and ported it to Python 3 for better Unicode compatibility.

Figure 3.4: The architecture of Siamese-CBOW; the image is from the original paper [9].
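The core scoring step can be sketched in a few lines of NumPy; this is a simplified illustration of the averaging and cosine-softmax idea, with randomly initialized embeddings standing in for the embeddings being trained:

```python
import numpy as np

rng = np.random.default_rng(1)
vocab = {"我": 0, "很": 1, "開心": 2, "難過": 3, "今天": 4}
emb = rng.normal(size=(len(vocab), 100))          # word embeddings under training

def sent_vec(tokens):
    """Siamese-CBOW style sentence vector: the average of its word embeddings."""
    return emb[[vocab[t] for t in tokens]].mean(axis=0)

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

anchor = sent_vec(["我", "很", "開心"])
candidates = [sent_vec(["今天", "很", "開心"]),   # a neighboring sentence (positive)
              sent_vec(["我", "很", "難過"])]     # a negative sample
scores = np.array([cos(anchor, c) for c in candidates])
probs = np.exp(scores) / np.exp(scores).sum()     # softmax over cosine similarities
print(probs)
```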

Chapter 4 Experiments

4.1 Experimental Settings

Figure 4.1: The emoticons in WeiBo

The dataset we chose is Open WeiboScope [5], which was collected randomly from WeiBo through the API by researchers at the Journalism and Media Center of the University of Hong Kong in 2012. It contains 226 million posts distributed evenly over the year. Most Weibo users come from different provinces of China, and there are also some users from Hong Kong or overseas. The content contains both simplified and traditional Chinese, and some provincial dialects appear in the dataset. Weibo allows users to insert emoticons, which appear in the raw data as tags such as [笑] (smile) and [淚] (tear) and are displayed as images like those in Figure 4.1.

Table 4.1: Tag Category
Categories: JOY, DISGUST, SAD, FEAR, SURPRISE, ANGER
Tags: 呵呵 酷 赞 乐乐 贊 鼓掌 耶 黑线 汗 晕 可憐 淚 衰 失望 伤心 泪 生病 囧 鄙视 委屈 可憐 吃驚 吃惊 怒 抓狂

Table 4.2: Number of posts per category
ANGER     331,091
DISGUST   261,955
FEAR      151,564
JOY       717,059
SAD       788,492
SURPRISE  191,974

We used the tags in posts as indicators of sentiment, and removed duplicated posts, posts without any tags, and posts with too many tags. We evaluated the classification accuracy of the different algorithms, using TF-IDF and SVM (Joachims, 1998) as the baseline.

4.2 Preprocess

For data preprocessing and cleansing: most posts contain more than one tag, so to avoid ambiguity we only preserved those with a single tag, removing posts with too many tags or without any tag. We also removed duplicated posts roughly by their post id, because it is a property of the Chinese microblog [5] that Chinese netizens post repeatedly. Besides, we only chose posts over a certain length (over 10 characters). Finally, we used jieba with a dictionary to segment the posts.

We used the 6 most common emotions that most social networks support: JOY, SAD, ANGER, FEAR, SURPRISE, and DISGUST, and classified the tags into these classes manually. Like [18], we also suffered from skewed class sizes: JOY and SAD account for more than 50% of the posts, so we only selected some specific tags from JOY and SAD to make the whole dataset more balanced. The mapping is shown in Table 4.1; JOY contains tags like 呵呵 (haha) and 贊 (excellent), and SAD contains tags like 失望 (disappointed) and 淚 (tear). We removed the tags from the original posts. The numbers of posts left for the 6 categories are displayed in Table 4.2; the classes most posts belong to are still JOY and SAD. After an initial round, we found that special strings or tokens such as usernames or URLs may affect the result, so we removed those special tokens from the posts as well.
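A sketch of this preprocessing pipeline is shown below; the tag-to-class mapping is only a small illustrative subset of Table 4.1, and the regular expressions are assumptions about the raw post format:

```python
import re
import jieba

# Partial tag-to-class mapping, following Table 4.1 (illustrative subset only)
TAG_CLASS = {"呵呵": "JOY", "贊": "JOY", "失望": "SAD", "淚": "SAD", "怒": "ANGER"}
TAG_RE = re.compile(r"\[([^\[\]]+)\]")          # emoticon tags appear as [笑], [淚], ...

def preprocess(post):
    """Return (segmented tokens, sentiment class) or None if the post is unusable."""
    tags = TAG_RE.findall(post)
    labels = {TAG_CLASS[t] for t in tags if t in TAG_CLASS}
    if len(tags) != 1 or len(labels) != 1:       # keep single-tag posts only
        return None
    text = TAG_RE.sub("", post)                  # strip the tag itself
    text = re.sub(r"(http\S+|@\S+)", "", text)   # drop URLs and usernames
    if len(text) <= 10:                          # length filter (over 10 characters)
        return None
    return jieba.lcut(text), labels.pop()

print(preprocess("今天考試考差了,好難過[淚] http://t.cn/xxx"))
```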

Table 4.3: FastText Dataset
no segmentation         弊喇,好似有少少喉嚨痛添!
segmentation            弊 喇 , 好似 有 少 少 喉嚨痛 添 !
segmentation + pinyin   bì lǎ , hǎosì yǒu shǎo shǎo hóulóngtòng tiān !

4.3 PVDM

In the paragraph vector experiment, we tested both DM and DBOW. Additionally, Gensim supports two DM variants, using averaging or concatenation; we use DM/C and DM/M to denote concatenation and averaging respectively, and used the suggested parameters for the three models. The dimension of the vectors is 100 and the number of negative samples is 5 for both DM and DBOW, and the window sizes are 5 and 10 for concatenation and averaging respectively.

4.4 FastText

In the FastText experiment, we tried three formats: a non-segmented dataset, a segmented dataset, and a pinyin dataset. We tested the non-segmented dataset because some training data in the official demonstration is Japanese without segmentation, and we wondered whether segmentation matters. Additionally, we wanted to test whether subword information [2] works for Chinese, so we converted the dataset to pinyin as well. The differences can be seen in Table 4.3.

For the pinyin conversion, we used jieba plus the pinyin npm package (https://www.npmjs.com/package/pinyin) to convert the characters to pinyin, including the tones. We used the built-in classifier to classify the test set. We iterated through parameters such as window size from 8 to 100 and the loss functions ns, hs, and softmax; since the results did not show significant differences between these parameters, we only display one of them as reference. We also tested dimension sizes from 8 to 300 and the loss functions hs (hierarchical softmax), ns (negative sampling), and softmax.
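A sketch of how such a supervised FastText model can be trained with the official Python bindings follows; the input file name and hyperparameter values are placeholders, and pypinyin is used here as a stand-in for the Node pinyin package mentioned above:

```python
import fasttext
from pypinyin import lazy_pinyin, Style

# One post per line, prefixed with its label, e.g. "__label__SAD 弊 喇 , 好似 有 ..."
model = fasttext.train_supervised(
    input="weibo_train.txt",          # placeholder path
    dim=100, ws=15, loss="hs", epoch=5, wordNgrams=2, minn=3, maxn=6)

print(model.predict("好似 有 少 少 喉嚨痛"))   # -> predicted label and probability

# Converting a segmented post to tonal pinyin (pypinyin as a stand-in converter)
tokens = ["好似", "有", "喉嚨痛"]
print([" ".join(lazy_pinyin(t, style=Style.TONE)) for t in tokens])
```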

Table 4.4: Results: the best accuracy of different models
TF-IDF                             0.44 (± 0.04)
PVDM (DBOW)                        0.40
FastText                           0.51
FastText (Pinyin)                  0.51
Siamese-CBOW (10) with pre-train   0.45 (± 0.02)

Table 4.5: FastText accuracy
                   8      12     16     32     64
no segmentation    0.369  0.375  0.389  0.372  0.368
segmentation       0.515  0.515  0.514  0.516  0.513
pinyin             0.513  0.518  0.516  0.517  0.51

4.5 Siamese-CBOW

We use Siamese-CBOW with the default parameters suggested by the authors. The dimension of the vectors is 100 and the update algorithm is AdaDelta. We ran it with 5 and 10 epochs separately, without any pre-trained word embeddings, so it generates the word embeddings from scratch. Due to the low performance of the runs without pre-trained word embeddings, we also used Gensim to generate pre-trained word embeddings from our dataset and ran 10 epochs with them. After the vectors are generated, we use LinearSVC to classify the results.

4.6 Experimental Results

It took around 2 days with an NVIDIA TITAN X (Pascal) GPU to finish the Siamese-CBOW word embedding training; the performance with epochs 5 and 10 is below the baseline. The other models finished on CPU within hours. Table 4.6 and Table 4.7 show the accuracy for Siamese-CBOW and PVDM respectively.

In the Siamese-CBOW experiment, we found that the result without pre-trained embeddings is below the baseline. We trained the original dataset with Gensim word2vec as pre-trained embeddings; the result improved, but the difference from the baseline is not statistically significant.

Table 4.6: Results: the accuracy of Siamese-CBOW
Siamese-CBOW (5)                   0.41 (± 0.04)
Siamese-CBOW (10)                  0.39 (± 0.03)
Siamese-CBOW (10) with pre-train   0.45 (± 0.02)

Table 4.7: Result of PVDM
        Test set   Training set
DM/C    0.384      0.384
DM/M    0.38       0.436
DBOW    0.404      0.457

In the results of PVDM (Table 4.7), DBOW performs slightly better than the other two DM models, and the overall accuracy is a little below the baseline.

Figure 4.2: The confusion matrix for TF-IDF + SVM

We also look into the classification results with confusion matrices. In Figure 4.2 and Figure 4.3a, we can see that the 2 major classes give the best accuracy, while the test results are skewed toward these 2 major classes as well. ANGER, DISGUST, and FEAR are more likely to be classified as SAD, which is reasonable since those posts contain more negative sentiment. The tendency in the Siamese-CBOW result is more obvious: it classified no entries into some rarely used classes, which may also result from its low performance.

We also evaluated all classification details of the FastText experiments. The confusion matrix for FastText in Figure 4.4b shows a different tendency for the pinyin dataset: the accuracy for SURPRISE is higher than for SAD, indicating that SURPRISE may be modeled better than SAD, which contains more samples.

Figure 4.3: (a) The training set without pre-trained embedding, (b) the training set with pre-trained embedding.

Figure 4.4: (a) Segmented dataset trained with dimension=300, window size=20, loss function hierarchical softmax; (b) pinyin dataset with dimension=200, window size=15, loss function softmax.

Figure 4.4a is quite similar to the confusion matrices in the other experiments, where most test data are classified as JOY and SAD, because these two classes contain the majority of the posts. Most of the results for the segmented dataset are alike regardless of the loss function and dimension settings.
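The confusion matrices themselves can be produced with scikit-learn in a few lines (a sketch; the gold labels and predictions here are placeholders standing in for the classifier outputs above):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

classes = ["ANGER", "DISGUST", "FEAR", "JOY", "SAD", "SURPRISE"]
y_true = ["JOY", "SAD", "FEAR", "SAD", "JOY", "ANGER"]   # placeholder gold labels
y_pred = ["JOY", "SAD", "SAD", "SAD", "JOY", "SAD"]      # placeholder predictions

cm = confusion_matrix(y_true, y_pred, labels=classes)    # rows: true class, columns: predicted
print(np.array2string(cm))
```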

Chapter 5 Discussions

5.1 Discussion

For the baseline, TF-IDF gives an accuracy of 0.44 (± 0.04). The most distinguishing features the classifier uses are some rarely used terms. Since we only removed duplicated posts roughly, the data may still contain duplicated posts from different sources sharing certain rarely used words. In general, the model is not general enough and may not be applicable when the dataset changes.

Additionally, compared to TF-IDF, the other methods convert sentences to vectors of lower dimension. Theoretically, the denser representation may benefit the classifier: a sparse matrix takes much more time and space to train, and with such high-dimensional vectors it is impractical to apply a non-linear classifier.

Generally, FastText achieves better accuracy when segmentation is included and is able to train on the large-scale dataset more efficiently with its own classifier. With the conversion to pinyin, it also achieves similar accuracy. Although we tried different settings for FastText, the accuracy does not differ significantly across loss functions, window sizes, and dimensions. In the comparison set, the segmented dataset outperforms the one without segmentation, which suggests that the term itself may be more meaningful than a single character. FastText also took much less time than the other implementations. Although we did not run it with the same linear classifier as in the other experiments, we should consider that its accuracy may not be lossless.

Table 5.1: The features extracted from TF-IDF
ANGER: 想瘦 造謠 恶浪 顶缸 银行利息 卫冕冠军 落水狗 剖腹自杀 剥下 多吉
DISGUST: 抬杠 赃款 由此看来 淫妇 觀天象 解除 上刑 矢泽爱 超现实主义 美國使館
FEAR: 含铅 谋财害命 表同情 经济体 可憐見 突如其来 万念俱灰 供应站 勞資 深红
JOY: 余则 符离 驻守 神明 asce 三元里 何苦来哉 太妙了 两面三刀 slient
SAD: grunewald 节省时间 张春桥 前导 已閱 q1050505041 杀伐 离世 噴火龍 查無此人
SURPRISE: 印钞机 绝密 54:03.7 清福 任免 karei 一桩 sikucd touchsmart610 翻筋斗

5.2 Baseline

We used TF-IDF as the baseline, and the result is 0.44 (± 0.04), which is statistically better than random guessing over 6 classes. We examined the features that TF-IDF uses to classify, shown in Table 5.1. Some terms seem related: the 2nd term in ANGER is 造謠 (spread rumors) and the third is 恶浪 (violent billow), and the second term in FEAR is 谋财害命 (murder for money). But most terms are not intuitively related.

Most of the words in each class are actually not commonly used. This is caused by the IDF term, which may overweight rarely used words. Another problem we observed is that netizens keep posting things like advertisements, jokes, or news; though their contents are not exactly the same, the words or terminology they use may be quite common. Aside from TF-IDF, most algorithms assume that articles are distinct from each other, which is not a common case in the real world except for some well-organized website articles. We may need some extra preprocessing to handle intentionally repeated posts. Some publications [3] show that posters may use morphs to avoid censorship, and we cannot evaluate how these words contribute to the sentiment analysis. However, the similarity of morphs can be modeled by word2vec given proper context and volume, whereas it is relatively difficult to identify in a TF-IDF model.
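A sketch of how such per-class features can be read out of the baseline, assuming the TfidfVectorizer plus LinearSVC pipeline of Section 3.1 (the toy corpus and variable names are illustrative):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# `texts` are space-joined jieba tokens, `labels` the emotion classes (toy placeholders)
texts = ["太妙了 鼓掌", "失望 傷心 淚", "抓狂 怒", "吃驚 突如其来"]
labels = ["JOY", "SAD", "ANGER", "SURPRISE"]

vec = TfidfVectorizer(token_pattern=r"\S+")
X = vec.fit_transform(texts)
clf = LinearSVC().fit(X, labels)                 # one-vs-rest linear SVM

terms = np.array(vec.get_feature_names_out())    # scikit-learn >= 1.0
for cls, coefs in zip(clf.classes_, clf.coef_):
    top = terms[np.argsort(coefs)[::-1][:3]]     # highest-weight terms for this class
    print(cls, top)
```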

5.3 Siamese-CBOW

The performance of Siamese-CBOW is below the baseline. We evaluated the model it trained, and it does not seem to have converged; the word embeddings are not trained correctly. For example, when we query "我" (I) in the word embeddings generated by Siamese-CBOW after 5 epochs, the related words it returns are a rarely used word meaning hurt, 第四节 (the fourth quarter), and 贾宝玉 (a character name from a novel). With the 10-epoch embeddings, the related words are 几家 (some homes), a rarely used word meaning peek, and 速成班 (crash course). There is no sign of convergence. In the confusion matrix, most test results fall into the two major classes, similar to the other models, due to data imbalance. We tried using pre-trained word embeddings with 10 epochs to improve it, and the result improved to the level of the baseline. However, when we assessed the embeddings it generated, they are still far from converged. According to the original paper, proper embeddings can be trained this way; we are not sure whether this property fails to hold for the Chinese dataset.

In the original paper, the dataset used is Toronto Books, which contains novels; therefore the semantics of a sentence may be highly coherent with the previous and next sentences. This may affect how the model determines the relationship between sentences in our case.

Using pre-trained embeddings may improve the performance. Another drawback of Siamese-CBOW is that it does not support features like subword information, which means that words absent from the training dataset are treated as non-existent. Conducting vocabulary expansion, as in Skip-Thoughts, may alleviate this problem.

5.4 PVDM

In the paragraph vector experiment, the results show that DBOW produced the best accuracy among the 3 models. In the original paper, the authors suggested that DM is consistently better than DBOW, and that the sum version of DM is often better than concatenation. So far it is not clear under what conditions DBOW outperforms the DM models. We tried to inspect the models: we fetched the most similar words to I (我), shown in Table 5.2. Surprisingly, the similar words from DBOW are all unrelated, while both DM/C and DM/M generated better results, whose top related words are synonyms of I. This seems predictable, as DBOW stores less data during training. DM/C and DM/M actually model the words in a more proper way even though their accuracy is not good. Although the accuracy is better for DBOW, this may not be a robust result; we would need more and different datasets to evaluate this further.
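The word-level query behind Table 5.2 corresponds to a Gensim call like the following (a sketch; `model` is assumed to be a trained Doc2Vec instance as in Section 3.3). Note that Gensim's plain DBOW mode does not train word vectors unless dbow_words=1 is set, which may partly explain the unrelated neighbours observed for DBOW:

```python
# `model` is assumed to be a trained gensim Doc2Vec instance (see Section 3.3).
# With dm=0 and the default dbow_words=0, word vectors stay near their random
# initialization, so their nearest neighbours are essentially arbitrary.
for word, score in model.wv.most_similar("我", topn=5):
    print(word, round(score, 3))
```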

Table 5.2: The 5 most similar words to I (我) in the three PVDM models
rank   DM/C    DBOW        DM/M
1      俺      三条         偶
2      偶      田           他
3      老子    温暖人间      俺
4      哀家    youtudou     我们
5      皮下    化水         她

Table 5.3: Similar words to "nǐ" (you) in the Pinyin dataset
wǒ            我 (I)
nǐzìjǐ        你自己 (yourself)
nǐmén         你們 (you)
,"nǐ          ,"你 (,"you)
shuí          誰 (who)
shuítāmā      誰他媽 (who the hell)
wǒhuì         我會 (I can)
,"shuí        ,"誰 (,"who)
,"shuítāmā    ,"誰他媽 (,"who the hell)
biérén        別人 (others)

5.5 FastText

In the FastText experiment, we evaluated the three datasets: segmented, non-segmented, and pinyin. Although the pinyin dataset achieves roughly the same overall accuracy as the segmented one, the confusion matrices show different tendencies. Regardless of the loss function and window size settings, both the segmented and the non-segmented datasets classify most entries into the 2 major classes, while with some specific parameters the classifier on the pinyin dataset can classify the minor classes as well. We validated the properties of the vectors generated from the pinyin dataset with FastText CBOW and skip-gram, as shown in Table 5.3, which confirms that both CBOW and skip-gram can generate pinyin word embeddings effectively. For the non-segmented dataset, however, the word embeddings consist of vectors of single characters, which cannot use subword information either.

Another finding concerns the word vectors generated. The segmented dataset contains some poorly segmented terms like "坑爹" (cheating me) or "有木有" (if or not), which are new internet slang that the dictionary file cannot handle properly.

Table 5.4: Words with low norm in the three datasets in FastText
segmented        旳, 颂, 硬伤, 財, 索取, 图像, 巾
non-segmented    喝水还那么麻烦, 给我试试好么@亚瑟小狼狗, ...etc.
pinyin

Such terms are therefore segmented into separate characters. Word2vec can merge those characters together due to their high co-occurrence, but FastText treats them as separate terms even though they may be highly related. We used t-SNE to visualize the vector space in Figure 5.1; as we can see, the vector space has some ambiguous boundaries.
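A sketch of this visualization step with scikit-learn and matplotlib follows; the sentence vectors and labels are random placeholders standing in for the vectors produced by the trained models:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# `sentence_vecs` (n_posts x dim) and `labels` are assumed to come from a trained model
rng = np.random.default_rng(2)
sentence_vecs = rng.normal(size=(300, 100))              # placeholder vectors
labels = rng.choice(["JOY", "SAD", "ANGER"], size=300)   # placeholder classes

xy = TSNE(n_components=2, perplexity=30, init="pca", random_state=0).fit_transform(sentence_vecs)
for cls in np.unique(labels):
    m = labels == cls
    plt.scatter(xy[m, 0], xy[m, 1], s=5, label=cls)
plt.legend()
plt.title("t-SNE of sentence vectors")
plt.show()
```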

Figure 5.1: Visualization of the vector space for the two datasets. (a) Segmented dataset, dimension=200, window size=15, loss function = hierarchical softmax; (b) pinyin dataset, dimension=200, window size=15, loss function = softmax.

We also analyzed the vectors generated. The numbers of vectors generated from 100,000 posts are 25,743, 3,137, and 22,233 for the segmented, non-segmented, and pinyin datasets respectively. We list the vectors with the lowest norm in Table 5.4. As the paper indicates, a vector with a lower norm contains less information; therefore most of them are stop words. Surprisingly, the non-segmented dataset contains many fewer words than the other two models, and most of the entries it contains look like sentences rather than terms. This implies that FastText itself cannot handle segmentation, and explains why it performs poorly on that dataset. For the segmented dataset, the vectors with the lowest norm are some rarely used words, which cannot be modeled properly given their low occurrence and few neighboring words. For the pinyin dataset, the vectors with the lowest norm are special abbreviations and icons.

Compared with the segmented dataset, those rarely used words may share the same pronunciation with other words, or their meaning can be inferred from subword information.

Table 5.5: N-gram constitution of "我們的" (ours)
key        similarity
<我們的>    0.02
<我們的     0.09
我們的>     0.07
們的>      -0.07
我們的      0.19
<我們      *0.99

5.5.1 N-grams Evaluation

We evaluated the vectors at the n-gram level. We found that a term like "我們" (our) is generated from 4 vectors, "我們", "<我們", "我們>", "<我們>", in the segmented dataset, while its counterpart in the pinyin dataset consists of 15 vectors, including "wǒmén", "<wǒ", "<wǒm", "<wǒmé", "<wǒmén", etc. The segmented n-gram vectors do not contain single characters like "我" or "們", because the prefix/suffix settings designed for English assume n-grams of about 3 letters or more.

We evaluated the word "我們的" (ours) in both the segmented vectors and the pinyin vectors, as shown in Table 5.5 and Table 5.6. The pinyin version contains many more combinations, and the properly segmented term "wǒmén-de" shows high proximity to the complete term. In the segmented set we see the same tendency: "們的", a poorly segmented piece, yields negative similarity.

The n-gram length is limited to 3-5 in this experiment. It may be better for modeling single Chinese characters to shorten the minimum n-gram length. Additionally, some pinyin syllables consist of only 2 letters, so lowering the minimum length may model them better as well, but it also adds extra computation. The appropriate prefix and suffix lengths may vary across languages, so they may be a parameter by which FastText can be tuned per language.
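The n-gram make-up of a word can be inspected with the official fastText Python bindings roughly as follows (a sketch; the model path is a placeholder, and the n-gram set depends on the minn/maxn values the model was trained with):

```python
import numpy as np
import fasttext

model = fasttext.load_model("weibo_seg.bin")     # placeholder path to a trained model

def ngram_similarities(word):
    """Cosine similarity between the full word vector and each of its subword vectors."""
    full = model.get_word_vector(word)
    subwords, ids = model.get_subwords(word)     # the n-grams plus the word itself
    for sw, idx in zip(subwords, ids):
        vec = model.get_input_vector(idx)
        sim = float(np.dot(full, vec) / (np.linalg.norm(full) * np.linalg.norm(vec) + 1e-9))
        print(f"{sw}\t{sim:.2f}")

ngram_similarities("我們的")
```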

Table 5.6: N-gram constitution of "wǒménde" (ours) in pinyin
key                 similarity
<wǒ - ménde>        0.73 - 0.14
<wǒm - énde>        0.70 - 0.19
<wǒmé - nde>        0.70 - 0.26
<wǒmén - de>        0.70 - 0.62
wǒmén               0.70
ménde               0.16
mén                 0.35

Table 5.7: Querying words that do not exist in the dataset
吃不起: 吃不饱, 买不起, 吃不完, 上不起, 经不起
给力: ok, 得意, [, lt, 想念
gōngxǐfācái (恭喜): gōngxǐ (恭喜), gōngqíjùn (宫崎骏), gōngpū (公布), gōngrè (公认), gōngpó (公婆)

5.5.2 Subword Information

According to the paper, subword information can compensate for an insufficient word embedding vocabulary. We evaluated whether this feature works for pinyin as well, that is, whether semantics can be inferred from pronunciation. Although we converted the dataset to pinyin, the accuracy is not significantly different from the original. Intuitively, pinyin is less readable to native speakers and is not reversible to characters, since multiple characters share the same pronunciation; Chinese contains fewer syllables and more homophones than English does. In the original paper [2], the authors evaluate the effectiveness on various languages such as Arabic, Czech, German, and English. All of them are written in phonographic scripts, as most languages are, while Chinese uses a logographic script.

We tried some examples on the segmented and pinyin datasets in Table 5.7. The retrieved terms share some of their constituent characters with the query, so their meanings are similar to some degree, but this cannot prove that the model can assess the semantics of a new term precisely from its constituent characters. For "给力" (it works or supports), "ok" may be close to some degree, but the others are not. The pinyin examples seem to return only terms with high morphological similarity.

[13] provides a similar function to compensate for words absent from the training set: it exploits the similarity of pre-trained word embeddings by learning a mapping from the pre-trained space to the model's own space to expand the vocabulary. There are also different approaches, such as using morphologically annotated data, introduced by Cotterell and Schütze (2015). Evaluating the differences between these approaches would be a good topic for future work.
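Such out-of-vocabulary queries can be reproduced with the same bindings roughly like this (the model path is again a placeholder):

```python
import fasttext

model = fasttext.load_model("weibo_seg.bin")   # placeholder path to a trained model

# Even if the query never appeared in training, a vector is composed from its
# character n-grams, so a nearest-neighbour lookup still returns candidates.
print(model.get_nearest_neighbors("吃不起", k=5))
```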

Chapter 6 Conclusion

We demonstrated various modern methods on a Chinese corpus. The results indicate that, in general, models such as PVDM and FastText are largely invariant to language properties, improve semantic analysis compared with traditional TF-IDF, and extract information more efficiently with denser vectors. Different models can be used in different contexts.

Most methods are developed based on the properties of English or other Latin-script languages, so segmentation and other preprocessing play crucial roles in generalizing them to other languages; however, segmentation may also introduce errors. Although some of the methods can be run on non-segmented sentences, they perform worse due to improper segmentation.

FastText demonstrates excellent properties in both performance, including training and testing time, and memory utilization. Besides the accuracy of the semantic conversion, performance and memory efficiency also become interesting subjects of study, since the amount of information grows so quickly.

Reference

[1] G. Arevian. Recurrent neural networks for robust real-world text classification. In Proceedings of the IEEE/WIC/ACM International Conference on Web Intelligence, pages 326–329. IEEE Computer Society, 2007.

[2] P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov. Enriching word vectors with subword information. arXiv preprint arXiv:1607.04606, 2016.

[3] L. Chen, C. Zhang, and C. Wilson. Tweeting under pressure: Analyzing trending topics and evolving word choice on sina weibo. In Proceedings of the First ACM Conference on Online Social Networks, COSN '13, pages 89–100, New York, NY, USA, 2013. ACM.

[4] K. Dashtipour, S. Poria, A. Hussain, E. Cambria, A. Y. A. Hawalah, A. Gelbukh, and Q. Zhou. Multilingual sentiment analysis: State of the art and independent comparison of techniques. Cognitive Computation, 8(4):757–771, Aug 2016.

[5] K.-w. Fu and M. Chau. Reality check for the chinese microblog space: a random sampling approach. PloS one, 8(3):e58356, 2013.

[6] T. Ge, K. He, Q. Ke, and J. Sun. Optimized product quantization for approximate nearest neighbor search. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2946–2953, 2013.

[7] H. Jégou, R. Tavenard, M. Douze, and L. Amsaleg. Searching in one billion vectors: re-rank with source coding. In Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on, pages 861–864. IEEE, 2011.

[8] A. Joulin, E. Grave, P. Bojanowski, M. Douze, H. Jégou, and T. Mikolov. Fasttext.zip: Compressing text classification models. arXiv preprint arXiv:1612.03651, 2016.

[9] T. Kenter, A. Borisov, and M. de Rijke. Siamese cbow: Optimizing word embeddings for sentence representations. 2016.

[10] Y. Kim. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882, 2014.

[11] R. Kiros, Y. Zhu, R. Salakhutdinov, R. S. Zemel, A. Torralba, R. Urtasun, and S. Fidler. Skip-thought vectors. arXiv preprint arXiv:1506.06726, 2015.

[12] Q. V. Le and T. Mikolov. Distributed representations of sentences and documents. ICML, 2014.

[13] T. Mikolov, Q. V. Le, and I. Sutskever. Exploiting similarities among languages for machine translation. CoRR, abs/1309.4168, 2013.

[14] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. Pages 3111–3119, 2013.

[15] B. Pang, L. Lee, and S. Vaithyanathan. Thumbs up?: sentiment classification using machine learning techniques. In Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing - Volume 10, pages 79–86. Association for Computational Linguistics, 2002.

[16] D. Tang, F. Wei, N. Yang, M. Zhou, T. Liu, and B. Qin. Learning sentiment-specific word embedding for twitter sentiment classification. In ACL (1), pages 1555–1565, 2014.

[17] D. Vilares, M. Alonso Pardo, and C. Gómez-Rodríguez. Supervised sentiment analysis in multilingual environments. 53, 05 2017.

[18] J. Zhao, L. Dong, J. Wu, and K. Xu. Moodlens: an emoticon-based sentiment analysis system for chinese tweets. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1528–1531. ACM, 2012.
