


Table 5.5: n-gram constitution of “我們的” (ours) and similarity of each n-gram to the key

In the segmented dataset, those rarely used words may have the same pronunciation as other words, or their meaning can be inferred from subword information.

5.5.1 N-grams Evaluation

We evaluated the vectors at the n-gram level. We found that a term like “我們” (we/our) is generated from 4 vectors in the segmented dataset: “我們”, “<我們”, “我們>”, and “<我們>”, while its counterpart in the pinyin dataset consists of 15 vectors, including “wǒmén”, “<wǒ”, “<wǒm”, “<wǒmé”, “<wǒmén”, etc. The segmented n-gram vectors do not contain single characters such as “我” or “們”, because the prefix and suffix settings were chosen for English, where affixes usually consist of more than 2 or 3 letters.
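To make this decomposition concrete, the following minimal sketch enumerates character n-grams in the same fashion. It is plain Python rather than the actual fastText implementation, which additionally hashes each n-gram into a fixed number of buckets:

```python
def char_ngrams(word, min_n=3, max_n=5):
    # Pad the word with fastText's boundary markers '<' and '>' and collect
    # every character n-gram whose length lies within [min_n, max_n].
    padded = "<" + word + ">"
    grams = [padded[i:i + n]
             for n in range(min_n, max_n + 1)
             for i in range(len(padded) - n + 1)]
    # fastText additionally keeps the full word itself as its own token.
    return [word] + grams

print(char_ngrams("我們"))
# ['我們', '<我們', '我們>', '<我們>']
print(len(char_ngrams("wǒmén")))
# the romanized form yields considerably more n-grams
```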

We evaluated the word “我們的” (ours) in both the segmented vectors and the pinyin vectors, as shown in Table 5.5 and Table 5.6. The pinyin form contains many more combinations, and the term “wǒmén-de”, which is segmented properly, indicates high proximity to the complete term. The segmented set shows the same tendency: “們的”, a poorly segmented term, yields a negative similarity.
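A table of this kind could be reproduced along the following lines. This is a sketch only, assuming the fasttext Python bindings and a hypothetical model file name; the thesis does not specify the exact tooling used:

```python
import numpy as np
import fasttext

model = fasttext.load_model("weibo_pinyin.bin")  # hypothetical model file

def ngram_similarities(model, word):
    # get_subwords() returns the n-gram strings of `word` together with their
    # row indices in the input matrix; compare each n-gram vector against the
    # vector of the complete word.
    word_vec = model.get_word_vector(word)
    subwords, ids = model.get_subwords(word)
    sims = {}
    for sub, idx in zip(subwords, ids):
        vec = model.get_input_vector(int(idx))
        denom = np.linalg.norm(word_vec) * np.linalg.norm(vec) + 1e-9
        sims[sub] = float(np.dot(word_vec, vec) / denom)
    return sims

for gram, sim in sorted(ngram_similarities(model, "wǒménde").items(),
                        key=lambda kv: -kv[1]):
    print(f"{gram}\t{sim:+.3f}")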

The length of n-grams was limited to 3-5 in this experiment. It may be better to model single Chinese characters by shortening the minimum n-gram length. Additionally, some pinyin syllables consist of only 2 letters, so lowering the minimum length may model them better as well, although it also adds extra computational cost. The typical length of prefixes and suffixes varies across languages, so it may be a parameter worth optimizing when applying FastText to a new language.
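As a sketch of how such a change could be tried with the fasttext Python bindings, the minn and maxn parameters control the n-gram length range. The corpus file names below are hypothetical placeholders, and the exact values would still need tuning:

```python
import fasttext

# Hypothetical corpora: one sentence per line, tokens separated by spaces.
# A smaller minn lets single Chinese characters receive their own n-gram
# vectors; for pinyin, minn=2 would also cover two-letter syllables.
seg_model = fasttext.train_unsupervised(
    "weibo_segmented.txt", model="skipgram", dim=100, minn=1, maxn=4)
pinyin_model = fasttext.train_unsupervised(
    "weibo_pinyin.txt", model="skipgram", dim=100, minn=2, maxn=5)
```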

5.5.2 Subword Information

According to the paper, subword information can compensate for the insufficiency of word embeddings. We tried to evaluate whether this feature works in pinyin as well, that is, whether semantics can be inferred from the pronunciation. Although we converted the dataset to pinyin, the accuracy is not significantly different from the original accuracy.
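The conversion tool is not named here; as one possible sketch, the pypinyin package can map each segmented word to tone-marked pinyin:

```python
from pypinyin import lazy_pinyin, Style

def word_to_pinyin(word):
    # Join the per-character pinyin of one segmented word into a single token,
    # keeping tone marks so that homophones with different tones stay distinct.
    return "".join(lazy_pinyin(word, style=Style.TONE))

print(word_to_pinyin("我們的"))  # e.g. 'wǒmende'
```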

Table 5.6: n-gram constitution of “wǒménde” (ours) in pinyin and similarity of each n-gram to the key

Table 5.7: Querying words that do not exist in the dataset: “吃不起”, “给力”, “gōngxǐfācái” (恭喜)


Intuitively, pinyin is less readable to native speakers and is not reversible to characters, since multiple characters share the same pronunciation; Chinese has fewer syllables and more homophones than English does. In the original paper [2], the authors evaluate the effectiveness on various languages such as Arabic, Czech, German, and English. All of these use phonographic writing systems, as most languages do, whereas Chinese is exceptional in using a logographic one.

We tried some examples on the segmented and pinyin datasets in Table 5.7. The retrieved terms share some constituent characters with the query, so their meanings are similar to some degree, but this does not prove that the model can precisely assess the semantics of a new term from its constituent characters. For “给力” (it works, or it is supportive), “ok” may be close to some degree, but the other results are not. The pinyin example seemed to return only terms with high morphological similarity.
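A minimal sketch of how such out-of-vocabulary queries can be issued with the fasttext Python bindings follows; the model file names are hypothetical:

```python
import fasttext

seg_model = fasttext.load_model("weibo_segmented.bin")   # hypothetical paths
pinyin_model = fasttext.load_model("weibo_pinyin.bin")

# Even if a query never appeared in training, fastText composes its vector
# from character n-grams, so nearest neighbors can still be retrieved.
for query in ["吃不起", "给力"]:
    print(query, seg_model.get_nearest_neighbors(query, k=5))

print("gōngxǐfācái", pinyin_model.get_nearest_neighbors("gōngxǐfācái", k=5))
```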

The work in [13] also provides a similar function to compensate for words absent from the training set. It exploits the similarity of trained word embeddings by learning a mapping from one pre-trained embedding space to another to expand the vocabulary. There are also different approaches, such as using morphologically annotated data, introduced by Cotterell and Schütze (2015). Evaluating the differences between these approaches would be a worthwhile topic.


Chapter 6 Conclusion

We demonstrated various modern methods on the Chinese corpus, and the results indicate that PVDM and FastText are largely invariant to language properties. In general, most models improve semantic analysis compared with traditional TF-IDF, and it is more efficient to extract the information with denser vectors. Different models can be used in different contexts.

Most methods were developed based on properties of English or other Latin-script languages, so segmentation and other preprocessing play a crucial role in generalizing them to other languages. However, segmentation may also introduce errors: although some of the models can be applied to non-segmented sentences as well, they performed worse without proper segmentation.

We can see that FastText demonstrates excellent properties in both runtime performance, including training and testing time, and memory utilization. Besides the accuracy of the semantic representation, performance and memory efficiency have also become subjects of interest, since the amount of information grows so quickly.

[1] G. Arevian. Recurrent neural networks for robust real-world text classification. In Proceedings of the IEEE/WIC/ACM International Conference on Web Intelligence, pages 326–329. IEEE Computer Society, 2007.

[2] P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov. Enriching word vectors with subword information. arXiv preprint arXiv:1607.04606, 2016.

[3] L. Chen, C. Zhang, and C. Wilson. Tweeting under pressure: Analyzing trending topics and evolving word choice on sina weibo. In Proceedings of the First ACM Conference on Online Social Networks, COSN ’13, pages 89–100, New York, NY, USA, 2013. ACM.

[4] K. Dashtipour, S. Poria, A. Hussain, E. Cambria, A. Y. A. Hawalah, A. Gelbukh, and Q. Zhou. Multilingual sentiment analysis: State of the art and independent comparison of techniques. Cognitive Computation, 8(4):757–771, Aug 2016.

[5] K.-w. Fu and M. Chau. Reality check for the Chinese microblog space: a random sampling approach. PLoS ONE, 8(3):e58356, 2013.

[6] T. Ge, K. He, Q. Ke, and J. Sun. Optimized product quantization for approximate nearest neighbor search. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2946–2953, 2013.

[7] H. Jégou, R. Tavenard, M. Douze, and L. Amsaleg. Searching in one billion vectors: re-rank with source coding. In Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on, pages 861–864. IEEE, 2011.

[8] A. Joulin, E. Grave, P. Bojanowski, M. Douze, H. Jégou, and T. Mikolov. Fasttext.zip: Compressing text classification models. arXiv preprint arXiv:1612.03651, 2016.

[9] T. Kenter, A. Borisov, and M. de Rijke. Siamese CBOW: Optimizing word embeddings for sentence representations. 2016.

[10] Y. Kim. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882, 2014.

[11] R. Kiros, Y. Zhu, R. Salakhutdinov, R. S. Zemel, A. Torralba, R. Urtasun, and S. Fidler. Skip-thought vectors. arXiv preprint arXiv:1506.06726, 2015.

[12] Q. V. Le and T. Mikolov. Distributed representations of sentences and documents. In ICML, 2014.

[13] T. Mikolov, Q. V. Le, and I. Sutskever. Exploiting similarities among languages for machine translation. CoRR, abs/1309.4168, 2013.

[14] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. pages 3111–3119, 2013.

[15] B. Pang, L. Lee, and S. Vaithyanathan. Thumbs up?: sentiment classification using machine learning techniques. In Proceedings of the ACL-02 conference on Empirical methods in natural language processing-Volume 10, pages 79–86. Association for Computational Linguistics, 2002.

[16] D. Tang, F. Wei, N. Yang, M. Zhou, T. Liu, and B. Qin. Learning sentiment-specific word embedding for twitter sentiment classification. In ACL (1), pages 1555–1565, 2014.

[17] D. Vilares, M. Alonso Pardo, and C. Gómez-Rodríguez. Supervised sentiment analysis in multilingual environments. 53, 05 2017.

[18] J. Zhao, L. Dong, J. Wu, and K. Xu. Moodlens: an emoticon-based sentiment analysis system for Chinese tweets. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1528–1531. ACM, 2012.
