Conclusions and Future Work - A Study on Relevance Retrieval and Opinion Orientation Analysis of Chinese Blog Articles

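This study uses political commentary articles as its experimental data and defines three topics: the student movement, Ma Ying-jeou, and Ma Ying-jeou and the student movement. For each topic, topic-relevant opinion orientation analysis is carried out in two stages: retrieval of topic-relevant articles, and opinion orientation analysis of those articles. Both unsupervised and supervised methods are explored.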

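For topic relevance, the unsupervised method uses query expansion to enlarge the set of topic-related terms and then retrieves additional topic-relevant articles with the expanded terms. Its core is the Pointwise Mutual Information (PMI) formula, which measures the degree of association between a term and a topic. In the unsupervised retrieval experiments, query expansion did retrieve more articles that are relevant to a topic without containing the topic term, but it also retrieved some irrelevant articles, so its precision was lower than that of retrieval with the topic term alone.

For the supervised method, this study uses SVM to classify articles. Training terms are chosen by scoring noun terms with the CHI (chi-square) formula, taking the 150 top-ranked terms plus another 150 terms selected from the ranks at and below the midpoint of the ranking. This study also found that the co-occurrence of certain terms indicates topic relevance; such a term combination is defined here as a rule pattern and is added to the feature set. For the intersection of the two topics, Ma Ying-jeou and the student movement, the features additionally record the co-occurrence of training terms with the terms of each topic. The experimental results show that the supervised method outperforms the unsupervised one: supervised retrieval reaches a precision of 91.72% on the student movement topic, 95.60% on the Ma Ying-jeou topic, and 92.40% on the Ma Ying-jeou and student movement topic.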
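The two term-scoring steps summarized above can be illustrated with a minimal sketch. It assumes that each article has already been segmented into a set of tokens, that PMI is estimated from document-level co-occurrence counts, and that the chi-square score is the standard 2x2 statistic over term presence and topic relevance. The function names, and the choice to take the lower 150 terms as the ranks immediately following the midpoint, are illustrative assumptions rather than the exact procedure of this study.

```python
import math


def pmi(term, topic_term, docs):
    """PMI between a candidate term and a topic term.

    docs is a list of token sets; probabilities are estimated from
    document-level (co-)occurrence counts (an assumption of this sketch).
    """
    n = len(docs)
    n_term = sum(1 for d in docs if term in d)
    n_topic = sum(1 for d in docs if topic_term in d)
    n_both = sum(1 for d in docs if term in d and topic_term in d)
    if n_term == 0 or n_topic == 0 or n_both == 0:
        return float("-inf")  # never co-occur: no evidence of association
    return math.log2((n_both * n) / (n_term * n_topic))


def chi_square(term, docs, labels):
    """Standard 2x2 chi-square score of a term against binary relevance labels."""
    a = b = c = d = 0
    for doc, relevant in zip(docs, labels):
        if term in doc:
            if relevant:
                a += 1  # term present, relevant
            else:
                b += 1  # term present, not relevant
        else:
            if relevant:
                c += 1  # term absent, relevant
            else:
                d += 1  # term absent, not relevant
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0
    return n * (a * d - c * b) ** 2 / denom


def select_terms(scores, k=150):
    """Take the top-k terms plus up to k terms starting at the midpoint rank.

    Reading "from the midpoint onward" as the k ranks directly after the
    midpoint is an assumption of this sketch.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    top = ranked[:k]
    seen = set(top)
    mid = len(ranked) // 2
    lower = [t for t in ranked[mid:mid + k] if t not in seen]
    return top + lower
```

In this sketch a candidate expansion term would be kept when its PMI with the topic term exceeds some threshold, and the terms returned by `select_terms` would serve as the vocabulary for the SVM features; both the threshold and the feature encoding are left open here.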

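For opinion orientation analysis of topic-relevant articles, the unsupervised part analyzes sentence structure with a lexicon-based method: polarity is assigned according to the NTUSD sentiment dictionary, and each sentence is checked for negation words, transition words, and a sentence-final question mark to decide whether its polarity should be changed. The supervised method selects training terms by their frequency of occurrence in articles of different polarity. Because the articles are political commentary, the polarity of a term in the sentiment dictionary does not necessarily match its polarity in the experimental articles; for example, "革命" (revolution) is classified as a positive term in the dictionary but is a negative term in political commentary. This study therefore uses the supervised method to revise the polarities in the sentiment dictionary. Many terms, however, are absent from the dictionary yet carry a distinct polarity; for example, "媽寶" (mama's boy), "草莓" (strawberry), and "綠衛兵" (green guards) are all negative terms. The best results were consequently obtained by emphasizing each term's polarity in the training data and then using the single-polarity terms extracted from topic sentences and rule pattern sentences as features. In the experiments restricted to articles actually labeled as having a polarity, the precision of opinion orientation analysis is 76.73% for the student movement topic and 79.03% for the Ma Ying-jeou topic.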
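The lexicon-based step described above can be sketched as follows. The rules are assumptions made for illustration only: a negation word flips the next sentiment term, only the clause after the last transition word is scored, and a sentence-final question mark reverses the result. The word lists and the tiny lexicon are hypothetical stand-ins, not the contents of NTUSD or the rule set used in this study.

```python
NEGATIONS = {"不", "沒有"}        # hypothetical negation word list
TRANSITIONS = {"但是", "可是"}    # hypothetical transition word list
QUESTION_MARKS = {"?", "？"}


def sentence_polarity(tokens, lexicon):
    """Return +1, -1, or 0 for a pre-segmented sentence.

    lexicon maps a term to +1 (positive) or -1 (negative).
    """
    # Assumed rule: only the clause after the last transition word counts.
    for i in range(len(tokens) - 1, -1, -1):
        if tokens[i] in TRANSITIONS:
            tokens = tokens[i + 1:]
            break

    score = 0
    negate = False
    for tok in tokens:
        if tok in NEGATIONS:
            negate = not negate  # assumed rule: flip the next sentiment term
        elif tok in lexicon:
            score += -lexicon[tok] if negate else lexicon[tok]
            negate = False

    # Assumed rule: a sentence-final question mark reverses the polarity.
    if tokens and tokens[-1] in QUESTION_MARKS:
        score = -score

    return (score > 0) - (score < 0)
```

An article-level polarity could then be obtained by aggregating the sentence results, for instance by majority vote; that aggregation is likewise an assumption and is not specified in this chapter.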

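For relevance retrieval, this study used Information Retrieval formulas such as Pointwise Mutual Information and the chi-square test to compute the association between terms and classification categories. Future work could try Latent Semantic Analysis (LSA) to examine how important each term is to a document and to extract terms that help classification. In addition, the results of opinion orientation analysis still leave room for improvement. Although the supervised learning method reached a precision above 70% for opinion orientation analysis in this study, supervised methods require a great deal of time and manually labeled data. Future work could adopt semi-supervised learning, using large amounts of easily obtained unlabeled, unprocessed data to build a more accurate classifier; for example, the features extracted by the supervised method in this study could be used to find sentences with the same features in unlabeled data, and a training model could be built from them. The hope is that a semi-supervised method can correct misclassified cases, improve performance, and reduce the manual labeling effort.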
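The semi-supervised idea proposed above, finding unlabeled sentences that contain features already learned by the supervised method, can be illustrated with a minimal pseudo-labeling sketch. It assumes that the learned features take the form of a mapping from term to polarity and that a sentence is accepted only when all of its matched features agree. The function name and the toy data are hypothetical; this is only a possible first step of such a method, not a result of this study.

```python
def pseudo_label(unlabeled_sentences, feature_polarity):
    """Assign labels to unlabeled sentences whose matched features agree.

    unlabeled_sentences: list of token lists (pre-segmented sentences).
    feature_polarity: dict mapping a feature term to +1 or -1.
    Returns a list of (tokens, label) pairs usable as extra training data.
    """
    labeled = []
    for tokens in unlabeled_sentences:
        polarities = {feature_polarity[t] for t in tokens if t in feature_polarity}
        # Keep the sentence only if it matches features of exactly one polarity;
        # sentences with no match or with conflicting matches are skipped.
        if len(polarities) == 1:
            labeled.append((tokens, polarities.pop()))
    return labeled


# Toy usage: "媽寶" is negative as discussed in this chapter; "加油" is a
# hypothetical positive feature added for illustration.
features = {"媽寶": -1, "加油": 1}
sentences = [["這", "群", "媽寶"], ["學生", "加油"], ["媽寶", "加油"], ["今天", "下雨"]]
extra_training_data = pseudo_label(sentences, features)
```

The pseudo-labeled sentences would be added to the manually labeled training set and the classifier retrained, possibly over several rounds; how many rounds to run and how to filter noisy pseudo-labels remain open design choices.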

