Conclusions and Recommendations

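Starting from applications to Chinese-language social media and taking the Sunflower Student Movement as a case study, this study evaluates the system with two measures, Topic Coherence and Topic Distance, to provide a reference for assessing the quality of the models the system builds. The case comprises datasets collected from six different sources. We found that the Twitter dataset performed worst on the Topic Coherence measure, so the system is less suited to data whose content is very short. However, for automatically […]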

國立政治大學 (National Chengchi University)

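[…] the number of documents under each topic, in order to understand the topic composition and tendencies of each source. We found that topics on Twitter and Apple Daily (蘋果日報) were the most concentrated.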

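Finally, the results of the experiments in this study are summarized as follows:

1. Most of the issues are discussions related to opposition to the Cross-Strait Service Trade Agreement (反服貿).

2. The experimental results show that Facebook groups concerned with Hong Kong politics, the anti-nuclear movement, and the Green Party also deliberately followed and shared topics about the student movement. On Twitter, discussion was particularly concentrated on the students' occupation of the Legislative Yuan, and, judging from content written in Simplified Chinese, those following the event included internet users in mainland China.

3. Content analysis of the four major real-time news outlets reveals some differences: 自由電子報 (Liberty Times Net) mainly emphasized the anti-"black box" theme (反黑箱); 蘋果日報 (Apple Daily) concentrated heavily on on-the-scene news from the occupation of the legislature; 中時電子報 (China Times) emphasized political activities; and 聯合新聞網 (udn.com) tended to report on protest events.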

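5.2 Future Development and Recommendations

As recommendations for future research, the limitations of topic models for analyzing Chinese-language data and their extended applications are described below: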

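5.2.1 System Limitations

According to the conclusions of this study, when topic models are applied to short texts such as those on Twitter, the insufficient average length reduces the validity of the resulting model. For subsequent applications, we therefore recommend examining during the preprocessing stage whether the lexical features of the texts are sufficient for training a model.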
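The preprocessing check recommended above could be sketched as follows. This is a minimal illustration and not the procedure used in the study: it assumes the documents have already been segmented into tokens, and the function name `length_profile`, the cut-off `min_tokens=10`, and the toy corpora are all hypothetical.

```python
from statistics import mean


def length_profile(tokenized_docs, min_tokens=10):
    """Summarize document lengths for a corpus of pre-tokenized documents.

    min_tokens is a hypothetical cut-off for "too short", not a value
    taken from the study.
    """
    lengths = [len(doc) for doc in tokenized_docs]
    if not lengths:
        raise ValueError("corpus is empty")
    # Distinct tokens across the whole corpus.
    vocabulary = {token for doc in tokenized_docs for token in doc}
    # Number of documents below the cut-off.
    short = sum(1 for n in lengths if n < min_tokens)
    return {
        "documents": len(lengths),
        "mean_tokens": mean(lengths),
        "vocabulary_size": len(vocabulary),
        "share_short": short / len(lengths),
    }


# Toy corpora invented for illustration: short posts vs. longer articles.
short_posts = [["佔領", "立法院"], ["反", "服貿"], ["學生", "佔領", "立法院"]]
long_articles = [["服貿"] * 12 + ["協議", "審查"], ["國會"] * 15]

posts_profile = length_profile(short_posts)
articles_profile = length_profile(long_articles)
```

A corpus in which nearly every document falls below the cut-off, as `short_posts` does here, would be a candidate for aggregation or exclusion before training. Where to set the cut-off is a judgment call that depends on the data.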


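5.2.2 Extended Applications of the System

Data collected by keyword still contains a large amount of content that is not directly relevant, and filtering and cleaning it manually has required considerable time and effort. In the future, the system will be developed into a rapid screening tool for large collected datasets. When handling large volumes of data from different sources, the topic composition of each source is first computed as a feature of that dataset and then used as a screening criterion. For example, if the top three topics account for 80% of the data, the dataset's topics are highly concentrated. This helps researchers screen and filter large volumes of data from different sources more quickly and effectively, excluding datasets that are more mixed or whose topics are more dispersed, before putting the selected data to further use.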
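The screening criterion described above could be sketched as follows. This is an illustrative sketch under stated assumptions, not the system's implementation: it assumes each document has already been assigned a single dominant topic (for example, the most probable topic from a fitted topic model). The function names `top_k_share` and `screen_sources` and the toy assignments are hypothetical; only the "top three topics" and "80%" figures come from the text.

```python
from collections import Counter


def top_k_share(doc_topics, k=3):
    """Return the fraction of documents covered by the k most frequent topics.

    doc_topics holds one dominant-topic label per document (an assumption).
    """
    counts = Counter(doc_topics)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no documents to screen")
    return sum(n for _, n in counts.most_common(k)) / total


def screen_sources(assignments_by_source, k=3, threshold=0.8):
    """Flag each source as concentrated (True) or dispersed (False)."""
    return {
        source: top_k_share(topics, k) >= threshold
        for source, topics in assignments_by_source.items()
    }


# Toy topic assignments invented for illustration.
assignments = {
    "source_a": [0] * 5 + [1] * 3 + [2] + [3],  # top 3 topics cover 9 of 10
    "source_b": [0, 1, 2, 3, 4] * 2,            # top 3 topics cover 6 of 10
}
flags = screen_sources(assignments)
```

In this toy data `source_a` passes the 80% criterion and `source_b` does not, so `source_b` would be set aside as too dispersed. Taking the single dominant topic per document is a simplification. Summing each document's full topic proportions is an equally plausible way to compute the composition.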


Source: 基於主題模型之社群媒體內容分析探索, 政大學術集成 (NCCU Academic Hub), pp. 68-73.
