
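Starting from an application to Chinese-language social media and taking the Sunflower Student Movement as a case study, this study evaluates the models with two measures, Topic Coherence and Topic Distance, to provide a reference for assessing how well the models built by the system perform. The case comprises datasets collected from six different sources. We found that the Twitter dataset performed worst on the Topic Coherence measure, so the system is less suited to data whose content is too short. However, with respect to automatic [...]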

National Chengchi University (國立政治大學)

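[...] the number of documents under each topic, in order to understand the topic composition and tendencies of each source. We found that topics were most concentrated in the Twitter and Apple Daily (蘋果日報) datasets.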

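Finally, the analysis results obtained from the experiments in this study are summarized as follows:

1. Most topics concern discussion related to opposition to the Cross-Strait Service Trade Agreement.

2. The results show that Facebook groups concerned with Hong Kong politics, the anti-nuclear movement, and the Green Party also paid deliberate attention to the student movement and shared related content. Discussion on Twitter focused on the students' occupation of the Legislative Yuan, and the Simplified Chinese content suggests that some of this attention came from users in mainland China.

3. Content analysis of the four major online breaking-news outlets reveals some differences: Liberty Times Net (自由電子報) mainly emphasized opposition to "black-box" (non-transparent) procedures; Apple Daily (蘋果日報) concentrated heavily on on-site coverage of the occupation of the legislature; China Times (中時電子報) emphasized political activities; and udn.com (聯合新聞網) tended to report on protest events.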

5.2 Future Work and Recommendations

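As recommendations for future research, the limitations of topic models for Chinese-language data analysis and their extended applications are described below: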

5.2.1 System Limitations

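According to the conclusions of this study, when topic models are applied to short texts such as Twitter posts, the insufficient average length lowers the validity of the resulting model. For later applications, we therefore recommend examining during preprocessing whether the texts carry enough term features to train a model.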
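The preprocessing check recommended above could be sketched as follows. This is a minimal illustration rather than the system's actual implementation: the function name `length_profile`, the input format (documents already segmented into token lists), and the `min_tokens=10` cutoff are all assumptions made for the example, and the study does not specify a threshold.

```python
from statistics import mean


def length_profile(tokenized_docs, min_tokens=10):
    """Summarize how much term evidence a corpus offers for topic modeling.

    tokenized_docs: list of token lists (word segmentation is assumed to
    have been done already). min_tokens is a hypothetical cutoff below
    which a document is counted as "short".
    """
    if not tokenized_docs:
        raise ValueError("tokenized_docs must not be empty")
    lengths = [len(doc) for doc in tokenized_docs]
    short = sum(1 for n in lengths if n < min_tokens)
    vocabulary = {token for doc in tokenized_docs for token in doc}
    return {
        "mean_tokens": mean(lengths),               # average document length
        "short_share": short / len(lengths),        # fraction of short documents
        "vocabulary_size": len(vocabulary),         # number of distinct terms
    }
```

A corpus with a low `mean_tokens` and a high `short_share`, as one would expect for tweets, could then be flagged for aggregation or exclusion before a model is trained.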


5.2.2 Extended Applications of the System

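Data collected by keyword still contains a large amount of content that is not directly relevant, and filtering it manually, as has been done in the past, takes considerable time and effort. In the future, this system will be developed into a rapid screening tool for large-scale data collection. When handling large volumes of data from different sources, the topic composition of each source is first computed as a feature of that dataset and then used as a screening criterion. For example, if the top three topics account for 80% of the data, the topics of that dataset are highly concentrated. This helps researchers screen and filter large volumes of data from different sources more quickly and effectively, excluding datasets that are noisier or whose topics are more dispersed, before putting the selected data to further use.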
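The screening criterion in the example above (the top three topics covering 80% of the data) could be sketched as below. This is an illustrative sketch under stated assumptions, not the system's code: it assumes each document has already been assigned a single dominant topic (for instance, the most probable topic from a fitted model), and the function names are hypothetical.

```python
from collections import Counter


def top_k_topic_share(doc_topics, k=3):
    """Fraction of documents covered by the k most frequent topics.

    doc_topics: one dominant topic id per document (an assumption of
    this sketch; the source does not state how documents are assigned).
    """
    if not doc_topics:
        raise ValueError("doc_topics must not be empty")
    counts = Counter(doc_topics)
    covered = sum(n for _, n in counts.most_common(k))
    return covered / len(doc_topics)


def screen_datasets(datasets, k=3, threshold=0.8):
    """Return the names of datasets whose top-k topics reach the threshold.

    datasets: mapping from dataset name to its list of dominant topic ids.
    """
    return [
        name
        for name, topics in datasets.items()
        if top_k_topic_share(topics, k) >= threshold
    ]
```

Datasets that fall below the threshold would be treated as dispersed or noisy and set aside, while the remaining ones move on to further analysis. Both `k` and `threshold` are left as parameters because the 80% figure is given in the text only as an example.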

