Future Research Directions and Recommendations - Applying Text Mining Techniques to Open Source Intelligence Analysis

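This study used the Reuters-21578 collection as its experimental data. However, the number of documents per category in this collection is extremely uneven. When the self-organizing map is trained, this leaves the neurons insufficiently trained for some categories, which in turn lowers the detection accuracy for those categories in the subsequent event detection step. Categories that contain too few documents cannot be detected effectively. Training on a corpus whose documents are distributed more evenly across categories is therefore expected to improve detection accuracy.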
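One way to approximate a more evenly distributed training set from an existing corpus is to cap the number of documents drawn from each category before training. The following is a minimal sketch, not the procedure used in this study: the function name `balance_by_category`, the `cap` parameter, and the document identifiers are hypothetical, and the toy category sizes (trade 81, tin 10, retail 1) are borrowed from the appendix table only for illustration.

```python
import random
from collections import Counter, defaultdict


def balance_by_category(docs, cap, seed=0):
    """Downsample so that no category contributes more than `cap` documents.

    `docs` is a list of (doc_id, category) pairs. This helper is a
    hypothetical illustration, not part of the original study.
    """
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    by_category = defaultdict(list)
    for doc_id, category in docs:
        by_category[category].append(doc_id)

    balanced = []
    for category in sorted(by_category):
        ids = by_category[category]
        # Keep small categories whole; randomly sample large ones down to `cap`.
        chosen = ids if len(ids) <= cap else rng.sample(ids, cap)
        balanced.extend((doc_id, category) for doc_id in chosen)
    return balanced


# Toy corpus with an uneven category distribution (illustrative sizes only).
docs = (
    [(f"trade-{i}", "trade") for i in range(81)]
    + [(f"tin-{i}", "tin") for i in range(10)]
    + [("retail-0", "retail")]
)

balanced = balance_by_category(docs, cap=10)
counts = Counter(category for _, category in balanced)
print(counts)
```

Capping evens out the large categories, but it cannot help a category such as retail that has only one document to begin with. Such categories would still need additional data, which is why a differently distributed corpus is suggested above.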

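Future research could address other directions in intelligence mining, such as event tracking and key entity detection. In addition, this study analyzed documents in a single language only, yet many intelligence documents are cross-lingual. Developing detection methods for multilingual documents would make it possible to discover important events occurring in different countries, with greater timeliness and richer content.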


Appendix

                               Accuracy at each threshold
Category              Documents      0.3      0.4      0.5      0.6      0.7
oilseed                      39    0.00%   10.26%   38.46%   66.67%   79.49%
orange                        7    0.00%    0.00%   14.29%   71.43%   85.71%
palm-oil                      6    0.00%   33.33%   33.33%   50.00%   83.33%
pet-chem                     11    0.00%   18.18%   45.45%   63.64%  100.00%
rapeseed                      9    0.00%   33.33%   44.44%   77.78%   88.89%
reserves                     11    0.00%    9.09%   45.45%   54.55%   90.91%
retail                        1    0.00%    0.00%    0.00%    0.00%    0.00%
rice                         19    0.00%   15.79%   36.84%   47.37%   68.42%
rubber                       10    0.00%    0.00%   10.00%   90.00%  100.00%
ship                         77    1.30%    2.60%   10.39%   42.86%   67.53%
silver                        5    0.00%    0.00%    0.00%   40.00%   80.00%
sorghum                       7    0.00%    0.00%   42.86%   71.43%   71.43%
soy-meal                     10    0.00%    0.00%   10.00%   40.00%   80.00%
soy-oil                       8    0.00%   12.50%   50.00%   75.00%  100.00%
soybean                      24    0.00%    4.17%   25.00%   50.00%   75.00%
stg                           1    0.00%    0.00%    0.00%    0.00%    0.00%
strategic-metal              11    0.00%    9.09%   36.36%   36.36%   45.45%
sugar                        31    0.00%    0.00%   12.90%   51.61%   77.42%
tin                          10    0.00%    0.00%    0.00%   30.00%   60.00%
trade                        81    0.00%    7.41%   23.46%   40.74%   61.73%
veg-oil                      26    0.00%   11.54%   26.92%   46.15%   69.23%
wheat                        62    0.00%   14.52%   46.77%   79.03%   90.32%
wpi                           9    0.00%   11.11%   55.56%   77.78%   88.89%
yen                           8    0.00%    0.00%   25.00%   87.50%  100.00%
zinc                         12    0.00%    8.33%   16.67%   58.33%   66.67%
Successful detections                593     1034     1497     1890     2135
Accuracy                          25.18%   43.91%   63.57%   80.25%   90.66%
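The overall accuracy row appears to be the number of successful detections divided by the total number of test documents. The sketch below checks that reading. It assumes a total of 2355 documents, a value inferred from the reported figures and not stated in this section; the variable names are hypothetical.

```python
# Assumption: 2355 is inferred from the reported totals, not given in the source.
TOTAL_DOCS = 2355

# Successful detections per threshold, as reported in the table.
successes = {0.3: 593, 0.4: 1034, 0.5: 1497, 0.6: 1890, 0.7: 2135}

# Overall accuracy (in percent, two decimals) at each threshold.
percentages = {
    threshold: round(100 * count / TOTAL_DOCS, 2)
    for threshold, count in successes.items()
}

for threshold, pct in percentages.items():
    print(f"threshold {threshold}: {pct:.2f}%")
```

Under that assumption the computed values match the reported accuracy row at every threshold, which supports reading the row as a simple ratio over a fixed document set.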

Appendix 2: Novel Event Detection Accuracy

                                            Accuracy at each threshold
Category                         Documents       0.3       0.4       0.5
bop                                     58         1  0.775862   0.37931
cpi                                     65  0.984615  0.753846  0.353846
gnp                                     67         1  0.970149   0.80597
housing                                 14         1  0.785714  0.142857
interest                               308  0.935065  0.788961  0.428571
ipi                                     34         1  0.911765  0.558824
jobs                                    47  0.914894  0.595745  0.468085
money-fx                               355  0.994366  0.864789  0.538028
money-supply                            89         1  0.707865  0.359551
reserves                                51         1  0.823529  0.588235
retail                                  14         1  0.785714  0.357143
ship                                   132  0.992424  0.939394   0.69697
trade                                  274  0.992701  0.908759  0.733577
wpi                                     21         1  0.619048  0.238095
Documents successfully detected                 1265      1071       703
Accuracy                                    0.976834  0.827027  0.542857
