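Recommendations and Directions for Future Research

Chapter 5 Conclusions and Recommendations

Section 2 Recommendations and Directions for Future Research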

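The following discusses several aspects of the present study that remain to be improved, and proposes topics and related research that could extend this work, as directions for future research.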

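1. When comparing the knowledge feature items, this study only compared the final accuracy. The overall classification process could be examined in finer detail, looking at how much the knowledge feature items actually improve the result for each test document. Because the method classifies by scoring, it should be possible to design an indicator that measures this improvement.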

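2. In addition, many current document classification methods incorporate user information to improve accuracy. The original architecture of this method does not include that feature, nor does it have a feedback mechanism, so the result of each classification cannot be used for further learning and improvement. Integrating these concepts into the method should benefit overall classification performance.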

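3. Because this classification method mainly aims at the highest accuracy, the time required for classification was not considered. This is another direction that later researchers could explore.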

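4. During the research, the synonym problem and the multi-class assignment problem were also observed. This study only made simple assumptions about them or placed them in the appendix, without examining them in depth. Solving these problems and integrating the solutions into this method would make it more complete.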
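As an illustration of the indicator suggested in item 1, the sketch below shows one hypothetical form it could take. This study does not define such an indicator; the margin-based definition, the function names `margin` and `knowledge_feature_gain`, and the sample scores are all assumptions made for this example. The idea is to compare, for one test document, how far the correct category leads its strongest competitor with and without the knowledge feature items.

```python
def margin(scores, true_class):
    """Score of the correct category minus the best competing score.

    `scores` maps category name -> score for one test document.
    A positive margin means the document is classified correctly.
    """
    best_other = max(s for cls, s in scores.items() if cls != true_class)
    return scores[true_class] - best_other


def knowledge_feature_gain(base_scores, enriched_scores, true_class):
    """Hypothetical per-document indicator (an assumption, not the study's).

    Returns the change in margin when knowledge feature items are added.
    """
    return margin(enriched_scores, true_class) - margin(base_scores, true_class)


# Made-up scores for a single test document whose true category is "XML".
base = {"XML": 10.0, "Servers": 12.0, "Media player": 4.0}
enriched = {"XML": 18.0, "Servers": 13.0, "Media player": 4.0}

gain = knowledge_feature_gain(base, enriched, "XML")
print(gain)
```

Averaging such a gain over all test documents would give a finer-grained picture than final accuracy alone, although other definitions are equally plausible.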


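Appendix A: Original Sources of the XML Documents Collected in This Study

This study originally collected 468 XML documents from 9 different software packages. After each document was opened and its actual content inspected, it was assigned to a classification category. Table A-1 lists the number of XML documents originally collected from each package, together with the package's version.

Table A-1 Original sources of the XML documents collected in this study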

Category no.  Source software      Version        Number of XML documents
1             Microsoft Search     9.107.5512.0   14
2             SQL Server           2000           15
3             Microsoft ACT        1.0            25
4             Office               XP             20
5             .Net Framework       1.0            40
6             Matlab               6.5            61
7             Winamp               3.0            72
8             Dreamweaver          MX             86
9             XML Spy Enterprise   5.0            135
Total                                             468

Appendix B: Sources of the Metadata for Each Category

HTML editors:

Joyce, J. E. 2003. Dreamweaver MX: complete course. Wiley Pub., New York.

Mathematics software:

Desmond, J. H., N. J. Higham. 2000. MATLAB guide. Society for Industrial and Applied Mathematics, Philadelphia.

Media player:

http://www.winamp.com/support/

Office suites:

Jodi, D., C. Greaves, M. Groh, B. Hallberg, M. Harding, F. Houlette, R. Tidrow. 1994. Inside Microsoft Office professional. New Riders Pub., Indianapolis, Ind.

Servers:

http://www.microsoft.com/applicationcenter/techinfo/productdoc/default.asp

Web programming:

Eric, B., H. H. Feng, E. L. W. Soong, D. Zhang, S. S. Zhu. 2002. Fundamentals of Web applications using .NET and XML. Prentice Hall, Upper Saddle River, NJ.

XML:

http://www.altova.com/support_help.html

Appendix C: Multi-Class Assignment Function

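During this research it was found that, because the method classifies an unknown document by computing scores, an indicator can be designed that does more than assign the document to a single category: it can also rank the document's similarity to every category.

This appendix therefore introduces and explains the multi-class assignment function of the method.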

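For each unknown document, this study defines a similarity $\mathrm{Similarity}_{class}$ with respect to every classification category. It is computed by dividing the document's score in each category, $\mathrm{Score}_{class}$, by the largest score among all categories, $\mathrm{Max}(\mathrm{Score}_{class})$, as shown in Equation C-1:

$$\mathrm{Similarity}_{class} = \frac{\mathrm{Score}_{class}}{\mathrm{Max}(\mathrm{Score}_{class})}, \qquad class = 1, 2, 3, \ldots, n \tag{C-1}$$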

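For example, as shown in Figure C-1, suppose there are 9 categories. The document belongs to the fifth category, because its score there is the highest, but it also has a different similarity value in each of the other categories.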

Figure C-1 Similarity of an unknown document to each category
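The normalization in Equation C-1 and the resulting ranking can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in the study: the function names `class_similarities` and `rank_classes` are hypothetical, the nine scores are invented for the example, and the check for a positive maximum score is an added assumption, since the source does not say how an all-zero score vector is handled.

```python
def class_similarities(scores):
    """Apply Equation C-1: divide each category score by the maximum score.

    `scores` maps category number -> score of one unknown document.
    """
    top = max(scores.values())
    if top <= 0:
        # Assumption: scores are non-negative and at least one is positive.
        raise ValueError("the maximum score must be positive")
    return {cls: s / top for cls, s in scores.items()}


def rank_classes(scores):
    """Return (category, similarity) pairs, most similar category first."""
    sims = class_similarities(scores)
    return sorted(sims.items(), key=lambda kv: kv[1], reverse=True)


# Invented scores for nine categories; category 5 has the highest score.
example_scores = {1: 12.0, 2: 30.0, 3: 6.0, 4: 18.0, 5: 60.0,
                  6: 45.0, 7: 3.0, 8: 24.0, 9: 15.0}

ranking = rank_classes(example_scores)
for cls, sim in ranking:
    print(cls, sim)
```

The category with the highest score always receives a similarity of 1, so the first entry of the ranking is the single-category answer and the remaining entries give the secondary assignments.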

