
Chapter 7: Conclusions and Future Research Directions

7.4. Future Research Directions and Priorities of This Study

This thesis groups its future research directions and priorities into the following four points:

(1) Finding a method that yields an optimal solution:

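In the experimental part of this study, seven experiments were carried out to find the best number of dimensions and the most suitable similarity threshold. Their main purpose was to evaluate the break-even point and the overall performance under different conditions, and only from the resulting experimental data could the best parameter setting for the corpus be identified. Given this need, future work aims to find an optimization formula that replaces these tedious trial-and-error experiments.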
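The manual search described above amounts to sweeping candidate values and scoring each one. The following is a minimal sketch of such a sweep for the similarity threshold only, assuming similarity scores and binary relevance labels are already available for one dimension setting. The function names and the toy data are hypothetical and are not taken from the thesis.

```python
def precision_recall(scores, labels, threshold):
    """Precision and recall when items with score >= threshold are accepted."""
    accepted = [lab for s, lab in zip(scores, labels) if s >= threshold]
    true_pos = sum(accepted)
    total_pos = sum(labels)
    precision = true_pos / len(accepted) if accepted else 0.0
    recall = true_pos / total_pos if total_pos else 0.0
    return precision, recall


def break_even_threshold(scores, labels, candidates):
    """Return (threshold, precision, recall) with the smallest |precision - recall|."""
    best = None
    for t in candidates:
        p, r = precision_recall(scores, labels, t)
        if p == 0.0 and r == 0.0:
            continue  # degenerate case: nothing relevant was accepted
        gap = abs(p - r)
        if best is None or gap < best[0]:
            best = (gap, t, p, r)
    return best[1:] if best else None


# Toy data (hypothetical): similarity of six documents to a class centroid,
# and whether each document really belongs to the class (1) or not (0).
scores = [0.91, 0.84, 0.77, 0.62, 0.55, 0.30]
labels = [1, 1, 0, 1, 0, 0]
candidates = [0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]

threshold, precision, recall = break_even_threshold(scores, labels, candidates)
```

Repeating this sweep for every candidate number of dimensions gives the full grid that the experiments explore by hand, which is the cost an optimization formula would remove.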

(2) Integration into existing search engines:

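Existing search engines still cannot fully provide multilingual or cross-lingual document retrieval or document classification, so multilingual text mining may become a central research problem and a core component of next-generation search engines. With multilingual text mining in place, a user can enter a query in a single language and receive results in more than one language, and user feedback can be used to keep refining both retrieval quality and classification quality. Another important factor for a search engine is time: if mining takes too long, impatient users will switch to another search engine. In the text mining method proposed in this study, the initial singular value decomposition (SVD) may require a large amount of computation time, but subsequent document retrieval and updates to the class centroids need only a small amount of arithmetic, which effectively reduces the cost of retrieval. As for the SVD itself, recomputing it as a background job during off-peak hours is also a good option.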
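To illustrate why retrieval and centroid updates are cheap once the decomposition exists, here is a minimal sketch assuming the standard LSI fold-in, q_hat = q^T U_k Sigma_k^{-1}, cosine similarity for ranking, and a running mean for the class centroid. Whether the thesis uses exactly these formulas is an assumption, and the matrices below are hypothetical toy values rather than the output of a real decomposition.

```python
import math


def fold_in(term_vector, U_k, singular_values):
    """Project a term-frequency vector into the k-dimensional LSI space.

    Standard LSI fold-in: q_hat = q^T * U_k * inverse(Sigma_k).
    U_k is a list of rows (one per term), each row holding k values.
    """
    k = len(singular_values)
    return [
        sum(term_vector[i] * U_k[i][j] for i in range(len(term_vector)))
        / singular_values[j]
        for j in range(k)
    ]


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def update_centroid(centroid, count, new_vector):
    """Running mean: fold one more document vector into a class centroid."""
    updated = [(c * count + v) / (count + 1) for c, v in zip(centroid, new_vector)]
    return updated, count + 1


# Hypothetical toy values: 3 terms, k = 2 retained dimensions.
U_k = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
singular_values = [2.0, 1.0]
query = [2.0, 0.0, 1.0]              # raw term frequencies of the query

query_hat = fold_in(query, U_k, singular_values)
sim_a = cosine(query_hat, [0.5, 0.0])    # document close to the query
sim_b = cosine(query_hat, [0.0, 0.8])    # document orthogonal to the query
centroid, count = update_centroid([1.0, 0.0], 1, [0.0, 1.0])
```

Each query costs one small matrix-vector product plus one dot product per candidate document, and a centroid update touches only k numbers, so none of these steps requires recomputing the SVD.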

(3) Applying LSI to high-dimensional sparse matrices:

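Singular value decomposition is the core of text mining based on latent semantic indexing (LSI). Because the decomposition requires memory to be allocated for every element of the matrix, a serious problem arises as the matrix keeps growing: how to carry out the decomposition when the hardware cannot allocate that much memory. Future work therefore aims to overcome this shortcoming and to propose a solution, so that the technique is no longer constrained by the hardware environment.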
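One common way to reduce the memory problem is to store only the non-zero entries and to use an iterative method that needs nothing more than matrix-vector products. The sketch below illustrates that idea with power iteration for the largest singular value. It is not the solution proposed by the thesis, which leaves this as future work, and the dictionary-based storage and the toy matrix are assumptions made for illustration.

```python
import math


def matvec(entries, x, n_rows):
    """y = A x, touching only the stored non-zero entries."""
    y = [0.0] * n_rows
    for (i, j), value in entries.items():
        y[i] += value * x[j]
    return y


def rmatvec(entries, y, n_cols):
    """x = A^T y, again using only the non-zero entries."""
    x = [0.0] * n_cols
    for (i, j), value in entries.items():
        x[j] += value * y[i]
    return x


def norm(v):
    return math.sqrt(sum(a * a for a in v))


def top_singular_value(entries, n_rows, n_cols, tol=1e-12, max_iter=1000):
    """Power iteration on A^T A; returns the largest singular value of A."""
    v = [1.0 / math.sqrt(n_cols)] * n_cols
    sigma = 0.0
    for _ in range(max_iter):
        w = rmatvec(entries, matvec(entries, v, n_rows), n_cols)
        w_norm = norm(w)
        if w_norm == 0.0:
            return 0.0
        v = [a / w_norm for a in w]
        new_sigma = norm(matvec(entries, v, n_rows))
        if abs(new_sigma - sigma) < tol:
            return new_sigma
        sigma = new_sigma
    return sigma


# Hypothetical 4-term x 3-document matrix with only 4 non-zero entries,
# stored as {(row, column): value}. A^T A is diag(25, 4, 1) for this matrix.
entries = {(0, 0): 3.0, (1, 1): 2.0, (2, 2): 1.0, (3, 0): 4.0}
sigma = top_singular_value(entries, n_rows=4, n_cols=3)
```

Memory use here grows with the number of non-zero entries rather than with rows times columns. Library routines for truncated SVD of sparse matrices, such as `scipy.sparse.linalg.svds`, rest on the same principle.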

(4) Parallelizing LSI computation:

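Because of the singular value decomposition, the program spends a great deal of time approximating the original term-document matrix through series computations. It repeats the procedure until the error falls within a user-defined tolerance, which results in many complex mathematical operations. Distributing this work across different computers would further shorten the time cost of the whole mining process. Cluster computing is a low-cost, high-performance computing platform that can be assembled from several interconnected personal computers; with an additional parallel library, it becomes a distributed computing platform. Future work plans to use cluster-based parallel and distributed computing to parallelize the series method for the singular value decomposition, and then to parallelize the whole text mining process in stages, so that decisions can be made in a timely manner.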
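The basic decomposition behind such a parallelization can be sketched on a single machine: the matrix is cut into row blocks, each worker multiplies its own block by the vector, and the partial results are concatenated. The sketch below uses a thread pool purely as a stand-in for cluster nodes, which is an assumption made for illustration. In CPython, threads do not speed up this kind of arithmetic, and a real cluster would exchange the blocks through a message-passing library that the thesis does not name.

```python
from concurrent.futures import ThreadPoolExecutor


def block_matvec(rows, x):
    """Multiply one block of matrix rows by the vector x."""
    return [sum(a * b for a, b in zip(row, x)) for row in rows]


def split_rows(matrix, n_blocks):
    """Cut the matrix into n_blocks contiguous row blocks of near-equal size."""
    size, extra = divmod(len(matrix), n_blocks)
    blocks, start = [], 0
    for b in range(n_blocks):
        end = start + size + (1 if b < extra else 0)
        blocks.append(matrix[start:end])
        start = end
    return blocks


def parallel_matvec(matrix, x, n_workers=2):
    """y = A x with each row block handled by a separate worker."""
    blocks = split_rows(matrix, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # map() returns results in block order, so concatenation is safe.
        partial_results = list(pool.map(block_matvec, blocks, [x] * len(blocks)))
    return [value for part in partial_results for value in part]


# Hypothetical 5 x 3 matrix and vector.
matrix = [[1, 0, 2], [0, 3, 0], [4, 0, 0], [0, 1, 1], [2, 2, 2]]
x = [1, 2, 3]
y = parallel_matvec(matrix, x, n_workers=2)
```

Because iterative SVD methods spend most of their time in products of this form, splitting the rows across machines is a natural first stage before parallelizing the rest of the mining pipeline.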


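Source document: 一個監督式學習與非監督式學習技術應用於多國語言文件探勘之比較研究 (A Comparative Study of Supervised and Unsupervised Learning Techniques Applied to Multilingual Text Mining), pp. 107-114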
