Chapter 5 Conclusions and Future Research Directions
Section 2 Future Research Directions
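This thesis takes the TF*IDF method, which traditional information retrieval uses to build document indexes, and raises it from a document index without semantics to one with semantics. To determine word senses, it uses and improves on lexical chains and proposes a compound semantic weight representation that identifies the sense of each term. This improves the accuracy of document clustering and serves as the basis for recommending literature to users.
In future work, we will study the following points in more depth: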
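1. Adding strategies for word sense disambiguation: This thesis proposes two chain-expansion strategies. Their purpose is to reduce the proportion of words whose sense cannot be determined, but in exchange they increase the proportion of incorrect determinations. Future work can look for strategies that improve both at the same time.
2. Adding verb sense disambiguation: This thesis performs word sense disambiguation only on the nouns in a document. However, the proposed method can be applied to any part of speech, provided that a complete dictionary is available that defines rich relations among sets of word senses for reference. If the senses of both the nouns and the verbs in a document can be determined, the meaning of the document becomes easier to understand.
3. Extending user-defined clustering: Future work can make user-defined clustering more flexible. For example, the user could keep all the documents of an entire cluster, keep only some of the documents in a cluster, or organize documents of interest into a new cluster.
4. User satisfaction survey for recommendations: If users return feedback to the system on the literature it recommends, the system can be made more complete through continual refinement and learning.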
Appendix A: Stemming - Porter's Algorithm [25]
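V denotes a vowel (a, e, i, o, u); C denotes a consonant; L denotes any letter (vowel or consonant). Any combination of C, V, and L is called a pattern. Ø denotes the empty string (one with no letters); * denotes zero or more repetitions of a pattern; + denotes one or more repetitions of a pattern.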
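As an illustration of this notation, the sketch below expresses two of the appendix's rule conditions as Python regular expressions. It is a minimal sketch, not part of the thesis system: the names `V`, `C`, `EED_CONDITION`, `ED_OR_ING_CONDITION`, and `satisfies` are hypothetical, the input is assumed to be a lowercase word over a-z, and vowels are taken to be exactly a, e, i, o, u as defined above.

```python
import re

# Character classes following the appendix's notation (assumption: lowercase a-z input).
V = "[aeiou]"              # V: a vowel
C = "[b-df-hj-np-tv-z]"    # C: any letter that is not a vowel

# Condition (C)*((V)+(C)+)+(V)*eed, written as a regular expression.
EED_CONDITION = re.compile(f"{C}*(?:{V}+{C}+)+{V}*eed")

# Condition (*V*ed or *V*ing): a vowel must occur before the suffix.
ED_OR_ING_CONDITION = re.compile(f".*{V}.*(?:ed|ing)")


def satisfies(condition: re.Pattern, word: str) -> bool:
    """Return True when the whole word matches the given condition."""
    return condition.fullmatch(word) is not None


print(satisfies(EED_CONDITION, "agreed"))          # True: "agr" is V+C+ before "eed"
print(satisfies(EED_CONDITION, "feed"))            # False: only "f" precedes "eed"
print(satisfies(ED_OR_ING_CONDITION, "motoring"))  # True: "motor" contains a vowel
print(satisfies(ED_OR_ING_CONDITION, "sing"))      # False: "s" contains no vowel
```

Anchoring the match to the whole word (via `fullmatch`) keeps the literal suffix at the end of the word, so the pattern part can only describe the letters that precede it.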
select rule with longest suffix {
    sses → ss;
    ies → i;
    ss → ss;
    s → Ø;
}
select rule with longest suffix {
    if ((C)*((V)+(C)+)+(V)*eed) then eed → ee;
    if (*V*ed or *V*ing) then {
        select rule with longest suffix {
            ed → Ø;
            ing → Ø;
        }
        select rule with longest suffix {
            at → ate;
            entli → ent;
            ous → Ø;
            ive → Ø;
            ize → Ø;
            if (*s or *t) then ion → Ø;
        }
    }
}
select rule with longest suffix {
    if ((C)*((V)+(C)+)((V)+(C)+)+(V)*) then e → Ø;
    if (((C)*((V)+(C)+)(V)*) and not ((*C1V1C2) and (C2 ∉ {w, x, y}))) then e → Ø;
}
if ((C)*((V)+(C)+)((V)+(C)+)+(V)*ll) then ll → l;
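The "select rule with longest suffix" construct used throughout this appendix can be sketched in Python as follows. This is only an illustrative sketch of the first rule group (the plural rules), under the assumption that the input is a single lowercase word; the names `apply_longest_suffix` and `PLURAL_RULES` are hypothetical, and the conditional rules of the later groups are not covered.

```python
from typing import List, Tuple

# First rule group of the appendix: (suffix, replacement) pairs.
# The empty string plays the role of Ø.
PLURAL_RULES: List[Tuple[str, str]] = [
    ("sses", "ss"),
    ("ies", "i"),
    ("ss", "ss"),
    ("s", ""),
]


def apply_longest_suffix(word: str, rules: List[Tuple[str, str]]) -> str:
    """Apply the one rule whose suffix is the longest that ends the word."""
    # Try longer suffixes first so that "sses" wins over "ss" and "s".
    for suffix, replacement in sorted(rules, key=lambda r: len(r[0]), reverse=True):
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word  # no rule applies; the word is left unchanged


for w in ["caresses", "ponies", "caress", "cats", "tree"]:
    print(w, "->", apply_longest_suffix(w, PLURAL_RULES))
# caresses -> caress
# ponies -> poni
# caress -> caress
# cats -> cat
# tree -> tree
```

The rule ss → ss looks like a no-op, but under longest-suffix selection it prevents the shorter rule s → Ø from stripping the final letter of words such as "caress".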
References
[1] G. Salton, Automatic text processing. Addison-Wesley, 1989.
[2] G.A. Miller, “WordNet: An On-line Lexical Database,” International Journal of Lexicography, vol. 3, no. 4, pp.235-312, 1990.
[3] R. Mihalcea and D.I. Moldovan, “Word Sense Disambiguation Based on Semantic Density,” Use of WordNet in Natural Language Processing Systems: Proceedings of the Conference, 1999.
[4] G.A. Miller, M. Chodorow, S. Landes, C. Leacock and R.G. Thomas, “Using a Semantic Concordance for Sense Identification,” Proceedings of the ARPA Human Language Technology Workshop, pp.240-243, 1994.
[5] X. Li, S. Szpakowicz and S. Matwin, “A WordNet-Based Algorithm for Word Semantic Sense Disambiguation,” Proceedings of the 14th International Joint Conference on Artificial Intelligence IJCAI-95, pp.1368-1374, 1995.
[6] R. Barzilay and M. Elhadad, “Using Lexical Chains for Text Summarization,” ACL/EACL Workshop on Intelligent Scalable Text Summarization, 1997.
[7] D.I. Moldovan and R. Mihalcea, “Using WordNet and Lexical Operators to Improve Internet Searches,” Internet Computing, IEEE, vol. 4, no. 1, pp.34-43, 2000.
[8] A. Montoyo and M. Palomar, “Word Sense Disambiguation with Specification Marks in Unrestricted Texts,” Proceedings of 11th International Workshop on Database and Expert Systems Applications, IEEE, pp.103-107, 2000.
[9] H.T. Ng and H.B. Lee, “Integrating Multiple Knowledge Sources to Disambiguate Word Sense: An Exemplar-Based Approach,” Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics (ACL-96), pp.40-56, 1996.
[10] A. Suarez, M. Noeda and M. Palomar, “A Method of Restricted Knowledge Acquisition from WordNet,” Proceeding of the 3rd International Conference on Knowledge-Based Intelligent Information Engineering System, IEEE, pp.38-41, 1999.
[11] J. Stetina, S. Kurohashi and M. Nagao, “General Word Sense Disambiguation Method Based on a Full Sentential Context,” Use of WordNet in Natural Language Processing Systems: Proceedings of the Conference, 1998.
[12] D. Yarowsky, “Unsupervised Word Sense Disambiguation Rivaling Supervised Methods,” Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics, pp.189-196, 1995.
[13] P. Resnik and D. Yarowsky, “A Perspective on Word Sense Disambiguation Methods and Their Evaluation,” Proceedings of ACL Siglex Workshop on Tagging Text with Lexical Semantics, Why, What and How? Washington, pp.79-86, 1997.
[14] P. Resnik, “Selectional Preference and Sense Disambiguation,” Proceedings of ACL Siglex Workshop on Tagging Text with Lexical Semantics, Why, What and How? Washington, 1997.
[15] B. Sankaran and V. Vaidebi, “Role of Collocations and Case-Markers in Word Sense Disambiguation: A Clustering-Based Approach,” IEEE International Conference on Systems, Man and Cybernetics, vol. 1, pp.625-630, 2002.
[16] R. Bruce and J. Wiebe, “Word Sense Disambiguation using Decomposable Models,” Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, pp.139-146, June 1994.
[17] G. Rigau, J. Atserias and E. Agirre, “Combining Unsupervised Lexical Knowledge Methods for Word Sense Disambiguation,” Proceedings of the 35th
[18] K. Lindent and K. Lagus, “Word Sense Disambiguation in Document Space,” IEEE International Conference on Systems, Man and Cybernetics, vol. 3, pp.288-293, 2002.
[19] G. Sidorov and A. Gelbukh, “Automatic Detection of Semantically Primitive Words Using Their Reachability in an Explanatory Dictionary,” IEEE International Conference on Systems, Man, and Cybernetics, vol. 3, pp.1683-1687, 2001.
[20] A. Hardy, “On the Number of Clusters,” Computational Statistics and Data Analysis, vol. 23, pp.83-96, 1996.
[21] G. Milligan and M. Cooper, “An Examination of Procedures for Determining the Number of Clusters in a Data Set,” Psychometrika, vol. 50, no. 2, pp.159-179, 1985.
[22] A.H. Tan, “Personalized Information Management for Web Intelligence,” Proceedings of the 2002 IEEE International Conference on FUZZ-IEEE'02, vol. 2, pp.1045-1050, 2002.
[23] A.H. Tan, FOCI (search, cluster and personalize WWW, Patents, Publication and News). Available at http://textmining.krdl.org.sg/FOCI/
[24] M. Geffet and G. Feitelson, “Hierarchical Indexing and Document Matching in BoW,” Joint ACM/IEEE Conf. Digital Libraries, pp.259-267, 2001. Available at http://www.bow4.cs.huji.ac.il/bow/
[25] R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval. Addison-Wesley, ACM Press, New York, 1999.
[26] Digital Library Federation, “A Working Definition of Digital Library,” 1998. Available at http://www.clir.org/diglib/dldefinition.htm
[27] RefWorks (Your Personal Web-based Database and Bibliography Creator).
[28] WordNet (a lexical database for the English language). Available at http://www.cogsci.princeton.edu/~wn/