Improving Chinese Document Classification Performance with Automatic Summarization

4 System Implementation, Experimental Results, and Analysis

4.5 Experimental Design and Results

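Figure 4-14 Comparison of classifier results (Macro-F) (Dataset B). [Plot: x-axis, number of dimensions (1000 to 5000); y-axis, Macro-F; series: KNN classifier, the proposed classifier, SVM classifier.]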

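Taken together, the results on the two datasets above show no significant difference in performance between the proposed classifier and SVM. The advantage of the proposed classifier is that it also produces an indicative summary, which lets users judge whether a classification is appropriate. In addition, because it uses the category as the vector unit, deciding the category of a test (or new) document takes far less computation than when the document is the vector unit. It is worth noting that SVM performs about the same across dimensionalities. This echoes the point made in Chapter 2 that SVM usually does not need feature selection, because it filters out the important features during its own computation; hence raising the dimensionality from 1000 to 5000 brings no clear improvement. By contrast, the conventional KNN classifier gives unsatisfactory results, possibly because the document vectors mapped from the dataset contain too much noise.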
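The cost difference between the two vector units can be illustrated with a minimal Python sketch. It assumes cosine similarity over sparse term-weight dictionaries, category vectors built by summing the training document vectors of each category, and toy English data; all function names, weights, and these modelling choices are hypothetical and are not taken from the thesis implementation.

```python
import math

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def build_category_vectors(training_docs):
    """Assumption: a category vector is the sum of its documents' vectors."""
    vectors = {}
    for label, vec in training_docs:
        target = vectors.setdefault(label, {})
        for term, weight in vec.items():
            target[term] = target.get(term, 0.0) + weight
    return vectors

def classify_by_category(doc, category_vectors):
    """Category as vector unit: one comparison per category."""
    return max(category_vectors, key=lambda c: cosine(doc, category_vectors[c]))

def classify_by_document(doc, training_docs):
    """Document as vector unit: one comparison per training document."""
    label, _ = max(training_docs, key=lambda item: cosine(doc, item[1]))
    return label

# Toy training data (hypothetical).
training_docs = [
    ("Finance", {"stock": 1.0, "bank": 1.0}),
    ("Finance", {"bank": 1.0, "rate": 1.0}),
    ("Sports", {"game": 1.0, "team": 1.0}),
    ("Sports", {"team": 1.0, "score": 1.0}),
]
category_vectors = build_category_vectors(training_docs)
new_doc = {"bank": 1.0, "rate": 0.5}

by_category = classify_by_category(new_doc, category_vectors)
by_document = classify_by_document(new_doc, training_docs)
comparisons = (len(category_vectors), len(training_docs))
```

In this toy case both approaches agree, but the category-vector classifier needs 2 similarity computations while the document-vector classifier needs 4; the gap grows with the size of the training set.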

4.5.5 Experiment 5

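The purpose of Experiment 5 is to examine how adding synonym matching affects classification accuracy. The classifier's feature dimensionality N is set to 400 and 1000, and the similarity threshold β of the synonym dictionary is set to 0.4 and 0.9. The classifier is run with two vector units: the category as feature vector and the document as feature vector. The dataset consists of 2161 articles in ten groups taken from Yahoo e-news between 2006/2/2 and 2006/2/14 (see Table 4-30); 30% of the data is used for training and 70% for testing.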

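Table 4-30 Dataset used for the synonym experiment, 2/2 to 2/14

Category        Documents
Politics        208
Sports          225
Finance         221
Entertainment   204
Technology      220
Cross-Strait    224
Leisure         219
Society         210
International   207
Health          223
TOTAL           2161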
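The 30/70 split can be checked with a short Python sketch. The per-category rounding rule is an assumption: the text only states the percentages, but rounding the 70% test share down within each category reproduces the per-category test counts and the total of 1508 test documents reported in Tables 4-32 and 4-33. The English category names are translations used only for this example.

```python
# Document counts per category from Table 4-30.
counts = {
    "Politics": 208, "Sports": 225, "Finance": 221, "Entertainment": 204,
    "Technology": 220, "Cross-Strait": 224, "Leisure": 219, "Society": 210,
    "International": 207, "Health": 223,
}

# Assumption: the test share is 70% of each category, rounded down;
# integer arithmetic avoids floating-point rounding surprises.
test_counts = {c: n * 70 // 100 for c, n in counts.items()}
train_counts = {c: counts[c] - test_counts[c] for c in counts}

total_docs = sum(counts.values())
total_test = sum(test_counts.values())
total_train = sum(train_counts.values())
```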

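The experimental results are shown in Table 4-34. Adding the synonym dictionary did not improve performance. The main reason is probably that most entries in the synonym dictionary are general-purpose words, so the more important domain-specific terms may have no synonyms in it. Table 4-31 gives an example: after the synonym dictionary was added (with the similarity threshold β set to 0.4), a document that originally belonged to the Finance category was classified as Politics. The matched words are all general-purpose words, such as 今天/明天 (today/tomorrow), 未來/明天 (future/tomorrow), and 沒有/低於 (without/below); these matches add noise to the comparison and lower performance.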

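Table 4-31 Comparison of one document against the Politics and Finance categories when the synonym dictionary is used

Politics category
Category-vector term   Document-vector term   Similarity
台灣                   台灣                   1.0
未來                   明天                   1.0
沒有                   低於                   0.4
建立                   建設                   0.4
政府                   委員會                 0.4
討論                   討論                   1.0
國會                   會議                   0.4

Finance category
Category-vector term   Document-vector term   Similarity
今天                   明天                   0.4
今年                   今年                   1.0
行政院                 行政院                 1.0
行政院                 委員會                 0.4
增加                   增加                   1.0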
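The effect of these matches can be reproduced with a small Python sketch. The matched pairs and similarity values are transcribed from Table 4-31; the scoring rule, however, is an assumption made only for illustration (the score of a category is taken as the plain sum of its match similarities, and "without synonyms" is modelled as keeping only identical terms). The thesis's actual similarity computation may differ.

```python
# (category-vector term, document-vector term, similarity) from Table 4-31.
politics = [
    ("台灣", "台灣", 1.0), ("未來", "明天", 1.0), ("沒有", "低於", 0.4),
    ("建立", "建設", 0.4), ("政府", "委員會", 0.4), ("討論", "討論", 1.0),
    ("國會", "會議", 0.4),
]
finance = [
    ("今天", "明天", 0.4), ("今年", "今年", 1.0), ("行政院", "行政院", 1.0),
    ("行政院", "委員會", 0.4), ("增加", "增加", 1.0),
]

def score(matches, use_synonyms):
    """Hypothetical score: sum of similarities; exact matches only if disabled."""
    return sum(sim for cat_term, doc_term, sim in matches
               if use_synonyms or cat_term == doc_term)

def classify(use_synonyms):
    """Pick the category with the larger hypothetical score."""
    scores = {
        "Politics": score(politics, use_synonyms),
        "Finance": score(finance, use_synonyms),
    }
    return max(scores, key=scores.get), scores

label_with, scores_with = classify(use_synonyms=True)
label_without, scores_without = classify(use_synonyms=False)
```

Under these assumptions, exact matching favours Finance (3.0 against 2.0), while the additional matches between general-purpose words tip the decision to Politics (4.6 against 3.8), which is the kind of noise described above.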

Table 4-32 Classification results with the category as the feature-vector unit, without the synonym dictionary

Dimensionality: 400
Category        Documents  Correct  Recall  Precision  F-measure
Politics        145        111      0.77    0.45       0.57
Sports          157        127      0.81    0.88       0.84
Finance         154        73       0.47    0.53       0.50
Entertainment   142        86       0.61    0.74       0.67
Technology      154        70       0.45    0.41       0.43
Cross-Strait    156        98       0.63    0.50       0.56
Leisure         153        92       0.60    0.56       0.58
Society         147        75       0.51    0.68       0.58
International   144        56       0.39    0.58       0.46
Health          156        98       0.63    0.80       0.70
TOTAL           1508       886      59%     61%
Accuracy        58.753%

Dimensionality: 1000
Category        Documents  Correct  Recall  Precision  F-measure
Politics        145        116      0.80    0.54       0.64
Sports          157        134      0.85    0.88       0.87
Finance         154        82       0.53    0.59       0.56
Entertainment   142        94       0.66    0.80       0.73
Technology      154        88       0.57    0.49       0.53
Cross-Strait    156        96       0.62    0.54       0.55
Leisure         153        101      0.66    0.64       0.59
Society         147        77       0.52    0.63       0.57
International   144        65       0.45    0.58       0.53
Health          156        109      0.70    0.83       0.63
TOTAL           1508       962      64%     57%
Accuracy        63.793%
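The measures in these tables can be recomputed from the counts with a short Python sketch. It assumes that recall is the number of correctly classified documents divided by the number of documents in the category, that the F-measure is the harmonic mean of precision and recall (F1), and that accuracy is the overall fraction of correctly classified test documents. The counts are those of the 400-dimension run in Table 4-32; the function names are hypothetical.

```python
def recall(correct, actual):
    """Fraction of a category's documents that were classified correctly."""
    return correct / actual

def f_measure(precision, recall_value):
    """Harmonic mean of precision and recall (F1)."""
    total = precision + recall_value
    return 2 * precision * recall_value / total if total else 0.0

# (documents, correctly classified) per category, Table 4-32, dimensionality 400.
rows_400 = {
    "Politics": (145, 111), "Sports": (157, 127), "Finance": (154, 73),
    "Entertainment": (142, 86), "Technology": (154, 70),
    "Cross-Strait": (156, 98), "Leisure": (153, 92), "Society": (147, 75),
    "International": (144, 56), "Health": (156, 98),
}

total_docs = sum(docs for docs, _ in rows_400.values())
total_correct = sum(correct for _, correct in rows_400.values())
accuracy_pct = round(100 * total_correct / total_docs, 3)

politics_recall = round(recall(111, 145), 2)
# Uses the rounded recall and the precision reported for Politics.
politics_f = round(f_measure(0.45, 0.77), 2)
```

With these definitions the totals (886 of 1508 documents, 58.753% accuracy) and the Politics row (recall 0.77, F-measure 0.57) agree with the table.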

Table 4-33 Classification results with the document as the feature-vector unit, without the synonym dictionary

Dimensionality: 400
Category        Documents  Correct  Recall  Precision  F-measure
Politics        145        106      0.73    0.47       0.57
Sports          157        135      0.86    0.86       0.86
Finance         154        74       0.48    0.56       0.52
Entertainment   142        83       0.58    0.61       0.60
Technology      154        85       0.55    0.41       0.47
Cross-Strait    156        90       0.58    0.50       0.54
Leisure         153        89       0.58    0.58       0.58
Society         147        73       0.50    0.62       0.55
International   144        51       0.35    0.71       0.47
Health          156        98       0.63    0.75       0.68
TOTAL           1508       884      58%     61%
Accuracy        58.621%

Dimensionality: 1000
Category        Documents  Correct  Recall  Precision  F-measure
Politics        145        112      0.77    0.51       0.61
Sports          157        137      0.87    0.90       0.89
Finance         154        74       0.48    0.54       0.51
Entertainment   142        90       0.63    0.78       0.70
Technology      154        97       0.63    0.42       0.50
Cross-Strait    156        85       0.54    0.52       0.53
Leisure         153        101      0.66    0.60       0.63
Society         147        73       0.50    0.62       0.55
International   144        52       0.36    0.70       0.48
Health          156        105      0.67    0.80       0.73
TOTAL           1508       926      61%     64%
Accuracy        61.406%

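Table 4-34 Comparison of accuracy with and without the synonym dictionary

                 With synonym dictionary             Without synonyms
Dimensionality   Threshold: 0.4   Threshold: 0.9     Category vector   Document vector
400              53.183%          57.361%            58.753%           58.621%
1000             60.146%          62.202%            63.793%           61.406%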

5 Conclusions and Future Research Directions

5.1 Conclusions

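This thesis proposes generating feature vectors from summaries, in place of existing dimension selection methods. Our basic argument is that a summary is the essence of an article, so the feature vector derived from it should be the purest and most representative. To find out whether feature vectors formed from document summaries (produced with a modified version of an existing summarization algorithm) can effectively improve Chinese document classification, we conducted five experiments. The results show that summary-based feature vectors achieve good classification performance on Chinese documents: better than the KNN classifier and comparable to the SVM classifier. Moreover, classification is better with the category as the vector unit than with the document as the vector unit, which echoes Occam's Razor: the best theory should be both correct and parsimonious (or elegant). Finally, a side benefit of this research is that it also produces an indicative summary of each document, so that users searching by topic can quickly grasp what a document is about.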

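The specific contributions are as follows. For the academic community, the proposed method gives later researchers another option for dealing with an excessively large dimension space. For practical application, many academic papers already come with an abstract; when such documents need to be classified, the proposed method can be applied, which avoids the subjectivity and errors of manual classification. For example, ICIM2006 assigned a paper on document classification to the software engineering topic.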

5.2 Future Research Directions

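This study targets Chinese document classification, and its main limitations are as follows. First, the method has so far been tested only with the nearest-neighbor classifier; its suitability for other classifiers remains to be tested. Second, only ten categories were tested; real applications involve more categories (20 or more), so the method's applicability there remains to be verified. Finally, because feature selection in this study relies on summarization, very short articles cannot be classified with this method; articles must be at least 300 characters long for the method to be effective.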

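Future research can proceed in three directions. (1) Finding relatedness between terms: besides using synonyms to extend term matching, one can try to identify related terms from the co-occurrence relations among terms in the corpus itself, or through term distributional clustering, in order to improve classification accuracy. (2) Applying other classification techniques: this thesis tested only the nearest-neighbor method; future work can apply the approach to other classifiers (e.g., SVM or Naïve Bayes) to assess its suitability. (3) Refining the synonym distance calculation: the current calculation treats words located in the same node of Tongyici Cilin (同義詞林) as having similarity 1, but this simplification may not match reality, because synonyms are not necessarily equivalent. Authors sometimes deliberately choose different words to express subtle differences between concepts; for example, 好 (good), 佳 (fine), and 優良 (excellent) still differ in degree. Choosing a more thorough distance (similarity) measure is therefore another direction for future work.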
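As an illustration of direction (1), the following Python sketch scores term pairs by how often they occur in the same document. It is only one possible reading of "co-occurrence": the Jaccard coefficient over document-level co-occurrence is an assumption chosen for brevity, the function name is hypothetical, and the toy documents are invented for the example.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_scores(documents):
    """Jaccard score of document-level co-occurrence for each term pair.

    documents: list of term lists (already segmented, stop words removed).
    Returns {(term_a, term_b): score} with term_a < term_b.
    """
    term_df = Counter()   # number of documents containing each term
    pair_df = Counter()   # number of documents containing both terms
    for terms in documents:
        unique = sorted(set(terms))
        term_df.update(unique)
        pair_df.update(combinations(unique, 2))
    return {
        pair: n / (term_df[pair[0]] + term_df[pair[1]] - n)
        for pair, n in pair_df.items()
    }

# Toy corpus (hypothetical).
docs = [
    ["stock", "bank", "rate"],
    ["bank", "rate"],
    ["game", "team"],
]
scores = cooccurrence_scores(docs)
```

Pairs with a high score (here "bank" and "rate") could then be treated as related when matching a document against a category vector, in the same way synonym pairs are treated in Experiment 5.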

7 Appendix

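Class                                        Description
masterproject.util.classify
masterproject.util.dataoupt                  Exports database data to text files; needed when using the distributional clustering method.
masterproject.util.setcorpus
masterproject.util.Setdatatype               Sets which categories are used and the numbers of training and test documents.
masterproject.util.SetOutputWekaData         Exports the data, in a single vector space and with the previously configured training and test counts, as WEKA data.
masterproject.util.Sort_Processor            Holds the sorting methods.
masterproject.util.SYNONYM_LIST_PROCESSOR    Handles synonyms.
masterproject.util.util_filewriter           Handles writing information to files.
masterproject.Word_Segment.Ckip_client       Sends text to the CKIP server, which returns an XML file.
masterproject.xmlprocess.StorageXMLDoctoDB   Main class for word segmentation; it calls the segmenter, parses the XML, and stores the results.
masterproject.GUI.Frame1                     User interface class.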

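Table 7-2 Procedure

Step  Description
1     In SQL Server, create a new database named "masterdb".
2     In masterdb, select the database backup file and restore it.
3     "Load the data": select the masterproject.util.setcorpus class. (The data is stored in one folder, and each category's data is kept in a subdirectory named after the category.) Set corpus_path in setcorpus to the directory where the data is stored, then run the main program of setcorpus to import the data.
4     "Preprocess the data": run the main program of masterproject.xmlprocess.StorageXMLDoctoDB to perform word segmentation and sentence segmentation.
5     Run the classification.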

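Table 7-3 Program command description

The main program of MasterProjectSKNN is classify.

First parameter: the proportion of training data, taken relative to the amount of data selected by the second parameter. For example, if the second parameter is 100 and the first is 30, then 30% of the data is used for training and 70% for testing.
Second parameter: the proportion of the data to use; 100 means all of the data is used.
Third parameter: the summary ratio.
Fourth parameter: the vector setting.
Fifth parameter: the classifier setting.
Sixth parameter: the dimensionality setting.

Pairings of the fourth and fifth parameters:

1 1: Document-vector classifier. The vector consists of the words that remain in the summary after stop words are removed. The sixth parameter has no effect but must still be given; 0 may be used.

2 1: Document-vector classifier. The dimensionality M is set by the sixth parameter: the M/C words with the highest TF-IDF values are selected from each category and then combined into a single vector space.

3 2: Category-vector classifier. The vector consists of the words that remain in the summary after stop words are removed.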
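The feature selection described for the "2 1" pairing can be sketched in Python as follows. This is a minimal illustration, not the MasterProjectSKNN code: it assumes that C is the number of categories, that TF-IDF values per category have already been computed, and that M/C is rounded down; the function name and the toy scores are hypothetical.

```python
def select_features(tfidf_by_category, m):
    """Union of the top m // C terms of each category, ranked by TF-IDF.

    tfidf_by_category: {category: {term: tfidf_value}}
    m: requested dimensionality M (the sixth parameter in Table 7-3).
    """
    per_category = m // len(tfidf_by_category)  # assumption: C = number of categories
    selected = set()
    for scores in tfidf_by_category.values():
        ranked = sorted(scores, key=scores.get, reverse=True)
        selected.update(ranked[:per_category])
    return sorted(selected)  # one shared vector space for all categories

# Toy TF-IDF values (hypothetical).
tfidf = {
    "Finance": {"bank": 0.9, "stock": 0.8, "today": 0.1},
    "Sports": {"team": 0.7, "game": 0.6, "today": 0.2},
}
features = select_features(tfidf, m=4)
```

Because the per-category selections are merged into one set, the resulting space can have fewer than M dimensions when the same term ranks highly in more than one category.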
