Comparison of Classification Methods

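This section compares the patent classification results of two further classifiers, the kNN classifier and the GIS classifier, with those of the vector space model classifier used in the preceding sections. The vector space model classifier and the kNN classifier are both widely used. They differ mainly in the number of feature representatives that a test document is compared against. The vector space model classifier computes similarity against one representative per category, so the number of representatives equals the number of categories. The kNN classifier is an instance-based method: it uses one representative per training document, so the number of representatives equals the number of documents in the training corpus.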

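The idea of the kNN classifier is to select the k training documents most similar to the test document and let those k documents determine its categories. In the experiments of Kin [2005], k = 10 was found to provide enough information to identify the category. Because a document can belong to several categories, a category is assigned to the test document when at least n of the 10 selected documents belong to it. If n is too small, too many categories are assigned; if n is too large, many documents receive too few categories or none at all. The value of n was chosen on the tuning corpus, which gave n = 3: any category that covers at least 3 of the 10 selected documents is assigned to the test document.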
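The voting rule above can be sketched as follows. This is a minimal illustration, not the thesis implementation: cosine similarity over sparse term-weight dictionaries is an assumption (this section does not name the similarity measure), and `cosine` and `knn_labels` are hypothetical helper names.

```python
from collections import Counter
import math


def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def knn_labels(test_vec, training, k=10, n=3):
    """Return every category shared by at least n of the k nearest documents.

    training is a list of (term_vector, set_of_categories) pairs.
    The similarity measure (cosine) is an assumption of this sketch.
    """
    ranked = sorted(training, key=lambda item: cosine(test_vec, item[0]), reverse=True)
    votes = Counter()
    for _, categories in ranked[:k]:
        votes.update(categories)  # one vote per category per neighbour
    return {c for c, count in votes.items() if count >= n}
```

With k = 10 and n = 3 this reproduces the setting reported above; a smaller n returns more categories, and a larger n may return an empty set.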

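The generalized instance set (GIS) algorithm [Lam, 1998] merges the representatives of several similar documents into a single representative, so the GIS classifier uses no more representatives than the kNN classifier. The numbers of representatives used by the three classifiers are related by |D| = kNN ≥ GIS ≥ vector space model = |C|, where |D| is the number of documents in the training corpus and |C| is the number of categories.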

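Following the GIS idea, we group training documents whose category sets are exactly the same into one class. In Table 4-11, for example, D1, D2 and D4 all have the category set {A, B, C}, so these three documents form one class, while D3, D5 and D6 each have a category set that no other document shares, so each forms a class of its own. The original three categories {A, B, C} thus become four GIS classes: {{A, B, C}, {A, B}, {A}, {B, C}}. Classification then needs four representatives: the test document is compared only with the representatives of these four GIS classes, and the category set of the most similar class is assigned to it.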

Training document    Categories
D1                   {A, B, C}
D2                   {A, B, C}
D3                   {A, B}
D4                   {A, B, C}
D5                   {A}
D6                   {B, C}

Table 4-11: Example of training documents and their GIS classes
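The grouping in Table 4-11 can be sketched in a few lines. The category sets below come from the table; everything else is an assumption made for illustration: the term vectors are invented, a GIS representative is formed by summing the vectors of its member documents, similarity is cosine, and `build_gis_classes` and `classify` are hypothetical names. The sketch follows the simplified grouping described in this section, not the full algorithm of Lam [1998].

```python
from collections import defaultdict
import math


def build_gis_classes(training):
    """Group documents whose category sets are identical.

    training maps a document id to (term_vector, categories).
    Returns {frozenset_of_categories: summed term vector}.
    Summing member vectors is an assumption of this sketch.
    """
    groups = defaultdict(lambda: defaultdict(float))
    for vector, categories in training.values():
        rep = groups[frozenset(categories)]
        for term, weight in vector.items():
            rep[term] += weight
    return {labels: dict(rep) for labels, rep in groups.items()}


def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def classify(test_vec, gis_classes):
    """Return the category set of the most similar GIS representative."""
    best = max(gis_classes, key=lambda labels: cosine(test_vec, gis_classes[labels]))
    return set(best)


# Category sets from Table 4-11; the term vectors are invented for illustration.
training = {
    "D1": ({"query": 2.0, "index": 1.0}, {"A", "B", "C"}),
    "D2": ({"query": 1.0, "index": 2.0}, {"A", "B", "C"}),
    "D3": ({"query": 1.0, "cache": 2.0}, {"A", "B"}),
    "D4": ({"index": 1.0, "table": 1.0}, {"A", "B", "C"}),
    "D5": ({"cache": 3.0}, {"A"}),
    "D6": ({"table": 2.0, "lock": 1.0}, {"B", "C"}),
}
gis_classes = build_gis_classes(training)
print(len(gis_classes))  # 4 GIS classes for 6 training documents
```

Because the whole category set is the unit of classification, a test document can only receive a combination of categories that already occurs in the training corpus.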

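Table 4-12 compares the harmonic means of the three classifiers on Corpus 2 at different numbers of word groups. The kNN classifier uses far more representatives than either the GIS classifier or the vector space model classifier. With 600 word groups the vector space model classifier performs worse than the other two classifiers, but after term-weight correction its harmonic mean rises from 0.215 to 0.338, which is both the largest improvement and a higher harmonic mean than either of the other two classifiers achieves.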

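Comparing the three classifiers, the uncorrected harmonic means at 600 word groups in Table 4-12 show that the more documents each representative stands for, the more strongly the number of word groups affects the harmonic mean. In particular, once the number of word groups falls below a threshold the harmonic mean drops sharply, and it drops faster the more documents each representative stands for. The stronger this effect of the number of word groups, the more term-weight correction helps.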

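Classifier                   kNN              GIS              Vector space model
Number of representatives    7496             1636             25

Word groups                  Harmonic mean before / after term-weight correction
No clustering                0.321 / 0.316    0.339 / 0.349    0.345 / 0.376
2000 word groups             0.318 / 0.330    0.300 / 0.350    0.348 / 0.366
600 word groups              0.274 / 0.313    0.243 / 0.300    0.215 / 0.338

Table 4-12: Comparison of classifiers and word clustering (harmonic mean)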
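The gains from term-weight correction can be read off Table 4-12 directly; the short script below only redoes that arithmetic. The numbers are copied from the table, "VSM" abbreviates the vector space model classifier, and `correction_gain` is a hypothetical helper name.

```python
# Harmonic means from Table 4-12 as (before, after) term-weight correction.
table_4_12 = {
    "no clustering": {"kNN": (0.321, 0.316), "GIS": (0.339, 0.349), "VSM": (0.345, 0.376)},
    "2000 groups": {"kNN": (0.318, 0.330), "GIS": (0.300, 0.350), "VSM": (0.348, 0.366)},
    "600 groups": {"kNN": (0.274, 0.313), "GIS": (0.243, 0.300), "VSM": (0.215, 0.338)},
}


def correction_gain(before, after):
    """Absolute change in harmonic mean due to term-weight correction."""
    return round(after - before, 3)


for setting, row in table_4_12.items():
    gains = {name: correction_gain(*scores) for name, scores in row.items()}
    best_after = max(row, key=lambda name: row[name][1])
    print(setting, gains, "best after correction:", best_after)
```

At 600 word groups the vector space model classifier gains 0.123, against 0.057 for GIS and 0.039 for kNN, and it has the highest corrected harmonic mean in all three rows.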

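Table 4-13 shows the average running time per document, in seconds, for each classifier and number of word groups. With 600 word groups the GIS classifier and the vector space model classifier need less than half the time they need without clustering. The kNN classifier has to compare against so many representatives that word clustering saves little time: its running time falls by less than 5%.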

Word groups         kNN      GIS      Vector space model
No clustering       8.025    1.142    0.323
2000 word groups    7.853    0.905    0.259
600 word groups     7.68     0.463    0.151

Table 4-13: Running time per document by classifier and number of word groups (seconds)
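As a check on the time savings, the relative reductions can be recomputed from Table 4-13. The numbers are copied from the table, "VSM" abbreviates the vector space model classifier, and `reduction_percent` is a hypothetical helper name.

```python
# Seconds per document from Table 4-13.
seconds = {
    "kNN": {"none": 8.025, "2000": 7.853, "600": 7.68},
    "GIS": {"none": 1.142, "2000": 0.905, "600": 0.463},
    "VSM": {"none": 0.323, "2000": 0.259, "600": 0.151},
}


def reduction_percent(baseline, value):
    """Relative time saved with respect to the unclustered baseline, in percent."""
    return 100.0 * (baseline - value) / baseline


for name, times in seconds.items():
    saved = reduction_percent(times["none"], times["600"])
    print(f"{name}: {saved:.1f}% less time with 600 word groups")
```

The reductions come out at roughly 4% for kNN, 59% for GIS and 53% for the vector space model classifier.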

Chapter 5 Conclusions and Future Work

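This thesis addresses the classification of real US patent documents, with separate experiments on corpora labeled by main class and by subclass. Distributional Clustering effectively reduces the dimensionality of the term vector space for patent classification. On the corpus labeled by main class, the harmonic mean stays close to 0.78 for any number of word groups between 2995 and 1500, so the term dimensionality can be halved without lowering the harmonic mean. On the corpus labeled by subclass, reducing the number of word groups from 8000 to 2000 cuts the term dimensionality to a quarter without lowering the harmonic mean.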

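Beyond these ranges, choosing still fewer word groups makes the harmonic mean fall rapidly. We therefore revised the weighting formula used for classification to counter the rapid growth of tf values that too few word groups cause, which raises the harmonic mean when the number of word groups is small. After the weight correction, the corpus labeled by main class reaches a harmonic mean of 0.73 with 200 word groups, an improvement of 86%. The corpus labeled by subclass keeps a harmonic mean of at least 0.35 with 600 or more word groups, an improvement of 70%, while the running time is halved, which saves substantial system resources on a large patent corpus.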
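Why fewer word groups inflate tf can be seen in a toy example. The sketch assumes that the tf of a word group is the pooled count of all its member words, which this section does not state explicitly; the token list, the group assignments and the name `group_tf` are invented for illustration. The corrected weighting formula itself is not reproduced here.

```python
from collections import Counter


def group_tf(tokens, word_to_group):
    """Term frequency per word group: occurrences of all member words are pooled."""
    return Counter(word_to_group[t] for t in tokens if t in word_to_group)


tokens = "data base query index data table query data".split()

# Five groups: every word keeps its own group.
fine = {"data": "g1", "base": "g2", "query": "g3", "index": "g4", "table": "g5"}
# Two groups: the same words are merged more aggressively.
coarse = {"data": "g1", "base": "g1", "table": "g1", "query": "g2", "index": "g2"}

print(max(group_tf(tokens, fine).values()))    # largest tf with 5 groups
print(max(group_tf(tokens, coarse).values()))  # largest tf with 2 groups
```

In this toy document the largest tf grows from 3 to 5 when five groups are merged into two, even though the text itself has not changed.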

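On Corpus 2 we compared our classifier with the kNN and GIS classifiers at different numbers of word groups. Our classifier performs better than the other two without clustering and with 2000 word groups. With 600 word groups it performs worse than the other two, but after weight correction its harmonic mean is again the highest of the three. The more documents each representative stands for, the more strongly the number of word groups affects the harmonic mean; and the stronger that effect, the more term-weight correction helps.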

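While processing the patent documents we found that a small number carry category labels that do not exist. Among the 13882 patent documents in Corpus 2, 25 carry labels from the set {707/102.1, 707/103, 707/104, 707/107.1, 707/500, 707/501, 707/501.1, 707/513, 707/514, 707/516, 707/707}, none of which exists. This error rate is very low, but we cannot rule out other labeling errors, or differences caused by the subjective judgment of different experts in manual classification, either of which could affect the correctness of the classification results.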

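A practical patent classification system still needs many improvements. We have not examined how each field affects classification; adding fields that help classification (such as inventor or applicant company) as further evidence might improve accuracy or speed up classification. The terms in the patent text could be filtered or semantically expanded, for example by removing terms unrelated to the topic and adding related terms that do not appear in the document, to better establish how important each term is to the topic. Because the number of patent documents is very large, system efficiency matters all the more: running time, system resource usage and similar factors all determine how convenient the system will be to use.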

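Once patent documents can be classified correctly, we hope to adjust the category hierarchy automatically: when a new topic emerges, a new category is created and placed at the correct position in the hierarchy, and when a category holds too many documents, it is subdivided and its patent documents are redistributed, so that patent document classification becomes fully automatic.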

References

Baker, L. D., McCallum, A. K., Distributional clustering of words for text classification, Proceedings of SIGIR, pp. 96-103, 1998.

Belkin, N. J., Croft, W. B., Information filtering and information retrieval: two sides of the same coin?, Communications of the ACM, Vol. 35, No. 12, pp. 29-38, 1992.

Chakrabarti, S., Dom, B., Agrawal, R., Raghavan, P., Using taxonomy, discriminants, and signatures for navigating in text databases, In Proceedings of the 23rd International Conference on Very Large Data Bases (VLDB), pp. 446-455, 1997.

Chakrabarti, S., Dom, B., Agrawal, R., Raghavan, P., Scalable feature selection, classification and signature generation for organizing large text databases into hierarchical topic taxonomies, In Proceedings of the 23rd International Conference on Very Large Data Bases (VLDB), pp. 163-178, 1998.

Chen, W., Chang, X., Wang, H., Zhu, J., Yao, T., Automatic Word Clustering for Text Categorization Using Global Information, First Asia Information Retrieval Symposium (AIRS), pp. 1-6, 2004.

Deerwester, S. C., Dumais, S. T., Landauer, T. K., Furnas, G. W., Harshman, R. A., Indexing by latent semantic analysis, Journal of the American Society for Information Science, Vol. 41, No. 1, pp. 391-407, 1990.

Dhillon, I. S., Modha, D. S., Concept Decompositions for Large Sparse Text Data Using Clustering, Machine Learning, Vol. 42, No. 1, pp. 143-175, 2001.

Fung, B., Wang, K., Ester, M., Hierarchical Document Clustering Using Frequent Itemsets, Proceedings of the SIAM International Conference on Data Mining, pp. 59-70, 2003.

Hammouda, K., Kamel, M., Document similarity using a phrase indexing graph model, Knowledge and Information Systems, Vol. 6, No. 6, pp. 710-727, 2004.

Jing, L., Huang, H., Shi, H., Improved Feature Selection Approach TFIDF in Text Mining, Proceedings of the International Conference on Machine Learning and Cybernetics, Vol. 2, pp. 944-946, Beijing, 2002.

Kang, B. Y., Lee, S. J., Document indexing: a concept-based approach to term weight estimation, Information Processing and Management, Vol. 41, No. 5, pp. 1065-1080, 2005.

Kin, J. H., Huang, J. X., Jung, H. Y., Choi, K. S., Patent Document Retrieval and Classification at KAIST, In Proceedings of NII-NACSIS Test Collection for IR Systems Workshop, pp. 6-9, 2005.

Koster, C. H. A., Seutter, M., Beney, J., Classifying Patent Applications with Winnow, Proceedings of the Benelearn Conference, pp. 19-26, 2002.

Lam, W., Using a generalized instance set for automatic text categorization, In Proceedings of the 21st annual international ACM SIGIR, pp. 81-89, 1998.

Larkey, L., Some issues in the automatic classification of U.S. patents, In Working Notes for the AAAI-98 Workshop on Learning for Text Categorization, 1998.

Larkey, L. S., A patent search and classification system, In the Fourth ACM Conference on Digital Libraries, pp. 79-87, 1999.

Lavelli, A., Sebastiani, F., Zanoli, R., Distributional Term Representations: An Experimental Comparison, In Proceedings of the Thirteenth ACM Conference on Information and Knowledge Management, pp. 615-624, 2004.

Lertnattee, V., Theeramunkong, T., Multidimensional Text Classification for Drug Information, IEEE Transactions on Information Technology in Biomedicine, Vol. 8, No. 3, pp. 306-312, 2004.

Lin, D., Using syntactic dependency as local context to resolve word sense ambiguity, In Proceedings of ACL/EACL-97, pp. 64–71, 1997.

Ma, L., Chen, Q., Cai, L., An Adaptive System for Online Document Filtering, IEEE International Conference, Vol. 5, pp. 4712-4717, 2003.

Mandhani, B., Joshi, S., Kummamuru, K., A Matrix Density Based Algorithm to Hierarchically Co-Cluster Documents and Words, Proceedings of the international conference on World Wide Web, pp. 511-518, 2003.

Miller, G. A., Beckwith, R., Fellbaum, C., Gross, D., Miller, K. J., WordNet: An On-line Lexical Database, International Journal of Lexicography, Vol. 3, No. 4, 1990.

Pereira, F., Tishby, N., Lee, L., Distributional clustering of English words, Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics, pp. 183-190, 1993.

Richter, G., MacFarlane, A., The impact of metadata on the accuracy of automated patent classification, World Patent Information, Vol. 27, pp. 13-26, 2005.

Shah, C., Chowdhary, B., Bhattacharyya, P., Constructing Better Document Vectors Using Universal Networking Language (UNL), Proceedings of International Conference on Knowledge-Based Computer Systems (KBCS), 2002.

Wang, W., Do, D. B., Lin, X., Term Graph Model for Text Classification, Advanced Data Mining and Applications, pp. 19-30, 2005.

Wu, C., Liu, C. L., An exploration of topic-dependent feature-weighting for summary extraction, Proceedings of the 2003 National Computer Symposium (NCS), Taiwan, pp. 18-19, 2003.

Zhang, Y., Heywood, N. Z., Milios, E., Narrative Text Classification for Automatic Key Phrase Extraction in Web Document Corpora, Proceedings of the 7th annual ACM international workshop on Web information and data management (WIDM), pp. 51-58, 2005.

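李駿翔, A Study on Applying Data Mining Classification Techniques to Patent Analysis (應用資料探勘分類技術於專利分析之研究), Master's thesis, Graduate Institute of Information Management, Chung Yuan Christian University, 2003.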

Appendix 1: Example US Patent Document and Field Descriptions

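Patent Number: the number assigned to each patent document.

Date of Patent: the date on which the patent application was granted.

Title: a short, precise title that matches the subject of the invention.

Abstract (or Abstract of the Disclosure): a concise description of the technical content of the invention, limited to 150 words.

Inventors: the names of the inventors.

Assignee: the name of the party that holds the patent rights.

Filing Date: the date on which the patent application was filed.

US classification (UPC): classification codes under the UPC (United States Patent Classification) standard, which classifies by function in fine detail and is the most fine-grained patent classification standard in the world.

International classification (IPC): classification codes under the IPC (International Patent Classification) standard.

References Cited: the references that the patent cites.

Claims: a precise definition of the scope of the invention for which the inventor seeks protection.

Detailed Description: a detailed description of embodiments of the invention.

Background of the Invention: a description of the field of the invention and the background art, covering the related technical field, the shortcomings of current technology and the problems to be solved.

Brief Description of the Drawings: a brief description of the content of the attached drawings.

Summary of the Invention: an overview of the invention or a description of the content of the claims.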
