
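To address the shortcomings of the three existing classes of web page block extraction methods, and to handle unstructured web pages, this thesis proposes a game-based topic block extraction method. The method converts a web page into HTML DOM nodes, computes the information content of each component from its position and characteristics, and then uses a game-theoretic procedure to decide whether to form a topic block. Topic blocks with similar content are presented together and converted into structured data that is easy to store, retrieve, and analyze, which facilitates later applications.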
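The first two steps described above (turning a page into a tree of DOM nodes and attaching a score to each node) can be illustrated with a minimal sketch. The names `Node`, `TreeBuilder`, `VOID_TAGS`, and `score` are hypothetical, and the score used here is simply the amount of text under a node. That is only a stand-in: the information content in this thesis also depends on the position and characteristics of each component, which this sketch does not model.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so the builder must not descend into them.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class Node:
    """A simplified DOM element node (hypothetical, for illustration only)."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text_len = 0  # characters of text directly inside this element

    def score(self):
        # Stand-in for information content: total text length in the subtree.
        return self.text_len + sum(child.score() for child in self.children)


class TreeBuilder(HTMLParser):
    """Builds a tree of Node objects from HTML source."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.text_len += len(data.strip())


builder = TreeBuilder()
builder.feed("<div><p>Hello world</p><p>Hi</p><br></div>")
builder.close()
for child in builder.root.children[0].children:
    print(child.tag, child.score())
```

A real scoring function would replace `score` with a measure that weights each node by where it appears on the page and what kind of component it is.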

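The experimental results show that GRAB performs well, particularly in the news domain. Comparison with existing methods also shows that GRAB performs better and handles unstructured web pages effectively.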

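The GRAB algorithm proposed in this thesis is based on the DOM node types defined by the W3C and extends them with two additional node types, which serve as the basis for computing information content. By combining eye-tracking findings, the characteristics of web page components, and game theory, the information content calculation better matches how readers actually read web pages. The game can change the original decision for some topic blocks and find a decision that satisfies both players. It also reduces the number of extracted topic blocks, and therefore the number of topic blocks with insufficient topical focus.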
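The idea of "a decision that satisfies both players" can be illustrated with a pure-strategy Nash equilibrium search over a small two-player game. This is a generic sketch, not the payoff model used by GRAB: the function name `pure_nash_equilibria`, the strategy labels `merge` and `split`, and all payoff values are assumptions chosen only for illustration.

```python
from itertools import product


def pure_nash_equilibria(payoffs):
    """Return all pure-strategy Nash equilibria of a two-player game.

    payoffs maps (strategy_1, strategy_2) to (utility_1, utility_2).
    """
    s1_options = sorted({s1 for s1, _ in payoffs})
    s2_options = sorted({s2 for _, s2 in payoffs})
    equilibria = []
    for s1, s2 in product(s1_options, s2_options):
        u1, u2 = payoffs[(s1, s2)]
        # Player 1 cannot gain by deviating while player 2 keeps s2.
        p1_content = all(u1 >= payoffs[(alt, s2)][0] for alt in s1_options)
        # Player 2 cannot gain by deviating while player 1 keeps s1.
        p2_content = all(u2 >= payoffs[(s1, alt)][1] for alt in s2_options)
        if p1_content and p2_content:
            equilibria.append((s1, s2))
    return equilibria


# Hypothetical payoffs: two neighbouring nodes each decide whether to
# merge into one topic block or stay separate.
example_payoffs = {
    ("merge", "merge"): (3, 3),
    ("merge", "split"): (0, 2),
    ("split", "merge"): (2, 0),
    ("split", "split"): (1, 1),
}

print(pure_nash_equilibria(example_payoffs))
```

With these illustrative payoffs, both nodes merging and both nodes staying separate are stable outcomes, because neither player can improve by changing its choice alone.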

Directions for future research are summarized as follows:

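(1) Broader processing capability

The GRAB algorithm currently works well on HTML web pages, but it is less effective on pages that rely heavily on CSS, JavaScript, or <DIV> tags, or that are written in Flash. Because more and more web pages are built with CSS, JavaScript, and <DIV> tags, the handling of such pages can be strengthened in future work.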

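(2) Integration with the Semantic Web

Beyond accurately extracting the topic data a user is interested in, we hope to establish the relationships among topics, so that semantic network modeling can be used to identify related topics the user may also find interesting and to provide more efficient methods for data extraction and mining.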

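(3) Semantic similarity computation

Topic block similarity is currently computed with LCS, which considers only the text. A semantic similarity measure could be designed in the future so that topic block trees are merged more accurately.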
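For reference, a text-only LCS comparison of the kind mentioned above can be sketched as follows, assuming LCS refers to the longest common subsequence. The function names and the normalization by the longer string's length are assumptions made for illustration. The thesis does not state here how the LCS length is turned into a similarity score.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]; the normalization is an assumption for this sketch."""
    if not a or not b:
        return 0.0
    return lcs_length(a, b) / max(len(a), len(b))


print(lcs_similarity("stock market news", "stock market report"))
```

Because this measure compares characters only, two blocks that discuss the same topic in different words score low, which is the limitation a semantic similarity measure would address.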

