Conclusions and Recommendations - A Chinese Plagiarism Detection System Based on the Google Search Engine

5.1 Conclusions

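This study uses the Google AJAX Search API to build a Chinese plagiarism detection system. The Google search engine serves as the first-stage similarity filter, automatically detecting whether a user-uploaded document is suspected of being copied from the web. Results are presented through a visual interface so that users can review the system's findings manually.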

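The system first splits the user-uploaded article into sentences with CKIP and submits them one by one to the Google AJAX Search API. It then parses the JSON-encoded response and takes the snippet of each result for a second-stage LCS similarity comparison. If the similarity exceeds the threshold set in this study, the result is displayed in the system for subsequent …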

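… factors affecting accuracy include English text within the article, Google PageRank, and encoding problems; these are directions for future improvement.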
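The second-stage LCS comparison described above can be sketched as follows. This is a minimal illustration rather than the study's implementation: it assumes a character-level longest common subsequence, normalizes by the length of the query sentence, and uses a hypothetical threshold of 0.8, since the source does not state the normalization or the threshold value.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b, character by character."""
    prev = [0] * (len(b) + 1)          # DP row for the previous character of a
    for ch_a in a:
        curr = [0]                     # column 0: empty prefix of b
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_similarity(sentence: str, snippet: str) -> float:
    """Share of the sentence covered by the LCS (assumed normalization)."""
    if not sentence:
        return 0.0
    return lcs_length(sentence, snippet) / len(sentence)


THRESHOLD = 0.8  # hypothetical value, not taken from the study


def is_suspicious(sentence: str, snippet: str, threshold: float = THRESHOLD) -> bool:
    """Flag a sentence whose similarity to a search snippet reaches the threshold."""
    return lcs_similarity(sentence, snippet) >= threshold
```

Working at the character level avoids a second word-segmentation pass on the snippet, which is a convenient choice for Chinese text, although a word-level variant would follow the same recurrence.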

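The main purpose of a plagiarism detection system is to deter those who would plagiarize, so that authors of documents and articles receive the intellectual property protection they are due. It also lets students check their own work, learn which writing practices may infringe on others' rights, and become more vigilant. Plagiarism detection systems keep improving, and commercial products such as Turnitin [20] and Docol©c [7] detect plagiarism increasingly well. A Traditional Chinese version of such a system was even introduced and sold in Taiwan in September 2009, and many universities are preparing to adopt it in their teaching. Nevertheless, monitoring alone cannot completely eliminate plagiarism, and its detection …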

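… in the future, the method proposed by Sun et al. [18] could be applied. Based on the structure of Chinese characters themselves, it converts the positional relationships within a character into operators and the character's structural components into operands of a mathematical expression, which would reduce errors when analyzing Chinese character encodings and raise the system's detection accuracy.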

III. Google Search Engine Parameter Tuning

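Using the query features of the Google search engine, each sentence that CKIP splits from the uploaded article can be wrapped in double quotes before it is sent to Google, so that only exact-match results are returned. Partially similar results are then no longer returned, but the false alarms caused by partial similarity are reduced.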

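This study takes only the first four results returned by the Google search engine. Future work could retrieve more results to mitigate the problems caused by Google PageRank.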
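The two adjustments above, exact-phrase quoting and the number of results kept, can be sketched as below. The snippet makes no network call and works on an already parsed JSON response. The field names `responseData`, `results`, and `content` are assumptions about the response layout, and the function names are hypothetical; the source only says that the API returns JSON-encoded results from which the snippet is taken.

```python
from typing import Any, Dict, List

MAX_RESULTS = 4  # the study keeps the first four results; raise this to retrieve more


def exact_phrase_query(sentence: str) -> str:
    """Wrap a sentence in double quotes so the engine performs an exact-phrase search."""
    cleaned = sentence.strip().replace('"', "")  # drop embedded quotes that would break the phrase
    return f'"{cleaned}"'


def top_snippets(response: Dict[str, Any], limit: int = MAX_RESULTS) -> List[str]:
    """Return the snippet text of the first `limit` results (field names are assumed)."""
    data = response.get("responseData") or {}
    results = data.get("results") or []
    return [result.get("content", "") for result in results[:limit]]
```

Keeping the limit as a parameter makes it easy to test how retrieving more results changes the detection outcome, which is the improvement suggested above.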

References

[1] Apache POI, http://poi.apache.org/.

[2] Broder, A. Z., Glassman, S. C., Manasse, M. S., and Zweig, G. (1997). Syntactic Clustering of the Web. Computer Networks and ISDN Systems, vol. 29, no. 8, 1157 – 1166.

[3] Brin, S., Davis, J., and Garcia-Molina, H. (1995). Copy Detection Mechanisms for Digital Documents. ACM SIGMOD Record, vol. 24, no. 2, 398 – 409.

[4] Chowdhury, A., Frieder, O., Grossman, D., and McCabe, M. C. (2002). Collection Statistics for Fast Duplicate Document Detection. ACM Transactions on Information Systems, vol. 20, no. 2, 171 – 191.

[5] CNN.com, http://edition.cnn.com/.

[6] Cormen, T. H., Leiserson, C. E., and Rivest, R. L. (1989). Introduction to Algorithms. The MIT Press.

[7] Docol©c, http://www.docoloc.de/.

[8] Diederich, J. (2006). Computational Methods to Detect Plagiarism in Assessment. Information Technology Based Higher Education and Training, pp. 147–154. Sydney, Australia.

[9] Google AJAX Search API, http://code.google.com/intl/en/apis/ajax/.

[10] Lin, C.-Y. (2004). ROUGE: A Package for Automatic Evaluation of Summaries. In Proceedings of Workshop on Text Summarization Branches Out, Post-Conference Workshop of ACL 2004, pp. 74–81. Barcelona, Spain.

[11] Manber, U. (1994). Finding Similar Files in a Large File System. In Proceedings of the USENIX Winter 1994 Technical Conference, pp. 2-2. San Francisco, California.

[12] McCuen, R. H. (2008). The Plagiarism Decision Process: The Role of Pressure and Rationalization. IEEE Transactions on Education, vol. 51, no. 2, 152–156.

[13] Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. (2002). BLEU: A Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp. 311–318. Philadelphia, USA.

[14] Plagiarism.org, http://www.plagiarism.org/.

[15] Rabin, M. O. (1981). Fingerprinting by Random Polynomials. Center for Research in Computing Technology, Harvard University, Report TR-15-81.

[16] Shivakumar, N., and Garcia-Molina, H. (1995). SCAM: A Copy Detection Mechanism for Digital Documents. In Proceedings of the Second International Conference in Theory and Practice of Digital Libraries, Austin, Texas.

[17] Stein, B., and Meyer Zu Eissen, S. (2006). Near Similarity Search and Plagiarism Analysis. Data and Information Analysis to Knowledge Engineering, vol. 10, 430 – 437.

[18] Sun, X. M., Chen, H. W., Yang, L. H., and Tang, Y. Y. (2002). Mathematical Representation of a Chinese Character and its Applications. International Journal of Pattern Recognition and Artificial Intelligence, pp. 735–747.

[19] Stepchyshyn, V., and Nelson, R. S. (2007). Library Plagiarism Policies. Association of College and Research Libraries, p. 65.

[20] TurnItIn, http://www.turnitin.com/.

[21] High School Students Website (中學生網站), http://www.shs.edu.tw/essay/.

[22] CKIP Chinese Word Segmentation System (中文斷詞系統), http://ckipsvr.iis.sinica.edu.tw/.

[23] 王偉全 (2006). "Document Plagiarism Detection." Master's thesis, Graduate Institute of Information Management, Yuan Ze University.

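[24] 陳建穎 (2009). "N-gram Co-occurrence Based on ROUGE and WordNet for Plagiarism Detection." Master's thesis, Institute of Information Management, National Chiao Tung University.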

[25] Institute for Information Industry (資策會). Internet Population in Taiwan as of the End of December 2009, http://www.find.org.tw/find/home.aspx?page=many&id=219.

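[26] 劉奕廷 (2007). "Evaluating Plagiarism Patterns with Search Engines." Master's thesis, Department of Engineering Science, National Cheng Kung University.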
