Contemporary technology has made plagiarism easy to commit but hard to restrain. Since deliberate plagiarism will always occur, plagiarism detection is necessary to deter potential plagiarists, who are afraid of being caught. This research proposed an implementation of ROUGE, which was previously applied to the automatic evaluation of summaries and of machine translation (n-gram co-occurrence statistics). WordNet was integrated with ROUGE in order to handle as many forms of plagiarism as possible. Google was included in this research to test the reliability of obtaining mutual information between two words from the Internet.
Although the system is still a prototype with only a few fundamental functions, it serves as a start in developing a comprehensive tool that will help fight plagiarism. The system could also be further developed for educational purposes by adding warning messages and explanations of the detection results. With explanations tailored to the type of plagiarism and illustrated by real examples, users (including students) can better understand plagiarism and learn how to avoid it.
The analysis of the experimental results shows that the proposed methods have research value in the field of plagiarism detection. Each method has its strengths in dealing with certain types of plagiarism and its weaknesses in other situations. ROUGE is capable of detecting verbatim copying, and its efficiency is acceptable when comparing two complete documents. Unigram matching is not bound by the “in-sequence” constraint that applies to LCS and the other n-grams, but as a result it may be prone to false positives. The other n-grams are stricter in matching tokens and therefore have higher precision. LCS and skip-bigram take a middle ground, because both allow skips when scanning through a sentence but require matching tokens to be in sequence. WordNet extends the capability of the system by digging into the semantic aspect of words, so that matching is based not only on exact matches but also on the meanings of the tokens.
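The contrast between these measures can be made concrete with a small sketch. The code below is not the system's implementation: it is a simplified, recall-only version of the ROUGE-style measures, with a plain whitespace tokenizer standing in for real preprocessing, and all function names are hypothetical. It compares an original sentence with a copy whose two halves have been swapped.

```python
from collections import Counter
from itertools import combinations


def tokens(sentence):
    """Lowercase whitespace tokenization (an assumed stand-in for a real tokenizer)."""
    return sentence.lower().split()


def ngrams(toks, n):
    """All contiguous n-grams of a token list."""
    return [tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)]


def ngram_recall(suspect, original, n):
    """Fraction of the original's n-grams that also occur in the suspect."""
    orig = Counter(ngrams(tokens(original), n))
    susp = Counter(ngrams(tokens(suspect), n))
    total = sum(orig.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, susp[gram]) for gram, count in orig.items())
    return overlap / total


def lcs_length(a, b):
    """Length of the longest common subsequence (skips allowed, order kept)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, 1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_recall(suspect, original):
    """LCS length divided by the length of the original."""
    orig = tokens(original)
    if not orig:
        return 0.0
    return lcs_length(tokens(suspect), orig) / len(orig)


def skip_bigram_recall(suspect, original):
    """Fraction of the original's in-order word pairs (any gap) found in the suspect."""
    orig = Counter(combinations(tokens(original), 2))
    susp = Counter(combinations(tokens(suspect), 2))
    total = sum(orig.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, susp[pair]) for pair, count in orig.items())
    return overlap / total


original = "the cat sat on the mat"
reordered = "on the mat the cat sat"  # same words, halves swapped

print(ngram_recall(reordered, original, 1))   # 1.0: unigrams ignore word order
print(ngram_recall(reordered, original, 2))   # 0.8: one bigram is broken by the swap
print(lcs_recall(reordered, original))        # 0.5: only one half stays in sequence
print(skip_bigram_recall(reordered, original))
```

In this example the unigram score cannot distinguish the reordered sentence from a verbatim copy, which is the source of its false positives, while LCS and skip-bigram tolerate gaps but are penalized by the change in order.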
Because the system is a prototype, there is room for improvement, and several areas are open for future work. First, the performance of the proposed methods and the recommended settings can be further confirmed by running tests with a larger and more diverse corpus. The corpora used in this research are relatively small compared to those in other research, especially in the number of true positives in the abstract set, and the results of the experiments could be affected by the nature of the corpora. However, a valid and compatible plagiarism corpus can be hard to find. Building a corpus is an option, but it would require a lot of time and effort. Second, the proposed methods were tested and evaluated separately. Since each method has its strengths and weaknesses, a particular combination of the methods may produce better results than any individual method. To combine different methods, a weighting scheme should be developed so that the score of each method contributes in the right proportion and the final score accurately represents the methods involved. Third, to overcome the problem of an original sentence being split into several sentences, or several original sentences being merged into one, chunk comparison may be worth attempting.
The inclusion of neighboring sentences and the comparison of these sentences as two chunks should be able to close this loophole. However, irrelevant neighboring sentences may lower the similarity between two chunks and result in a false negative. Applying the n-gram idea to chunk comparison can make the method more robust. New chunks can be formed from consecutive in-sequence sentences. Such
Meanwhile, the efficiency of the WordNet-based methods and Google MI may be improved by continually increasing the size of the databases in which the information between two words is stored.
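Two of the suggestions above, chunk comparison and a weighting scheme, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the unigram F-measure used as the similarity score, the function names, the maximum chunk size, and the example weights are hypothetical and are not taken from the prototype.

```python
from collections import Counter


def unigram_f1(suspect, original):
    """Unigram F-measure between two texts (assumed similarity score)."""
    s = Counter(suspect.lower().split())
    o = Counter(original.lower().split())
    overlap = sum((s & o).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(s.values())
    recall = overlap / sum(o.values())
    return 2 * precision * recall / (precision + recall)


def sentence_chunks(sentences, size):
    """Every run of `size` consecutive sentences, joined into one chunk."""
    return [" ".join(sentences[i:i + size])
            for i in range(len(sentences) - size + 1)]


def best_chunk_score(suspect_sentences, original, max_size=3):
    """Best score of any chunk of up to `max_size` consecutive suspect sentences."""
    best = 0.0
    for size in range(1, max_size + 1):
        for chunk in sentence_chunks(suspect_sentences, size):
            best = max(best, unigram_f1(chunk, original))
    return best


def combine_scores(scores, weights):
    """Weighted average of per-method scores; the weights are hypothetical."""
    total_weight = sum(weights[name] for name in scores)
    return sum(scores[name] * weights[name] for name in scores) / total_weight


# One original sentence that the suspect text splits into two sentences,
# followed by an unrelated sentence.
original = "the quick brown fox jumps over the lazy dog near the river bank"
suspect = [
    "the quick brown fox jumps",
    "over the lazy dog near the river bank",
    "it was raining all day",
]

print(best_chunk_score(suspect, original, max_size=1))  # sentence-by-sentence only
print(best_chunk_score(suspect, original, max_size=2))  # two-sentence chunks allowed
print(unigram_f1(" ".join(suspect), original))          # irrelevant sentence included
print(combine_scores({"unigram": 1.0, "lcs": 0.5}, {"unigram": 1.0, "lcs": 3.0}))
```

With sentence-by-sentence comparison neither half of the split sentence matches the original fully, whereas the two-sentence chunk reaches a score of 1.0. Including the unrelated third sentence lowers the score to about 0.84, which shows how irrelevant neighbors can push a true match toward a false negative.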
The above are some suggestions for future work. Hopefully this initial attempt at applying ROUGE with WordNet, and any subsequent research, will be of help in the field of plagiarism detection.