Data Sets - Experiments and Evaluation - N-gram Co-occurrence Based on ROUGE and WordNet for Plagiarism Detection

4. Experiments and Evaluation

4.1 Data Sets

In the field of plagiarism detection, no standard, validated plagiarism corpus is publicly available yet. A number of works evaluated on news corpora such as the Reuters News corpus, while a smaller number used corpora of research articles that are managed by a university and are therefore accessible only to its members. The remaining choice is to build one’s own plagiarism corpus, which is usually relatively small because of limited resources. This research adopted the last approach rather than using a news corpus: although news content is often reused, modifications of that kind may not be representative of plagiarism.

Two manually generated data sets were used to evaluate the proposed methods. One contains 978 pairs of sentences and the other contains 100 pairs. They are referred to hereafter as the abstract set and the paraphrased set, respectively. The abstract set was used primarily to determine the ideal settings for the methods. It was based on the observation that the abstracts of some papers are actually formed from sentences taken from the main text.

Such a characteristic may be used to simulate a plagiarism scenario by treating the abstract as the candidate of plagiarism and the main text as the source being plagiarized. First, a collection of research papers was retrieved from research databases such as Elsevier and EBSCOhost using the query “plagiarism”. Second, the abstract and the main text of each paper were separated and saved in two different plain text files. Some manual effort was required to remove undesirable text that appeared in certain parts of the papers, as it interfered with the main body of the text and might have affected the outcome of the experiment. Third, each abstract sentence was compared with the main text using six different methods, namely n-grams (unigram to 4-gram), LCS and skip-bigram. The top five matches for each method were recorded regardless of their scores. In other words, each abstract sentence produced 30 pairs of sentences with the main text; however, if a pair was repeated within the 30 pairs, only one copy was included in the abstract corpus. In the end, there were 1000 unique pairs of sentences from 19 research papers. Fourth, four people who had been educated about plagiarism and understood the concept were asked to annotate the abstract corpus. The sentence pairs were randomly divided into four groups, and each person was given 500 pairs so that each pair would be annotated by two different people. After the annotation was completed, the kappa statistic [15] was applied to check the reliability and validity of the annotation by measuring the agreement between the annotators. The groups had a minimum kappa score of 0.696 and a maximum of 0.863, with two groups falling into the substantial range and the other two into the almost perfect range. Table 1 below serves as a general reference for kappa scores:

Table 1 Kappa Statistics [16]

Kappa        Strength of agreement
0.00         Poor
0.01-0.20    Slight
0.21-0.40    Fair
0.41-0.60    Moderate
0.61-0.80    Substantial
0.81-1.00    Almost perfect
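The agreement check can be illustrated with a minimal sketch. It assumes Cohen's kappa for two annotators labeling the same pairs; the text does not state which kappa variant [15] was used, so the formula choice, the function names `cohen_kappa` and `strength_of_agreement`, and the treatment of negative scores as "Poor" are all assumptions made for illustration. The thresholds are taken from Table 1.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items.

    Assumption: the source does not say which kappa variant was used.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label proportions.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        # Both annotators used one identical label throughout; kappa is
        # undefined here, and returning 1.0 is a convention of this sketch.
        return 1.0
    return (observed - expected) / (1.0 - expected)


def strength_of_agreement(kappa):
    """Map a kappa score to the labels of Table 1.

    Assumption: scores at or below 0.00 are all reported as "Poor".
    """
    if kappa <= 0.0:
        return "Poor"
    bands = [(0.20, "Slight"), (0.40, "Fair"),
             (0.60, "Moderate"), (0.80, "Substantial")]
    for upper, label in bands:
        if kappa <= upper:
            return label
    return "Almost perfect"


# Toy labels (1 = plagiarism, 0 = not plagiarism); not data from this study.
annotator_1 = [1, 1, 0, 0]
annotator_2 = [1, 0, 0, 0]
print(strength_of_agreement(cohen_kappa(annotator_1, annotator_2)))
```

Applied to the scores reported above, `strength_of_agreement(0.696)` falls in the substantial band and `strength_of_agreement(0.863)` in the almost perfect band, which matches the description of the four groups.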

Sentence pairs that the annotators labeled differently were removed, leaving 978 pairs of sentences, of which only 32 were annotated as candidates of plagiarism (see Appendix 3 for the 32 pairs). The number of positive plagiarism examples in the abstract corpus was rather disappointing, and further analysis showed that the majority of the 32 pairs came from seven papers. This observation indicates that some authors do use similar or identical sentences from the main text in their abstracts. The small number of valid plagiarizing pairs resulted from the fact that the research papers were initially selected at random. Figure 19 is a pie chart showing the share of each plagiarism type among the 32 pairs. The plagiarism types are defined as follows: Complete Verbatim (Table 2): the two sentences are exactly the same; Substantial Verbatim (Table 3): the two sentences are highly similar and differ by only a few words; Lifted Sentences (Table 4): the candidate sentence copies one or more phrases from the reference sentence; Paraphrased but Same Key Words (Table 5): the candidate sentence is rather different from the reference sentence but contains the same key words.

Figure 19 Statistics of Each Plagiarism Type in the Abstract Set

Table 2 Example of Verbatim Copy

Candidate sentence and reference sentence (identical): this article draws on the poststructuralist theory of consumption developed by michel de certeau to consider plagiarism as a tactic deployed by consumers in their attempts to negotiate the demands of an increasingly commodified tertiary education sector

Table 3 Example of Substantial Verbatim

Candidate sentence: it is also concluded that there is a growing need for uk institutions to develop cohesive frameworks for dealing with student plagiarism that are based on prevention supported by robust detection and penalty systems that are transparent and applied consistently

Reference sentence: there is a growing need for uk institutions to develop cohesive frameworks for dealing with student plagiarism that are based on prevention supported by robust detection and penalty systems that are transparent and applied consistently

Table 4 Example of Lifted Sentences

Candidate sentence: this paper reviews the literature on plagiarism by students much of it based on north american experience to discover what lessons it holds for institutional policy and practice within institutions of higher education in the uk

Reference sentence: conclusion there is an extensive literature on plagiarism by students particularly in the context of north america experience but it clearly holds important lessons for institutional policy and practice within institutions of higher education in the uk

Table 5 Example of Paraphrased but Same Key Words

Candidate sentence: those who plagiarized least incorporated direct quotations more effectively used fewer quotations and synthesized information and ideas better than did the others

Reference sentence: the two students who plagiarized least used minimal quotations see table 1 and used them effectively capably synthesizing their information and ideas a challenge in a task that required primarily reporting of information
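To make the construction of the abstract set more concrete, the third step (top five matches under each of the six measures, with repeated pairs merged) can be sketched as follows. This is only an illustration under stated assumptions: whitespace tokenization, recall-style scores normalized by the reference (main text) sentence, and the names `overlap`, `METHODS` and `candidate_pairs` are not taken from the original work, whose exact scoring formulas may differ.

```python
from collections import Counter
from itertools import combinations


def ngrams(tokens, n):
    """Contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def skip_bigrams(tokens):
    """All in-order token pairs, with any gap allowed between them."""
    return list(combinations(tokens, 2))


def overlap(cand_units, ref_units):
    """Clipped co-occurrence count divided by the reference unit count.

    Assumption: a recall-style normalization in the spirit of ROUGE.
    """
    if not ref_units:
        return 0.0
    common = Counter(cand_units) & Counter(ref_units)
    return sum(common.values()) / len(ref_units)


def lcs_length(a, b):
    """Length of the longest common subsequence (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def ngram_score(n):
    return lambda cand, ref: overlap(ngrams(cand, n), ngrams(ref, n))


# The six measures named in the text: unigram to 4-gram, LCS, skip-bigram.
METHODS = {f"{n}-gram": ngram_score(n) for n in range(1, 5)}
METHODS["lcs"] = lambda cand, ref: lcs_length(cand, ref) / len(ref) if ref else 0.0
METHODS["skip-bigram"] = lambda cand, ref: overlap(skip_bigrams(cand), skip_bigrams(ref))


def candidate_pairs(abstract_sentences, main_sentences, top_k=5):
    """Unique (abstract index, main-text index) pairs from the top-k
    matches of every method, kept regardless of the score values."""
    main_tokens = [s.lower().split() for s in main_sentences]
    pairs = set()  # a set merges pairs repeated across methods
    for a_idx, sentence in enumerate(abstract_sentences):
        a_tokens = sentence.lower().split()
        for score in METHODS.values():
            ranked = sorted(range(len(main_tokens)),
                            key=lambda m_idx: score(a_tokens, main_tokens[m_idx]),
                            reverse=True)
            pairs.update((a_idx, m_idx) for m_idx in ranked[:top_k])
    return pairs
```

With six methods and `top_k=5`, each abstract sentence contributes at most 30 pairs, and fewer whenever several methods rank the same main-text sentence among their top five.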

The initial motivation for generating the paraphrased set was to add more plagiarizing examples. One possible source is the Internet, where websites devoted to the topic of plagiarism can be found. Such websites usually provide plagiarism examples as short passages a few sentences long. Hence, the query “plagiarism examples” was sent to Google, and only paraphrased plagiarism examples were retrieved manually. The paraphrased set mainly consists of plagiarism types such as addition, deletion or substitution of words in the original content, change of sentence structure, and partial verbatim copy. In total, more than 30 plagiarism examples were retrieved. All of them came from the first 10 pages of the search results, because repeated examples appeared after the first few pages and the relevance of the results began to decrease. The plagiarizing sentences were paired with the corresponding original sentences manually; therefore each pair was a valid example of plagiarism.
