Sentence Level Plagiarism Detection - Detecting Web Plagiarism Using Lexical, Syntactic, and Semantic Information

Chapter 3 Methodology

3.2 Sentence Level Plagiarism Detection

Since not all results returned by a search engine contain an exact copy of the query, we need a model to quantify how likely each of them is to be the source of plagiarism. For efficiency, our experiments use the snippet that Google returns for each search result to represent the whole document. That is, we want to measure how likely a snippet is to be the plagiarized source of the query. We designed several models that use rich lexical, syntactic, and semantic features to pursue this goal; the details are discussed below.

3.2.1 Ngram Matching (NM)

One straightforward measure is the ngram similarity between the source and target text. Given two sentences, S (source) and T (target), we first enumerate all ngrams in S and then count how many of them also occur in T. The ngram similarity can be measured with three different formulas, given in (1), (2), and (3). Note that our matching is based on stemmed ngrams.

It seems that NM_avg is a better fit for our needs. However, when the lengths of the source and target are imbalanced, NM_avg alone cannot reflect the degree of plagiarism very well. We deal with this issue by considering both NM_S and NM_T with a predefined threshold TH = 0.5. If min(NM_S, NM_T) > TH, the NM score is defined as max(NM_S, NM_T); otherwise it is NM_avg.

As for the choice of n, the larger n is, the harder it is for this feature to detect plagiarism that involves insertion, replacement, or deletion. In the experiments of ([4], Cedeno and Rosso, 2009) on the METER¹² corpus, the best results were obtained with low-level word ngram comparisons (n = {2, 3}). In our experiments on the sampled PAN-2010 corpus in English and the annotated Web document dataset in Chinese, both of which are introduced in Section 4.1, we chose n = 2.
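The threshold rule above can be sketched in Python as follows. Formulas (1) to (3) are not reproduced in this section, so the sketch assumes that NM_S and NM_T are the number of shared ngrams divided by the number of ngrams in S and in T respectively, and that NM_avg is their mean. The function names `ngrams` and `nm_score` are hypothetical, and the tokens are assumed to be stemmed beforehand.

```python
from collections import Counter


def ngrams(tokens, n=2):
    """Return the list of word ngrams in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def nm_score(source_tokens, target_tokens, n=2, th=0.5):
    """Combine NM_S, NM_T and NM_avg with the threshold rule from the text.

    ASSUMPTION: NM_S and NM_T are the shared ngram count divided by the
    number of ngrams in S and T respectively, and NM_avg is their mean.
    Tokens are assumed to be stemmed already.
    """
    s_grams = Counter(ngrams(source_tokens, n))
    t_grams = Counter(ngrams(target_tokens, n))
    if not s_grams or not t_grams:
        return 0.0  # a sentence shorter than n words has no ngrams
    shared = sum((s_grams & t_grams).values())  # multiset intersection
    nm_s = shared / sum(s_grams.values())
    nm_t = shared / sum(t_grams.values())
    nm_avg = (nm_s + nm_t) / 2
    # Rule from the text: if both ratios exceed TH, take the larger one.
    return max(nm_s, nm_t) if min(nm_s, nm_t) > th else nm_avg
```

Under these assumptions, a short source that is fully contained in a much longer target falls back to NM_avg, because NM_T stays below the threshold.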

3.2.2 Reordering of Words (RW)

Plagiarism can come from the reordering of words. We argue that the permutation distance between S and T is an important indicator of reordered plagiarism. The permutation distance is defined as the minimum number of pairwise exchanges between matched words needed to transform a target sentence, T, into the same order of matched words as a source sentence, S. Figure 2 below is a simple example.

Figure 2. An example of reordering of words
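The definition of the permutation distance can be illustrated with a minimal sketch. It assumes that an exchange may swap any two matched words (not only neighbouring ones) and that each matched word occurs once in each sentence; under those assumptions the minimum number of exchanges equals the number of matched words minus the number of cycles in the permutation. If only adjacent exchanges were allowed, the distance would instead be the number of inversions. The function name `permutation_distance` is hypothetical.

```python
def permutation_distance(source_words, target_words):
    """Minimum number of pairwise exchanges that put the matched words of
    the target into the order they have in the source.

    ASSUMPTIONS: an exchange may swap any two matched words (not only
    neighbours), and each matched word occurs once in each sentence.
    """
    source_set = set(source_words)
    target_set = set(target_words)
    matched_in_source = [w for w in source_words if w in target_set]
    matched_in_target = [w for w in target_words if w in source_set]
    # Rank of each matched word in the source order.
    rank = {w: i for i, w in enumerate(matched_in_source)}
    perm = [rank[w] for w in matched_in_target]
    # Minimum swaps to sort a permutation = length - number of cycles.
    seen = [False] * len(perm)
    cycles = 0
    for start in range(len(perm)):
        if seen[start]:
            continue
        cycles += 1
        j = start
        while not seen[j]:
            seen[j] = True
            j = perm[j]
    return len(perm) - cycles
```

For instance, `permutation_distance("the quick brown fox".split(), "fox quick brown the".split())` needs a single exchange of "fox" and "the".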

12 The METER corpus: http://nlp.shef.ac.uk/meter/

As mentioned in ([12], Sörensen and Sevaux, 2005), the permutation distance can be calculated by expressions (4) and (5):

maximum possible distance between S and T, as shown in (7), then the reordering score of the two sentences, expressed as RW(S, T), will be (6):

where

3.2.3 Alignment of Words (AW)

Besides reordering, plagiarists often insert words into a sentence or delete words from it. We model such behavior by finding the alignment of two word sequences. We perform the alignment using the dynamic programming method described in ([14], Wagner and Fischer, 1975). As shown in Figure 3, a word match earns 2 points, while a word mismatch receives a penalty of -1 point. A gap also receives a penalty of -1 point.

Since a gap may span more than one word, each dash symbol (“-”) in a gap that covers a word adds another -1 point. The alignment algorithm tries to maximize the total score, that is, the sum of the points produced by the different types of matching result.

 C   -   A   -   T   A   A   C   T
 C   G   G   A   C   A   -   -   T
+2  -1  -1  -1  -1  +2  -1  -1  +2  = 0

Alignment score: 0 - 1 - 1 - 1 = -3

Figure 3. Alignment of words
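The basic scoring scheme can be sketched as a global alignment computed by dynamic programming. This is a minimal illustration, not the exact procedure used in the experiments: it applies the match, mismatch, and gap scores given in the text, and it assumes that every gap position costs the same, so the additional penalty for gaps is not modelled. The function name `alignment_score` is hypothetical.

```python
def alignment_score(source, target, match=2, mismatch=-1, gap=-1):
    """Best global alignment score between two word sequences.

    Uses the scores given in the text (match +2, mismatch -1, gap -1).
    ASSUMPTION: every gap position costs the same; the additional
    penalty for gaps described in the text is not modelled here.
    """
    rows, cols = len(source) + 1, len(target) + 1
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against nothing costs one gap per word.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if source[i - 1] == target[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + pair,  # match or mismatch
                              score[i - 1][j] + gap,       # gap in target
                              score[i][j - 1] + gap)       # gap in source
    return score[-1][-1]
```

Under these assumptions, aligning "ABCD" with "EEABCDEE" and with "ABEECDEE" yields the same score, since both alignments contain four matches and four gap positions.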

However, this alignment score does not reflect the continuity of the matched words, which can be an important cue for identifying plagiarism. To overcome this drawback, we revise the score as follows.

where

M is the list of matched words, and M_i is the i-th matched word in M. This implies that we prefer fewer unmatched words between two matched ones. Consider the following case in Figure 4.

ABCD aligned with EEABCDEE: --ABCD--
ABCD aligned with ABEECDEE: AB--CD--

Figure 4. An alignment example
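The revised equation is not reproduced in this section, so the following sketch shows only one formula that is consistent with the scores of 3 and 2.33 reported in the text for the two alignments in Figure 4. It assumes that each pair of consecutive matched words contributes the reciprocal of the distance between their positions in the alignment. The function names `positions_of_matches` and `continuity_score` are hypothetical.

```python
def positions_of_matches(alignment):
    """Positions of matched words in an alignment string such as '--ABCD--'."""
    return [i for i, symbol in enumerate(alignment) if symbol != "-"]


def continuity_score(positions):
    """Sum of 1 / distance over consecutive matched words.

    ASSUMPTION: distance is the difference between the positions of two
    consecutive matched words, so adjacent matches contribute 1 and every
    unmatched word in between lowers the contribution.
    """
    return sum(1.0 / (b - a) for a, b in zip(positions, positions[1:]))
```

With this assumption, `continuity_score(positions_of_matches("--ABCD--"))` is 3.0, and the same call on "AB--CD--" is about 2.33.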

When aligning “ABCD” with the two patterns, “EEABCDEE” (alignment result “--ABCD--”) should count as more continuous than “ABEECDEE” (alignment result “AB--CD--”), but the alignment algorithm gives both the same score. With the revised calculation, the two cases score 3 and 2.33 respectively, so “--ABCD--” is considered the more similar alignment under this measure.

Table 3 shows a possible false positive case:

Table 3. An example of matched words with different POS and phrase tags

S: The man likes the well dressed young woman.
T: The face of the woman in red dress looks like the man’s one.

Word | S: POS | T: POS | S: PT | T: PT

POS tags: VBZ: Verb, 3rd person singular present; IN: Preposition; JJ: Adjective
Phrase tags (PT): NP: Noun Phrase; VP: Verb Phrase; PP: Prepositional Phrase; ADJP: Adjective Phrase

Therefore, we further explore syntactic features for plagiarism detection. To achieve this goal, we utilize the Stanford Parser¹³ to obtain POS and phrase tags of the words. For simplicity we abbreviate POS tags as POS and phrase tags as PT. Then we design an equation to measure the POS and PT similarity, which is shown below in (10).

We pay special attention to the case in which a sentence is transformed from the active form to the passive form or vice versa. A subject originally in a Noun Phrase can become a Prepositional Phrase, i.e. “by …”, in the passive form, while the object in a Verb Phrase can become the new subject in a Noun Phrase. Here we use the Stanford Dependencies provided by the Stanford Parser to match the POS/PT between active and passive sentences. We handle only three kinds of phrase tags: NP, VP, and PP. For all other kinds of phrase tags, our system assigns the word the "ELSE" tag.

3.2.5 Semantic Similarity (LDA)

Plagiarists sometimes replace words or phrases with others of similar meaning. While previous work ([6], Li et al., 2006) often explores semantic similarity using lexical databases such as WordNet to find synonyms, we exploit a topic model, specifically Latent Dirichlet Allocation ([1], Blei et al., 2003), to extract the semantic features of sentences. Given a set of documents represented by their word sequences and a topic number n, LDA learns the word distribution for each topic and the topic distribution for each document so as to maximize the likelihood of the word co-occurrences in a document. The topic distribution is often taken as the semantics of a document. We use LDA to obtain the topic distributions of a query and a candidate snippet, and take the cosine similarity between them as a measure of their semantic similarity. To handle the case in which words in the source sentence are reordered, we also tried another approach, which uses the overlap percentage of LDA tags as the LDA score. The computation is the same as that used for the NM score. In our experiments, the latter approach performs better.

13 Stanford Parser, a statistical parser: http://nlp.stanford.edu/software/lex-parser.shtml

The training data for the LDA models are as follows. For English, we use the PAN-2010 corpus. For Chinese, we randomly retrieved 85 review articles from the Web: 33 book reviews, 32 movie reviews, and 20 reviews of music albums.
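The cosine comparison of two topic distributions described in this section can be sketched as below. The sketch assumes that the topic distributions of the query and the snippet have already been inferred by a trained LDA model and are available as equal-length lists of probabilities; the inference step itself is not shown, and the function name `cosine_similarity` is hypothetical.

```python
import math


def cosine_similarity(p, q):
    """Cosine similarity between two topic distributions of equal length."""
    if len(p) != len(q):
        raise ValueError("topic distributions must have the same length")
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    if norm_p == 0.0 or norm_q == 0.0:
        return 0.0  # an all-zero vector carries no topic information
    return dot / (norm_p * norm_q)
```

Two sentences concentrated on the same topic score close to 1, while sentences concentrated on different topics score close to 0.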
