A New Data Hiding Method via Revision History Records on Collaborative Writing Platforms


YA-LIN LEE (National Chiao Tung University) and WEN-HSIANG TSAI (National Chiao Tung University and Asia University, Taiwan)

A new data hiding method via collaboratively-written articles with forged revision history records on collaborative writing platforms is proposed. The hidden message is camouflaged as a stego-document consisting of a stego-article and a revision history created through a simulated process of collaborative writing. The revisions are forged using a database constructed by mining word sequences used in real cases from an English Wikipedia XML dump. Four characteristics of article revisions are identified and utilized to embed secret messages, including the author of each revision, the number of corrected word sequences, the content of the corrected word sequences, and the word sequences replacing the corrected ones. Related problems arising in utilizing these characteristics for data hiding are identified and solved skillfully, resulting in an effective multi-way method for hiding secret messages into the revision history. To create more realistic revisions, Huffman coding based on the word sequence frequencies collected from Wikipedia is applied to encode the word sequences. Good experimental results show the feasibility of the proposed method.

Categories and Subject Descriptors: H.4.3 [Information Systems Applications]: Communications Applications; H.3.1 [Information Storage and Retrieval]: Content Analysis and Indexing—Linguistic processing, dictionaries; H.2.8 [Database Management]: Database Applications—Data mining, statistical databases; I.7 [Document and Text Processing]

General Terms: Security, Algorithms

Additional Key Words and Phrases: Data hiding, Wikipedia mining, collaborative writing, revision history, Huffman coding

ACM Reference Format:

Lee, Y. L. and Tsai, W. H. 2013. A new data hiding method via revision history records on collaborative writing platforms.

1. INTRODUCTION

Data hiding is the art of hiding secret messages in cover media for applications such as covert communication, secret data keeping, access control, and database protection. Types of cover media include image, video, audio, text, etc. Exploiting the weaknesses of the human auditory and visual systems, much research on data hiding has focused on non-text cover media [Cheddad et al. 2010; Doerr and Dugelay 2003; Lie and Chang 2006; Lin et al. 2011; Mohanty and Bhargava 2008; Tai et al. 2009]. Fewer data hiding techniques using text-type cover media have been proposed. Bennett [2004] presented a survey of hiding data in text and classified the related techniques into three categories: format-based methods, random and statistical generation, and linguistic methods.

Format-based methods use the physical formats of documents to hide messages. Some of them utilize spaces in documents to encode message data. For example, Alattar and Alattar [2004] proposed a method that adjusts the distances between words or text lines using spread-spectrum and BCH error-correction techniques, and Kim et al. [2003] proposed a word-shift algorithm that adjusts the spaces between words based on concepts of word classification and statistics of inter-word spaces.

Some other methods utilize non-displayed characters to hide messages. For example, Lee and Tsai [2010] encode message bits using special ASCII codes and hide the result between the words or characters in PDF files.

Random and statistical methods directly generate camouflage texts with hidden messages to resist attacks based on comparison with a known plaintext. For example, Wayner [1992; 2002] proposed a method for text generation based on the use of context-free grammars and tree structures. A method available on a website [Spammimic.com 2010] extends the idea to generate fake spam emails with hidden messages, which are usually ignored by people.

This research was supported in part by the NSC, Taiwan under Grant No. 101-3113-P-009-006 and in part by the Ministry of Education, Taiwan under the 5-year Project II of “Aiming for the Top University” from 2011 through 2015.

Authors’ addresses: Y. L. Lee, Institute of Computer Science and Engineering, National Chiao Tung University, Hsinchu 30010, Taiwan; email: yllee.cs98g@g2.nctu.edu.tw; W. H. Tsai, Department of Computer Science, National Chiao Tung University, Hsinchu 30010, Taiwan; email: whtsai@cis.nctu.edu.tw.

Linguistic methods use written natural languages to conceal secret messages. For example, Chapman et al. [2001] proposed a synonym replacement method that generates a cover text according to a secret message using sentence models and a synonym dictionary. Bolshakov [2004] extended the synonym replacement method by using a specific synonymy dictionary and a very large database of collocations to create a cover text that is more believable to a human reader. Shirali-Shahreza and Shirali-Shahreza [2008] proposed a third synonym replacement method that hides data in a text by substituting words that are termed differently in the UK and the US. Stutsman et al. [2006] proposed a method to hide messages in the noise that is inherent in natural language translation results, without the need to transmit the source text for decoding.

Recently, more and more collaborative writing platforms have become available, such as Google Drive, Office Web Apps, and Wikipedia. On these platforms, a huge number of revisions generated during the collaborative writing process are recorded, and many people work together on shared documents. These platforms are thus very suitable for data hiding applications such as covert communication and secret data keeping, and it is desirable to devise a new method for covert communication or secure keeping of secret messages on them. However, the above-mentioned methods can only be applied to documents with single authors and single revision versions, meaning that they are not suitable for hiding data on collaborative writing platforms. Therefore, the goal of this paper is to propose a new data hiding method that can hide data in documents created on collaborative writing platforms. In more detail, the proposed method simulates a collaborative writing process to generate a fake document, consisting of an article and its revision history, as a camouflage for message bit embedding. As shown in Figure 1, with an article and a secret message as input, the proposed method uses multiple virtual authors to collaboratively revise the article, artificially generating a history of earlier revisions of the article according to the secret message. An ordinary reader will regard the resulting stego-document as a normal collaborative writing output and will not realize the existence of the secret message hidden in the document.

Fig. 1. Basic idea of proposed method that generates a revision history of a stego-document as a camouflage for data hiding.

Moreover, the previously-mentioned linguistic methods use written natural languages to generate stego-documents and can produce more innocuous stego-texts than other data hiding methods, but an issue common to them is how to find a natural way to simulate the writing process and how to obtain large-volume written data automatically. Hence, another goal of this paper is to find a natural way to generate the revision history and to obtain large-volume collaborative writing data automatically. In recent years, some research has been conducted to analyze the revision history data of Wikipedia articles for various natural language processing applications [Bronner and Monz 2012; Bronner et al. 2012; Dutrey et al. 2010; Erdmann et al. 2009; Max and Wisniewski 2010; Nelken and Yamangil 2008; Viégas et al. 2004], such as spelling correction, reformulation, text summarization, user-edit classification, and multilingual content synchronization. In addition to being useful for these applications, the collaboratively written data in Wikipedia are also very suitable, as found in this study, for simulating the collaborative writing process for the purpose of data hiding, since Wikipedia is the largest collaborative writing platform nowadays.

In [Liu and Tsai 2007], Liu and Tsai proposed a data hiding method for Microsoft Word documents that uses the change tracking function; it embeds a secret message by mimicking a pre-draft document written by an author with an inferior writing skill and encoding the secret message by the choices of degenerations in the writing. Although they used three databases for degenerations, their sizes are quite small compared to that of the database constructed from Wikipedia which we make use of for data embedding in this study; as is well known, a data hiding method can embed more bits by using a larger database. Furthermore, in [Liu and Tsai 2007] a stego-document is generated by only two virtual persons, and the change tracking data are made by the one with the better writing skill. This scenario is insufficient for simulating a normal collaborative writing process. Therefore, in this paper we propose a new framework that uses the revision-history data from Wikipedia and simulates real collaborative writing processes to hide secret messages. Four characteristics of collaborative writing processes are analyzed and utilized for message hiding: the author of each revision, the number of corrected word sequences, the content of the corrected word sequences, and the word sequences replacing the corrected ones. The proposed method is useful for covert communication or secure keeping of secret messages on collaborative writing platforms.

In the remainder of this paper, the idea of the proposed method is described in Section 2. Detailed algorithms for collaborative writing database construction, secret message embedding, and secret message extraction are given in Section 3. In Section 4, some experimental results are presented to show the feasibility of the proposed method, and in Section 5, we discuss the security issue, followed by conclusions in Section 6.

2. BASIC IDEA OF PROPOSED METHOD

Collaborative writing is an activity in which more than one author creates an article cooperatively on a common platform. The purposes of establishing a collaborative writing platform include knowledge sharing, project management, data keeping, etc. Many collaborative writing platforms are available, such as Google Drive, Office Web Apps, and Wikipedia, which record the revisions generated during the collaborative writing process. In general, the recorded information of a revision includes: 1) the author of the revision; 2) the time the revision was made; and 3) the content of the revision.

To achieve the goal of creating camouflage revisions in collaborative writing for message hiding in this study, we analyze the existing revision-history data of articles on Wikipedia, currently the largest collaborative writing platform on the Internet. The aim is to obtain real, large-scale collaborative writing data contributed by people all over the world and to use them to create more realistic revision histories, enhancing the effect of data embedding. However, since the collaborative writing process is very complicated, it is hard to find a unified model to simulate it.

Many different types of modifications may be made during the collaborative writing process [Bronner and Monz 2012; Dutrey et al. 2010], such as error corrections, paraphrasing, factual edits, etc.

Moreover, different languages usually require different models due to their distinct grammatical structures. Therefore, in order to obtain useful collaborative writing data automatically from the revision history data on Wikipedia without building models manually, and to generalize the method so that it can be applied to multiple languages, we assume that only word sequence corrections occur during a revision. Based on this assumption, some characteristics of collaborative writing useful for data embedding are identified, which will be discussed in the following. It is noted that various text articles, not only in English but also in other languages, can be utilized as cover media in this study.


The revision history of each article in Wikipedia is stored in a database, and one can recover any previous revision of the article through an interface provided on the site. For this study, we have collected a large set of revision-history data from Wikipedia, and in the proposed method we mine this set to get useful information about word usages in the revisions. Then, we use the acquired information to simulate a collaborative writing process, starting from a cover article, and generate a stego-article with a sequence of revisions according to the secret message and a secret key. The resulting stego-document, including the stego-article and the revision history, looks like a work created by a group of real authors, achieving an effect of camouflage. In contrast, we call the original article with an initially-empty history a cover document in the sequel.

More specifically, the proposed method includes three main phases, as shown in Figure 2: 1) construction of a collaborative writing database; 2) secret message embedding; and 3) secret message extraction. In the first phase, a large number of articles acquired from Wikipedia are analyzed, and useful collaboratively written data about word usages are mined using a natural language processing technique. The mined data are then used to construct a database, called the collaborative writing database and denoted as DBcw subsequently. In the second phase, with the input of a cover document, a secret message, and a secret key, a stego-document with a fake revision history is generated by simulating a real collaborative writing process using DBcw. The revisions in the history are supposed to be made by multiple virtual authors, and the following characteristics of each revision are decided by the secret message: 1) the author of the revision; 2) the number of changed word sequences in the revision; 3) the changed word sequences in the revision; and 4) the word sequences selected from the collaborative writing database DBcw which replace those of 3), called the replacing word sequences in the sequel. In the third phase, an authorized person who has the secret key can extract the secret message from the stego-document, while those who do not have the key cannot do so; they cannot even realize the existence of the secret message, because it is disguised as the revision history in the stego-document. Note that the second and third phases can be applied on any collaborative writing platform, not just Wikipedia; Wikipedia is merely utilized in the first phase to construct the collaborative writing database DBcw in this study.

Fig. 2. Flow diagram of the proposed method.

3. DATA HIDING VIA REVISION HISTORY

In this section, the details of the proposed method for using the analyzed characteristics of collaborative writing to hide secret messages are described: first the collaborative writing database construction, then the secret message embedding, and finally the secret message extraction.

3.1 Collaborative Writing Database Construction

To construct the aforementioned collaborative writing database DBcw, we mine the revision data collected from Wikipedia. There were about 4.2 million articles in the English Wikipedia in May 2013, a very large knowledge repository; it is therefore suitable as a source for constructing the database DBcw desired in this study. Specifically, we first downloaded part of the English Wikipedia XML dump of August 3, 2011, with the complete revision histories of all the articles. Then, we mine the useful collaborative writing data from the downloaded data set under the assumption that only word sequence corrections occur during a revision.

As illustrated in Figure 3, each downloaded article P has a set of revisions {D0, D1, …, Dn} in its revision history, where a newer revision Di has a smaller index i, with D0 being the latest version of the article. For every two consecutive revisions Di and Di–1, we find all the correction pairs between Di and Di–1, each denoted as <sj, sj′>, where sj is a word sequence in revision Di that was corrected to become another one, namely sj′, by the author of revision Di–1. Then, we collect all the correction pairs so found to construct the database DBcw. For example, assume Di = “National Chia Tang University” and Di–1 = “National Chiao Tung University.” Then, the correction pair <s1, s1′> = <“Chia Tang”, “Chiao Tung”> is generated and included in DBcw.

Fig. 3. Illustration of used terms and notations.

Moreover, about the properties of correction pairs, it was observed that if the context of a word sequence sj in revision Di is the same as that of a word sequence sj′ in revision Di–1 (that is, if the preceding word of sj is the same as that of sj′ and the succeeding word of sj is the same as that of sj′ as well), then <sj, sj′> is a correction pair. Accordingly, a novel algorithm is proposed in this study for finding automatically all of the correction pairs between every two consecutive revisions for inclusion in DBcw. The algorithm is an extension of the longest common subsequence (LCS) algorithm [Bergroth et al. 2000]. The details are described in Algorithm 1.

Algorithm 1. Finding correction pairs

Input: two consecutive revisions Di and Di–1 in the revision history of an article P.

Output: the correction pairs between Di and Di–1.

Stage 1: finding the longest common subsequence.

1) (Splitting revisions into word sets) Split Di and Di–1 into two sets of words, W = {w1, w2, …, wn} and W′ = {w1′, w2′, …, wm′}, respectively.

2) (Constructing a counting table by dynamic programming) Construct an n×m counting table T to record the lengths of the common subsequences of W and W′ as follows.

a) Initialize all elements in table T to be zero.

b) Compute the values of table T from the upper left and denote the currently-processed entry in T by T(x, y) with x = 1 and y = 1 initially.

c) If the content of wx is identical to that of wy′, then let T(x, y) = T(x – 1, y – 1) + 1; else, let T(x, y) = max (T(x – 1, y), T(x, y – 1)).

d) If x is less than n, then let x = x + 1 and go to Step 2c); else, if y is less than m, then let x = 1 and y = y + 1 and go to Step 2c); else, regard table T as filled up and continue.


3) (Finding the longest common subsequence) Apply a backtracking procedure to table T, starting from T(n, m), to find the longest common subsequence L = {l1, l2, …, lt}, where each element li in L is a word common to W and W′.

Stage 2: finding the correction pairs.

4) (Finding the correction pairs) Starting from the first element l1 of L with the currently-processed element in L being denoted by lp, find the correction pairs as follows.

a) If the word sequence sj in Di, with its preceding and succeeding words being lp and lp+1, respectively, is not empty, and if the word sequence sj′ in Di–1 with the same context condition is not empty either, then take <sj, sj′> as a correction pair.

b) Increment p by 1 and go to Step 4a); stop when p > t.
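To make this concrete, below is a minimal Python sketch of the correction-pair extraction; it substitutes the standard library's SequenceMatcher for the hand-built LCS table of Stage 1, so pair boundaries may differ from Algorithm 1 in rare edge cases, but the context-bounded "replace" blocks it reports are exactly the <sj, sj′> pairs described above.

```python
from difflib import SequenceMatcher  # LCS-style alignment from the stdlib

def find_correction_pairs(d_i: str, d_i_minus_1: str):
    """Find <sj, sj'> pairs between an older revision Di and the newer
    revision Di-1: word sequences bounded by common context on both sides."""
    w_old, w_new = d_i.split(), d_i_minus_1.split()
    matcher = SequenceMatcher(a=w_old, b=w_new, autojunk=False)
    pairs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "replace":  # a corrected span flanked by common words
            pairs.append((" ".join(w_old[i1:i2]), " ".join(w_new[j1:j2])))
    return pairs

# The paper's example yields the pair <"Chia Tang", "Chiao Tung">:
print(find_correction_pairs("National Chia Tang University",
                            "National Chiao Tung University"))
# [('Chia Tang', 'Chiao Tung')]
```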

We run Algorithm 1 for every two consecutive revisions of all the articles downloaded from Wikipedia to obtain a large set of correction pairs and write them into the database DBcw. Furthermore, we count the total number Ncp of times that each correction pair CP is so obtained, and call the number Ncp the correction count of CP. The correction counts are also kept in the database DBcw for use in the proposed data hiding process.

As a summary, we use a record in the database DBcw to keep the following information about a correction pair <sj, sj′>: 1) the original word sequence sj; 2) the new word sequence sj′; and 3) the correction count Ncp of the pair. Moreover, we define the chosen set of a word sequence s′ in DBcw to be the set that includes all the correction pairs <s, s′> having s′ as their identical new word sequence. For example, Table III (shown in Section 4) shows the chosen set of the word sequence “such as.”
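As an illustration of these records, a chosen set can be held in memory as a nested mapping from a new word sequence to its original word sequences and correction counts; the layout and names below are our own, not from the paper.

```python
from collections import defaultdict

# Illustrative in-memory layout for DBcw: the chosen set of a new word
# sequence s' maps each original word sequence s to the correction
# count Ncp of the pair <s, s'>.
db_cw = defaultdict(lambda: defaultdict(int))

def add_correction_pair(original: str, new: str, times: int = 1) -> None:
    db_cw[new][original] += times  # accumulate the correction count Ncp

# Entries from Tables II/III: <"like", "such as"> was mined 773 times,
# <"for example", "such as"> 39 times.
add_correction_pair("like", "such as", 773)
add_correction_pair("for example", "such as", 39)
print(dict(db_cw["such as"]))  # the chosen set of "such as"
# {'like': 773, 'for example': 39}
```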

3.2 Secret Message Embedding

In the phase of message embedding with a cover document D0 as the input, the proposed system is designed to generate a stego-document D′ with consecutive revisions {D0, D1, D2, …, Dn} by producing a previous revision Di from the current revision Di–1 repeatedly until the entire message is embedded, as shown in Figure 3 where the direction of revision generation is indicated by the green arrows. The stego-document D′ including the revision history {D0, D1, D2, …, Dn} then is kept on a collaborative writing platform, which may be Wikipedia or others. To simulate a collaborative writing process more realistically, we utilize the four aforementioned characteristics of revisions to “hide” the message bits into the revisions sequentially: 1) the author of the previous revision Di, 2) the number of changed word sequences in the current revision Di–1, 3) the changed word sequences in the current revision Di–1, and 4) the replacing word sequences in the previous revision Di, as described in the following.

3.2.1 Encoding the Authors of Revisions for Data Hiding. We encode the authors of revisions to hide message bits in the proposed method. For this, we first select a group of simulated authors, with each author assigned a unique code a, called author a. Then, if the message bits to be embedded form a code aj, we assign author aj to the previous revision Di as its author, thereby embedding the message bits aj into Di. For example, assume that four authors are selected and each is assigned a unique code a as shown in Figure 4. If the message bits aj to be embedded are “01,” then Jessy, with author code “01,” is selected to be the author of revision Di. Moreover, every revision of D0 through Dn is assigned an author according to the corresponding message bits, so an author may be assigned to conduct more than one revision or, conversely, no revision at all (see the sketch after Figure 4).

Fig. 4. Illustration of encoding authors of revisions for data hiding.
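A minimal sketch of this author-encoding step follows, assuming the number of authors Na is a power of two so that every author code has the same length na = log2(Na); apart from Jessy's code “01,” the author names are placeholders.

```python
def pick_author(authors: list[str], message_bits: str) -> tuple[str, str]:
    """Sketch of the author-encoding step: with Na authors and Na a power
    of two, the next na = log2(Na) message bits select the author."""
    na = (len(authors) - 1).bit_length()          # bits per author code
    code, rest = message_bits[:na], message_bits[na:]
    return authors[int(code, 2)], rest            # author, remaining bits

# Four authors; only "Jessy" (code "01") is named in the paper's Figure 4,
# the other names are placeholders for this example.
print(pick_author(["Alice", "Jessy", "Mark", "Tina"], "01100100"))
# ('Jessy', '100100')
```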


3.2.2 Using the Number of Changed Word Sequences for Data Hiding. In the process of generating the previous revision Di from the current one Di–1, we select some word sequences in Di–1 and change them into other ones in Di. We wish to use the number Ng of word sequences changed in this process as a message-bit carrier as well.

To implement this aim, we first set a limit Nc on the magnitude of Ng, taken to be the maximum allowed number of word sequences in Di–1 that can be changed to yield Di. This limitation makes the simulated step of revising Di–1 into Di look more realistic, because usually not very many words are corrected in a single revision. Next, we scan the word sequences in the text of the current revision Di–1 sequentially and search DBcw to find all the correction pairs <sj, sj′> with sj′ appearing in Di–1. Then, we collect all the sj′ in these pairs as a set Qr, which we call the candidate set of word sequences for changes in Di–1. Finally, we select Ng word sequences in Qr to form a set Qc such that the binary version of the number Ng is just the current message bits to be embedded.

But for this process of using Ng as a message-bit carrier to be feasible, several problems must be solved beforehand, including: 1) the dependency problem, 2) the selection problem, 3) the consecutiveness problem, and 4) the encoding problem, as described in the following.

3.2.2.1. The dependency problem. We say that two word sequences in Di–1 are dependent if some identical words appear in both of them. Changing word sequences with this property in Di–1 will cause conflicts, leading to a dependency problem, which we explain by an example as follows.

As shown in Figure 5(a), Di–1 = “you are not wrong, who deem that my days have been a dream” and Qr includes 11 word sequences, denoted as q1 through q11. From Figure 5(a) we can see that the word sequences q2, q3, and q5 in Qr are dependent on the word sequence q4 because the intersection of each of the former three with the latter is non-empty. If we correct q4 = “are not wrong” in Di–1 to be another word sequence, say “is right,” then the dependent word sequences q2, q3, and q5 in Di–1 cannot be selected and changed anymore, because they include words of q4 which have already been changed and have disappeared. That is, no part of a changed word sequence can be changed again; otherwise, a dependency problem will occur.

Fig. 5. Illustration of the dependency problem. (a) Revision Di–1 and candidate set Qr, where the dependent word sequences are surrounded by red squares. (b) Set I that corresponds to the set Qr for solving the dependency problem.

To avoid this problem in creating Di from Di–1, we propose a two-step scheme: 1) decompose Qr into a set of lists, I = {I1, I2, …, Iu}, with each list Ii including a group of mutually dependent word sequences (i.e., with every word sequence in each Ii being dependent on another in the same list), and with every two word sequences from two different lists of I being independent of each other; and 2) select only word sequences from different lists in set I and change them to construct a new revision.

The details of implementing the first step are described in Algorithm 2. After applying the first step to the set Qr as shown in Figure 5(b), Qr is transformed into I = {(q1), (q2, q3, q4, q5), (q6), (q7), (q8), (q9, q10), (q11)}, where each pair of parentheses encloses a list of mutually dependent word sequences. With I ready, we can now select word sequences from distinct lists Ii in it, such as q1, q2, q6, and q9, to simulate changes of word sequences in revision Di–1 without causing the dependency problem.
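The grouping step can be sketched as follows, under the assumption that each candidate word sequence is identified by its span of word positions in Di–1 (two sequences are then dependent exactly when their spans overlap); chaining overlapping spans after sorting by start position yields the lists of I. The function and variable names are illustrative.

```python
def group_dependent_sequences(occurrences):
    """Partition candidate word sequences into lists of mutually dependent
    (position-overlapping) sequences. `occurrences` maps each candidate q
    to its (start, end) word-index span in the current revision."""
    items = sorted(occurrences.items(), key=lambda kv: kv[1])  # by start
    lists, current, current_end = [], [], -1
    for q, (start, end) in items:
        if current and start <= current_end:   # overlaps the open group
            current.append(q)
            current_end = max(current_end, end)
        else:                                  # independent: start new list
            if current:
                lists.append(current)
            current, current_end = [q], end
    if current:
        lists.append(current)
    return lists

# Spans roughly mimicking Figure 5(a): q2..q5 overlap, as do q9 and q10.
spans = {"q1": (0, 0), "q2": (1, 2), "q3": (2, 3), "q4": (1, 3),
         "q5": (3, 4), "q6": (5, 5), "q7": (6, 7), "q8": (8, 9),
         "q9": (10, 11), "q10": (11, 12), "q11": (13, 13)}
print(group_dependent_sequences(spans))
# [['q1'], ['q2', 'q4', 'q3', 'q5'], ['q6'], ['q7'], ['q8'], ['q9', 'q10'], ['q11']]
```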

3.2.2.2. The selection problem. It is desired to select word sequences for use in the simulated revisions according to their usage frequencies in DBcw, so that a more frequently-corrected word sequence has a larger probability of being selected, forging a more realistic revision. For this aim, following [Liu and Tsai 2007; Wayner 1992], we adopt the Huffman coding technique to create unique Huffman codes for the word sequences in Qr according to their usage frequencies, and select the word sequence whose code is identical to the leading message bits to be embedded. Specifically, by a property of Huffman coding, the lengths of the resulting Huffman codes of the word sequences are inversely related to the usage frequencies of the word sequences. So a word sequence with a shorter Huffman code has a larger probability of being selected, which can be computed as (1/2)^L, where L denotes the number of bits of the code. That is, the use of Huffman coding indeed achieves the aim of selecting word sequences in favor of those which are more frequently corrected in real cases.
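The following sketch shows a standard heap-based Huffman construction over usage frequencies together with the prefix-match selection it enables. Since Huffman codes are not unique, the exact bit patterns depend on tie-breaking and may differ from those shown in the paper's figures, although the code lengths agree.

```python
import heapq
from itertools import count

def huffman_codes(freqs):
    """Build Huffman codes from usage frequencies with a standard
    heap-based construction; shorter codes go to higher frequencies."""
    tiebreak = count()  # keeps heap entries comparable when counts tie
    heap = [(f, next(tiebreak), {w: ""}) for w, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)
        f2, _, c2 = heapq.heappop(heap)
        merged = {w: "0" + c for w, c in c1.items()}
        merged.update({w: "1" + c for w, c in c2.items()})
        heapq.heappush(heap, (f1 + f2, next(tiebreak), merged))
    return heap[0][2]

def select_by_message(freqs, message_bits):
    """Pick the word sequence whose code is a prefix of the message bits;
    for a complete prefix-free code exactly one candidate matches."""
    for w, code in huffman_codes(freqs).items():
        if message_bits.startswith(code):
            return w, len(code)  # chosen sequence and bits consumed
    raise ValueError("message shorter than every code")

# Frequencies from the example in Section 3.2.3: q9=100, q10=50, q11=150.
freqs = {"q9": 100, "q10": 50, "q11": 150}
print(huffman_codes(freqs))  # e.g. {'q11': '0', 'q10': '10', 'q9': '11'}
print(select_by_message(freqs, "10100"))
```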

But a problem arises here: after we select one word sequence qy in this way, qy cannot be used in the revision again for encoding an identical succeeding code in the message, because qy has already been changed into another word sequence. We call this the selection problem; it comes partially from the unique decodability property of Huffman coding. To illustrate, consider the previous example shown in Figure 5 again: the Huffman codes for word sequences q1 through q11 are shown in Figure 6(a), and the message bit sequence to be embedded currently is “100100…,” with the first six bits being just two repetitions of the code “100.” At first we select word sequence q4 and change it into another word sequence in the revision, because the first three message bits to be embedded, “100,” are just the code for q4 (indicated in red). After this, the next three message bits to be embedded are again the code “100” (shown in blue in Figure 6(a)); however, the corresponding word sequence q4 cannot be selected any further because it has already been changed in the current revision version, and no other word sequence can be selected either, because their codes are not the same as the current message bits “100” to be embedded.

Fig. 6. Illustration of the selection problem. (a) Huffman codes for the word sequences and the message bits that are encountered in the selection problem. (b) Division of the word sequences into groups to solve the selection problem.

To solve this selection problem, based on the use of a key, we randomly assign the word sequences in Qr to Ng groups G1 through GNg, each group including multiple, but distinct, word sequences, where Ng is the number of word sequences changed in Di–1. Then, starting from group G1, we apply Huffman coding to assign codes to all word sequences in the currently-processed group Gk according to their usage frequencies, and select the word sequence in Gk whose assigned code is identical to the leading message bits for use in the revision. We apply this step repeatedly until all groups are processed. In this process, Huffman coding is applied to each Gk with word sequences distinct from those in the other groups, so the selection problem of choosing a word sequence twice due to code repetition in the message cannot happen any more. For example, as shown in Figure 6(b), Qr is divided into three groups, G1, G2, and G3, represented by red, blue, and green colors, respectively. Starting from G1, we assign Huffman codes to the elements in each group as shown in Figure 6(b). Then, q2 is selected because the code of q2 is the same as the first three bits “100…” of the message to be embedded. Next, in G2, q8 is selected because the message bits to be embedded are currently “100…” Finally, q11 in G3 is selected because the current message bits to be embedded are “0…” In this way, the previous problem of being unable to embed the repeated code “100” is solved automatically. In short, by randomly decomposing the candidate set Qr of word sequences for changes into groups and representing each group by a Huffman code, we can embed message bits sequentially by changing only one word sequence in each group without causing the selection problem.

However, the above process is insufficient; it must be modified so that word sequences with mutual dependency relations are placed into the same group, in order to avoid the dependency problem discussed in Section 3.2.2.1. For this aim, instead of decomposing the word sequences in Qr directly into random groups as mentioned previously, we randomly divide the mutually-independent list elements of I mentioned in Section 3.2.2.1 into Ng groups, each denoted by GIk. Then, we take all the word sequences in the lists in each GIk to form a new group of word sequences, denoted as Gk, resulting again in Ng groups of word sequences. For instance, for the previous example shown in Figure 5, let Ng = 2 and suppose that the list elements of I are decomposed randomly into two groups, GI1 = {I1, I2, I3, I4} and GI2 = {I5, I6, I7}. Then, this procedure yields the two groups G1 = {q1, …, q7} and G2 = {q8, q9, q10, q11}.

3.2.2.3. The consecutiveness problem. As shown in Figure 7(a), for example, the word sequence “increase in” in revision Di–1 is seen to become “improve themselves” in revision Di. This effect comes from two changes made during message embedding: the word sequence “increase” in Di–1 was changed to “improve” in Di, and the word sequence “in” in Di–1 was changed to “themselves” in Di. However, because the two words “improve” and “themselves” are consecutive in Di, the two changes might be considered a single one during secret message extraction; that is, the word sequence “increase in” in Di–1 might be regarded as having been changed to “improve themselves” in Di. This ambiguity causes a problem: we cannot know whether a change from a word sequence in Di–1 to another in Di comes from one group or two, or equivalently, we cannot know the true number Ng of changed word sequences in Di–1, so we cannot later extract the embedded message bits correctly. We call this difficulty in message extraction the consecutiveness problem.

Fig. 7. Illustration of the consecutiveness problem. (a) An example illustrating the consecutiveness problem. (b) Choosing splitting points randomly to solve the consecutiveness problem.

Obviously, word sequences in different groups must be made non-consecutive in order to solve this problem. For this aim, the previously-mentioned solution to the selection problem is modified further. Specifically, by the use of a key again, we randomly choose Ng – 1 lists of the set I, say Ii1, Ii2, …, IiN with N = Ng – 1, for use as splitting points to divide I into Ng groups, with Ii1 through IiN not included in any of the Ng groups. For instance, let Ng = 2 for the previous example shown in Figure 5; the number of splitting points is then computed to be Ng – 1 = 1. Consequently, as shown in Figure 7(b), we choose a splitting point, say I5, to divide the set I into two groups, GI1 and GI2, both not including I5. The final groups of word sequences then become G1 = {q1, q2, q3, q4, q5, q6, q7} and G2 = {q9, q10, q11}. Because of the existence of the splitting point I5 = (q8), groups G1 and G2 are non-consecutive, and accordingly using them to create word sequence changes in revisions causes no consecutiveness problem.

3.2.2.4. The encoding problem. The remaining issue is how to determine the aforementioned number Ng of word sequences to be changed in Di–1. Although a limit Nc is set for Ng, the maximum number Nm of word sequences that can actually be selected in Di–1 may be even smaller than Nc. Therefore, we must compute Nm first before we can embed message bits through the number Ng. After Nm is decided, Ng may be taken to be a number not larger than Nm, with the actual value of Ng decided by the leading secret message bits, say nm of them. Consequently, we may assume that Nm satisfies the two constraints: 1) Nm = 2^nm, where nm is a positive integer; and 2) 1 ≤ Nm ≤ Nc. In addition, in order to embed message bits by selecting a word sequence from a group Gk, the number of elements in Gk should not be smaller than two, so that at least one message bit can be embedded by Huffman coding; hence, each group GIk mentioned previously should be created to include at least two elements of I. Accordingly, the maximum number Nm of word sequences to be changed in Di–1 can be figured out to satisfy the following formula:

[NI – (Nm – 1)]/Nm ≥ 2, (1)

where NI is the number of elements in set I and Nm – 1 represents the aforementioned number of chosen splitting points. Inequality (1) can be reduced to

Nm ≤ (NI + 1)/3. (2)

Accordingly, we can compute Nm by the following rule:

if (NI + 1)/3 > Nc, set Nm = Nc;
if 1 ≤ (NI + 1)/3 ≤ Nc, set Nm = 2^⌊log2((NI + 1)/3)⌋. (3)

Furthermore, the content of Di–1 might be too short for Nm to be decided by Eq. (3). In that case, we abandon the original cover document D0 from which Di–1 is generated and use another, longer cover document as the input. After the value of Nm is computed, we can then use the leading nm bits of the message to decide the number Ng of changed word sequences in Di–1 by two steps: 1) express the first nm message bits as a decimal number; and 2) increment the decimal number by one. The second step is required to handle the case where the first nm message bits are all zeros, which would lead to the undesired result of no word sequence being changed in the current revision. In this way, Ng really becomes a carrier of nm message bits. For example, the number of elements of the set I for the previously-mentioned example shown in Figure 5 is NI = 7. Let Nc = 4. Because (NI + 1)/3 = (7 + 1)/3 ≈ 2.67 ≤ 4 = Nc, Nm is computed to be 2^⌊log2(8/3)⌋ = 2^1 = 2 according to Eq. (3). So, nm = log2 Nm = 1. And if the secret message is “101001…,” then the number Ng of changed word sequences should be taken, according to the above two steps, to be Ng = 1 + 1 = 2, because the first nm = 1 bit of the secret message is “1.”

3.2.3 Encoding the Changed Word Sequences in the Current Revision for Data Hiding. According to the previous discussions, we may assume that we have computed the number Ng of word sequences which should be changed in the current revision Di–1 according to the first nm bits of the secret message, and that we have classified the available word sequences in Qr into Ng groups, where each group Gk includes at least two word sequences and all word sequences in Gk are encoded by Huffman coding according to their usage frequencies. Specifically, the usage frequency of a word sequence sj' is taken to be the summation of the correction counts of all the correction pairs in the chosen set of sj', which have sj' as their common new word sequence. Then, starting from G1, we may select from each group Gk one word sequence with a Huffman code identical to the leading bits of the message to be embedded, achieving the goal of data hiding via changing word sequences in Di–1.


For example, assume that the usage frequencies of the word sequences in group G2 as shown in Figure 7(b) are: q9 = 100, q10 = 50, and q11 = 150; and the message is “10100….” Then, the Huffman codes assigned to q9, q10, and q11 are “01,” “00,” and “1,” respectively; and so we select q11 to hide the first bit “1” of the message because the code of q11 is “1.”

3.2.4 Encoding the Replacing Word Sequences in the Previous Revision for Data Hiding.

Symmetrically, we may also use the replacing word sequences in Di to embed message data, where each replacing word sequence sj in Di corresponds to a changed word sequence sj′ in Di–1, forming a correction pair <sj, sj′>. Specifically, recall that for each sj′, we can find a chosen set of correction pairs in DBcw. From this set, we can collect all the original word sequences of the correction pairs as another set Qc′, with each word sequence in Qc′ being appropriate for use as the replacing word sequence sj. Let Qc′ = {s1, s2, …, sw}. Then, to carry out message data hiding, we encode all sj in Qc′ by Huffman coding according to their usage frequencies as well, and choose the one whose code is identical to the leading message bits for use as the word sequence sj replacing sj′. Here, the usage frequency of each sj is the correction count of the correction pair <sj, sj′>. For example, Table III shows the chosen set of the word sequence “such as,” with all included original word sequences already assigned Huffman codes according to their usage frequencies. Based on the table, if the message to be embedded currently is “01001001…,” then we change the word sequence “such as” in the current revision Di–1 to the word sequence “for example” in the previous revision Di, because the Huffman code for “for example,” namely “0100,” is the same as the first four bits of the secret message.

3.2.5 Secret Message Embedding Algorithm. As a summary, we have demonstrated the usability of the aforementioned four characteristics of revisions for data hiding. Therefore, we can generate a stego-document with a forged revision history which looks like a realistic work written by people collaboratively. The details of the proposed message embedding process are described in Algorithm 2 below.

Algorithm 2. Secret message embedding

Input: a cover document D0 with an article to be revised collaboratively, a binary message M of length t, a secret key K, and a collaborative writing database DBcw constructed by Algorithm 1.

Output: a stego-document D′ with a revision history {D0, D1, D2, …, Dn}.

Stage 1: message preparation and parameter determination.

1) (Message composition) Affix an s-bit binary version of t to the beginning of M to compose a new binary message M′, where the value s is agreed by the sender and the receiver beforehand.

2) (Message encryption) Randomize M′ to yield a new binary message M′′ using K.

3) (Parameter determination) Use K to decide randomly both the number Na of authors and the limit Nc on the number Ng of word sequences to be changed in every revision.

4) (Author encoding) Use K to select Na authors from those who were involved in works conducted on the collaborative writing platform to form an author list Ia, and assign a unique na-bit code to each selected author in Ia.

Stage 2: revision generation and message embedding.

5) (Message embedding and revision generation) Generate the previous revision Di from the current revision Di–1 repeatedly, where i = 1 initially, while embedding the binary message M′′ by running Algorithm 4, which was designed according to the schemes described in Sections 3.2.1 through 3.2.4 and is shown in the Online Appendix, with the inputs Di–1, M′′, K, and Ia, until all bits in M′′ are embedded.

6) If message M′′ is not exhausted, then repeat the above process to generate more revisions; otherwise, collect the finally-revised article and the history of all the revisions, D0 through Dn, as a stego-document D′, and take D′ as the output for use on the collaborative writing platform.
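Stage 1 of Algorithm 2 and its inverse in Algorithm 3 (Stages 1 and 3) can be sketched as follows. The paper does not specify the randomization primitive, so a key-seeded XOR keystream is assumed here purely for illustration, with s = 16 as the pre-agreed width of the length field.

```python
import hashlib
import random

def prepare_message(m_bits: str, key: str, s: int = 16) -> str:
    """Sketch of Algorithm 2, Stage 1: prefix the s-bit length of M, then
    randomize with a key-seeded stream (an assumed stand-in for the
    unspecified encryption; s is agreed by sender and receiver)."""
    m_prime = format(len(m_bits), f"0{s}b") + m_bits   # Step 1: affix t
    seed = int.from_bytes(hashlib.sha256(key.encode()).digest(), "big")
    rng = random.Random(seed)
    keystream = [rng.randint(0, 1) for _ in m_prime]
    return "".join(str(int(b) ^ k) for b, k in zip(m_prime, keystream))

def recover_message(m2_bits: str, key: str, s: int = 16) -> str:
    """Inverse used in Algorithm 3 (XOR with the same keystream is
    self-inverse); the first s bits give t, the rest carry M."""
    seed = int.from_bytes(hashlib.sha256(key.encode()).digest(), "big")
    rng = random.Random(seed)
    keystream = [rng.randint(0, 1) for _ in m2_bits]
    m_prime = "".join(str(int(b) ^ k) for b, k in zip(m2_bits, keystream))
    t = int(m_prime[:s], 2)
    return m_prime[s:s + t]

bits = "10110"
assert recover_message(prepare_message(bits, "1234"), "1234") == bits
```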


3.3 Secret Message Extraction

We can extract the secret message from the stego-document by a reverse version of the message embedding process described in Algorithm 2. The details are given in Algorithm 3 below.

Algorithm 3. Secret message extraction

Input: a stego-document D′ including revision history {D0, D1, D2, …, Dn} of a collaboratively-revised article, the secret key K used in Algorithm 2, and the database DBcw constructed by Algorithm 1.

Output: a binary message M.

Stage 1: preparation.

1) (Parameter determination and author encoding) Use K to decide randomly the number Na of authors, the limit Nc on the number Ng of word sequences to be changed in every revision, and the list Ia of Na authors, in the same way as Steps 3 and 4 of Algorithm 2.

Stage 2: encrypted message extraction.

2) (Message bit extraction) For each revision Di–1 with i = 1 initially, extract the binary message m by running Algorithm 5, which is essentially a reverse version of Algorithm 4 and is shown in the Online Appendix, with the inputs Di–1, Di, K, and Ia; append the result m to a bitstream M′′, which is set empty initially, and repeat until i > n.

Stage 3: message content recovery.

3) (Message decryption) Decrypt the bitstream M′′ to get M' using K.

4) (Message extraction) Express the first s bits of M' in decimal form as t and output the (s + 1)th through (s + t)th message bits of M' as the secret message M.

4. EXPERIMENTAL RESULTS

A collaborative writing database DBcw was constructed by mining the huge volume of collaborative writing data in Wikipedia using Algorithm 1 described previously. Note that this is a totally automatic process and needs to be performed only once to build the database DBcw; in it, 3,446,959 different correction pairs were mined from 2,214,481 pages with 33,377,776 revisions in the English Wikipedia XML dump. The total size of the downloaded Wikipedia data is about 210.3 GB, while the size of the mined data is just 888 MB. Moreover, some revisions might suffer from vandalism [Bronner and Monz 2012; Dutrey et al. 2010]; revisions that had been reverted due to vandalism were detected by the method proposed by Bronner and Monz [2012] and ignored. Keywords in Wiki markup1 were ignored as well. Table I shows the top 20 most frequently used correction pairs, where the one in first place is the pair <“BCE”, “BC”> with a correction count of 19,430. Table II shows some correction pairs, each having more than one word in either its original word sequence or its new word sequence. One of the correction pairs in this table is <“like”, “such as”>, with a correction count of 773.

The constructed database DBcw contains 1,688,732 chosen sets of correction pairs, where all the correction pairs in a chosen set have identical new word sequences; this means there are 1,688,732 word sequences which can be chosen and changed into other word sequences in the message embedding phase. Figure 8 illustrates the numbers of entries in the chosen sets with sizes from 2 to 40. Table III shows the content of the chosen set with the new word sequence “such as,” as well as the usage frequency and Huffman code of each original word sequence which may be replaced by “such as” during message embedding. From the table, we can see that the most frequently used original word sequence is “like,” so it has the shortest code “1” and the largest probability of being chosen.

1 http://en.wikipedia.org/wiki/Help:Wiki_markup

After the message embedding phase, the proposed system generates a stego-document to be kept on a collaborative writing platform, and a user can later extract the embedded message from it using a key. Each generated stego-document, including its revision history, was kept on a Wiki site constructed in this study using the free software MediaWiki2. Note that though the pre-selected collaborative writing platform here is the constructed Wiki site, the proposed method can be used on other collaborative writing platforms as well. As an example, with the cover article shown in Figure 9(a), the message “Art is long, life is short,” and the key “1234” as inputs to Algorithm 2, the stego-article shown in Figure 9(c), together with the revision history shown in Figure 9(b), was generated by the proposed method.

2 http://www.mediawiki.org/wiki/MediaWiki.

Fig. 8. The numbers of entries of the chosen sets with sizes from 2 to 40.

Table II. Some Correction Pairs Each with More Than One Word Either in the Original Word Sequence or in the New Word Sequence

Original word sequence    New word sequence    Usage frequency
Irish evil                Evil                 2,367
Evil                      Irish evil           2,253
US                        United States        1,094
It's                      It is                1,052
due to the fact that      because                359
have been                 were                   348
will be                   was                    903
due to                    because of             933
like                      such as                773
didn't                    did not                665
passed away               died                   374
doesn't                   does not               489
WWII                      World War II           395
UK                        United Kingdom         599

Table I. Top Twenty Frequently Used Correction Pairs

Original word sequence    New word sequence    Usage frequency
BCE                       BC                   19,430
BC                        BCE                  17,878
color                     colour               15,356
colour                    color                14,852
The                       the                  14,232
a                         an                    9,792
it's                      its                   9,658
is                        was                   9,607
an                        a                     8,954
was                       is                    7,407
the                       a                     7,009
is                        are                   6,908
a                         the                   6,278
are                       is                    5,430
colors                    colours               5,301
colours                   colors                5,078
CE                        AD                    4,833
AD                        CE                    4,262
image                     Image                 4,259
was                       were                  3,924


We can see from Figure 9(b) that five revisions were created in order to embed the secret message. Figures 9(d) and 9(e) show extracts of the differences between the two newest revisions, where the words in red in Figure 9(d) were corrected to those in red in Figure 9(e) by the author “Natalie.” Figures 9(f) and 9(g) show, respectively, the messages extracted by Algorithm 3 using the right key and a wrong one. These results show that when a user uses a wrong key, the system returns a random string as the message extraction result.

Fig. 9. An example of a generated stego-document on the constructed Wiki site with the input secret message “Art is long, life is short.” (a) Cover document. (b) Revision history. (c) Stego-document. (d) Previous revision of (e), with the words in red being those corrected to new words (also in red) in (e). (e) Newest revision of the created stego-document. (f) Correct secret message extracted with the right key “1234.” (g) Wrong secret message extracted with a wrong key “123.”

A series of experiments with different parameters was also conducted to quantitatively measure the data embedding capacity of the proposed method using many cover documents as inputs. Since the data embedding capacity depends on the secret message content, which influences the selections of authors and changed word sequences for each revision, we ran the experiments for each document ten times using different messages as inputs and recorded the average of the resulting data embedding capacities. The parameters of six different cover documents are shown in Table IV. For example, document 1 has 2,419 characters, 641 words, and 80 sentences; document 3 has 10,128 characters, 2,211 words, and so on.

In these experiments, we first restricted the replacing word sequences for a revision to the top n most frequently used ones in the database DBcw, where n = 2, 4, 8, 16, 32. Figure 10(a) shows the resulting data embedding capacities, from which we can see that the more replacing word sequences are available for selection, the more message bits are embedded. This result comes from the fact that when more replacing word sequences are available, the constructed Huffman codes become longer.

Table III. An Example of a Chosen Set with the New Word Sequence “such as”

Original word sequence    Usage frequency    Huffman code
like                      773                1
including                 143                00
for example               39                 0100
of                        29                 01110
notably                   23                 01101
especially                20                 01011
and                       16                 011110
specifically              12                 011001
namely                    10                 011000
particularly              10                 0111111
like the                  10                 010100
most notably              10                 010101
include                   9                  0111110


We have also conducted experiments using different numbers of revisions (1, 2, 4, 8) in the generated stego-documents to see the resulting data embedding capacities. Figure 10(b) shows the results, which indicate that when the number of revisions in the stego-document is larger, more message bits can be embedded, as expected. This means that if we want to embed a larger secret message, more revisions should be generated. Yet, on a Wiki site each revision is stored as its original text without any compression, so a larger storage space is required to store the additional revisions generated when the secret message is longer. However, this issue can be solved by simply comparing two adjacent revisions and storing only the difference between them; such a comparison function may be provided by other collaborative writing platforms if desired.

Furthermore, we can also see from Figures 10(a) and 10(b) that when a cover document has a larger size, the resulting data embedding capacity is larger as well. Thus, to embed more data, we have to choose a larger cover document.

Fig. 10. The embedding capacities. (a) Embedding capacities of documents with chosen sets of different sizes. (b) Embedding capacities of documents with different numbers of revisions.

Figure 11 shows a comparison of the embedding capacities yielded by the proposed method with those yielded by Liu and Tsai’s method [2007]. We can see from Figure 11 that when the number of revisions of the proposed method is equal to one, the embedding capacity of the proposed method is very close to that yielded by Liu and Tsai [2007]. Note that not every word sequence in the current revision Di–1 can be utilized for data embedding in the proposed method, because we limit the maximum number of corrected word sequences in a revision. Thus, when the number of revisions is just one, the embedding capacity of the proposed method may not be better than that of Liu and Tsai [2007], which allows the use of every word for message embedding. However, when the number of revisions is two or more, the embedding capacities of the proposed method are much larger.

Like the methods proposed in [Bronner and Monz 2012; Bronner et al. 2012], which can be utilized for multiple languages, we have tried to apply Algorithm 1 to two adjacent revisions of a Chinese document and obtained the correction pairs between them successfully, where the two revisions are shown in Figure 12. Note that since Chinese has no explicit word segmentation marks, we cannot use spaces to split a Chinese article into words; therefore, each Chinese character was treated directly as a word to solve this issue. Figure 12 shows the found correction pairs between the two revisions, one of which, for example, is <做到, 達成>, where both word sequences mean “achieve” in English.

Table IV. The Information of the Experimental Documents

Document      Characters    Words     Sentences
Document 1    2,419         641       80
Document 2    4,762         956       45
Document 3    10,128        2,211     121
Document 4    11,215        2,617     86
Document 5    26,591        6,180     631
Document 6    60,349        14,306    1,603


Fig. 11. Comparison of the embedding capacities yielded by Liu and Tsai [2007] and by the proposed method using different numbers of revisions.

Fig. 12. An example showing the interoperability of the proposed method, which can be applied to Chinese articles.

Moreover, to present the contributions made by the proposed method, we have compared it with several other methods for data hiding via texts [Bolshakov 2005; Chapman et al. 2001; Shirali-Shahreza and Shirali-Shahreza 2008; Liu and Tsai 2007], as shown in Table V. Firstly, the synonym replacement methods [Bolshakov 2005; Chapman et al. 2001; Shirali-Shahreza and Shirali-Shahreza 2008] utilize synonym dictionaries to embed messages, where the synonym dictionaries were usually built manually by language experts. The embedding capacities of these methods are limited, since only those word sequences in the cover document that exist in the synonym dictionary can be utilized for data embedding. Also, since they replace the word sequences in a cover document with their synonyms, the resulting stego-document is usually a worse version of the original cover document due to possible losses of the original meanings in the replacements. Furthermore, the usage frequencies of the corresponding synonyms of a word sequence are not analyzed in these methods. Secondly, the change tracking method proposed by Liu and Tsai [2007] utilizes synonym dictionaries and a small collaborative writing database with only 7,581 chosen sets to embed messages, where the synonym dictionaries were built manually as well. The embedding capacity of this method is also limited, since only two revisions are generated by two authors and only the word sequences in the cover document are degenerated for data embedding. Moreover, the usage frequencies of word sequences in this method are just simulated ones created by using the Google SOAP Search API.

As a summary, several merits of the proposed method can now be pointed out: (1) the database of the proposed method is constructed automatically from Wikipedia, the largest collaborative writing platform on the Internet; therefore, the resulting stego-document generated by the proposed method is more realistic than those generated by the other four methods [Bolshakov 2005; Chapman et al. 2001; Shirali-Shahreza and Shirali-Shahreza 2008; Liu and Tsai 2007]; (2) the database constructed by the proposed method is much larger than that of Liu and Tsai [2007], with 1,688,732 chosen sets in the former and only 7,581 in the latter; (3) the usage frequency of each correction pair used in the proposed method is a real parameter obtained by mining the collaborative writing data found on Wikipedia, while that of Liu and Tsai [2007] is just a simulated one created by using the Google SOAP Search API; and (4) the proposed method can simulate the collaborative
