A New Data Hiding Method via Revision History Records on Collaborative Writing Platforms


A new data hiding method via collaboratively-written articles with forged revision history records on collaborative writing platforms is proposed. The hidden message is camouflaged as a stego-document consisting of a stego-article and a revision history created through a simulated process of collaborative writing. The revisions are forged using a database constructed by mining word sequences used in real cases from an English Wikipedia XML dump. Four characteristics of article revisions are identified and utilized to embed secret messages, including the author of each revision, the number of corrected word sequences, the content of the corrected word sequences, and the word sequences replacing the corrected ones. Related problems arising in utilizing these characteristics for data hiding are identified and solved skillfully, resulting in an effective multiway method for hiding secret messages into the revision history. To create more realistic revisions, Huffman coding based on the word sequence frequencies collected from Wikipedia is applied to encode the word sequences. Good experimental results show the feasibility of the proposed method.

Categories and Subject Descriptors: H.4.3 [Information Systems Applications]: Communications Applications; H.3.1 [Information Storage and Retrieval]: Content Analysis and Indexing—Linguistic processing, dictionaries; H.2.8 [Database Management]: Database Applications—Data mining, statistical databases; I.7 [Document and Text Processing]

General Terms: Security, Algorithms

Additional Key Words and Phrases: Data hiding, Wikipedia mining, collaborative writing, revision history, Huffman coding

ACM Reference Format:

Y.-L. Lee and W.-H. Tsai. 2014. A new data hiding method via revision history records on collaborative writing platforms. ACM Trans. Multimedia Comput. Commun. Appl. 10, 2, Article 20 (February 2014), 21 pages.

DOI: http://dx.doi.org/10.1145/2534408

1. INTRODUCTION

Data hiding is the art of hiding secret messages in cover media for applications such as covert communication, secret data keeping, access control, and database protection. Types of cover media include image, video, audio, and text. Exploiting the weaknesses of the human auditory and visual systems, many studies on data hiding have focused on nontext cover media, such as [Cheddad et al. 2010; Doerr and Dugelay 2003; Lie and Chang 2006; Lin et al. 2011; Mohanty and Bhargava 2008; Tai et al.

This research was supported in part by the NSC, Taiwan under Grant No. 101-3113-P-009-006 and in part by the Ministry of Education, Taiwan under the 5-year Project II of “Aiming for the Top University” from 2011 through 2015.

Authors’ addresses: Y.-L. Lee, Institute of Computer Science and Engineering, National Chiao Tung University, Hsinchu 30010, Taiwan; email: yllee.cs98g@g2.nctu.edu.tw; W.-H. Tsai, Department of Computer Science, National Chiao Tung University, Hsinchu 30010, Taiwan; email: whtsai@cis.nctu.edu.tw.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this work in other works requires prior specific permission and/or a fee. Permissions may be requested from Publications Dept., ACM, Inc., 2 Penn Plaza, Suite 701, New York, NY 10121-0701 USA, fax+1 (212) 869-0481, or permissions@acm.org.

© 2014 ACM 1551-6857/2014/02-ART20 $15.00 DOI: http://dx.doi.org/10.1145/2534408


Fig. 1. Basic idea of proposed method that generates a revision history of a stego-document as a camouflage for data hiding.

2009]. Fewer data hiding techniques using text-type cover media have been proposed. Bennett [2004] surveyed techniques for hiding data in text and classified them into three categories: format-based methods, random and statistical generation, and linguistic methods.

Format-based methods use the physical formats of documents to hide messages. Some of them utilize spaces in documents to encode message data. For example, Alattar and Alattar [2004] proposed a method that adjusts the distances between words or text lines using spread-spectrum and BCH error-correction techniques, and Kim et al. [2003] proposed a word-shift algorithm that adjusts the spaces between words based on concepts of word classification and statistics of inter-word spaces. Some other methods utilize nondisplayed characters to hide messages, such as Lee and Tsai [2010] who encode message bits using special ASCII codes and hide the result between the words or characters in PDF files.

Random and statistical methods directly generate camouflage texts with hidden messages, which prevents attacks based on comparison with a known plaintext. For example, Wayner [Wayner 1992, 2002] proposed a method for text generation based on the use of context-free grammars and tree structures. A method available on a website [Spammimic.com 2010] extends the idea to generate fake spam emails with hidden messages, which are usually ignored by people.

Linguistic methods use written natural languages to conceal secret messages. For example, Chapman et al. [2001] proposed a synonym replacement method that generates a cover text according to a secret message using sentence models and a synonym dictionary. Bolshakov [2004] extended the synonym replacement method by using a specific synonymy dictionary and a very large database of collocations to create a cover text that is more believable to a human reader. Shirali-Shahreza and Shirali-Shahreza [2008] proposed a third synonym replacement method that hides data in a text by substituting words that have different terms in the UK and the US. Stutsman et al. [2006] proposed a method to hide messages in the noise that is inherent in natural language translation results without the necessity of transmitting the source text for decoding.

Recently, more and more collaborative writing platforms have become available, such as Google Drive, Office Web Apps, and Wikipedia. On these platforms, a huge number of revisions generated during the collaborative writing process are recorded, and many people work collaboratively. These platforms are therefore well suited to data hiding applications such as covert communication and secret data keeping, and a method for hiding secret messages on them is desirable. However, the aforementioned methods can only be applied to documents with single authors and single revision versions, meaning that they are not suitable for hiding data on collaborative writing platforms. Therefore, the goal of this article is to propose a new data hiding method which can hide data in documents created on collaborative writing platforms. In more detail, the proposed method simulates a collaborative writing process to generate a fake document, consisting of an article and its revision history, as a camouflage for message bit embedding. As shown in Figure 1, with the input of an article and a secret message, the proposed method utilizes multiple virtual authors to collaboratively

ACM Transactions on Multimedia Computing, Communications and Applications, Vol. 10, No. 2, Article 20, Publication date: February 2014.


issue common to them is how to find a natural way for simulating the writing process and how to obtain large-volume written data automatically. Hence, another goal of this article is to find a natural way to generate the revision history and to obtain large-volume collaborative writing data automatically. In recent years, several studies have analyzed the revision history data of Wikipedia articles for various natural language processing applications [Bronner and Monz 2012; Bronner et al. 2012; Dutrey et al. 2010; Erdmann et al. 2009; Max and Wisniewski 2010; Nelken and Yamangil 2008; Viégas et al. 2004], such as spelling correction, reformulation, text summarization, user edit classification, and multilingual content synchronization. In addition to being useful for these applications, the collaboratively written data in Wikipedia are also very suitable, as found in this study, for simulating the collaborative writing process for the purpose of data hiding, since Wikipedia is the largest collaborative writing platform nowadays.

Liu and Tsai [2007] proposed a data hiding method via Microsoft Word documents that uses the change tracking function; it embeds a secret message by mimicking a predraft document written by an author with inferior writing skill and encoding the secret message by choices of degenerations in the writing. Although they used three databases for degenerations, these are quite small compared with the database constructed from Wikipedia that we use for data embedding in this study. Note that a data hiding method can generally embed more bits by making use of a larger database. Furthermore, in Liu and Tsai [2007] a stego-document is generated by only two virtual persons, and the change tracking data are made by the one with the better writing skill. This scenario is insufficient for simulating a normal collaborative writing process. Therefore, in this article we propose a new framework that uses the revision-history data from Wikipedia and simulates real collaborative writing processes to hide secret messages. Four characteristics of collaborative writing processes are analyzed and utilized for message hiding: the author of each revision, the number of corrected word sequences, the content of the corrected word sequences, and the word sequences replacing the corrected ones. The proposed method is useful for covert communication or secure keeping of secret messages on collaborative writing platforms.

In the remainder of this article, the idea of the proposed method is described in Section 2. Detailed algorithms for collaborative writing database construction, secret message embedding, and secret message extraction are given in Section 3. In Section 4, some experimental results are presented to show the feasibility of the proposed method, and in Section 5, we discuss the security issue, followed by conclusions in Section 6.

2. BASIC IDEA OF PROPOSED METHOD

Collaborative writing means an activity in which more than one author creates an article cooperatively on a common platform. The purposes of establishing a collaborative writing platform include knowledge sharing, project management, and data keeping. Many collaborative writing platforms are available, such as Google Drive, Office Web Apps, and Wikipedia, which record the revisions generated during the collaborative writing process. In general, the recorded information of a revision includes: (1) the author of the revision, (2) the time the revision was made, and (3) the content of the revision.

To achieve the goal of creating camouflage revisions in collaborative writing for message hiding in this study, we analyze the existing revision-history data of articles on Wikipedia, which is currently the largest collaborative writing platform on the Internet. The aim is to get real and large collaborative writing data contributed by people all over the world and use them to create more realistic revision histories to enhance the resulting effect of data embedding. However, since the collaborative writing process is very complicated, it is hard to find a unified model to simulate it. Many different types of modifications may be made during the collaborative writing process [Bronner and Monz 2012; Dutrey et al. 2010], such as error corrections, paraphrasing, and factual edits. Moreover, different languages usually require different models because of their distinctive grammatical structures. Therefore, in order to get useful collaborative writing data automatically from the revision history data on Wikipedia without building models manually, and to obtain a method that can be applied to multiple languages, we assume that only word sequence corrections occur during a revision. Some characteristics of collaborative writing that are usable for data embedding under this assumption are identified and discussed in the following. It is noted that various text articles, not only in English but also in other languages, can be utilized as cover media in this study.

Fig. 2. Flow diagram of the proposed method.

The revision history of each article in Wikipedia is stored in a database, and one can recover any previous revision version of the article by an interface provided on the site. For this study, we have collected a large set of revision-history data from Wikipedia, and in the proposed method we mine this set to get useful information about word usages in the revisions. Then, we use the acquired information to simulate a collaborative writing process, starting from a cover article, and generate a stego-article with a sequence of revisions according to the secret message and a secret key. The resulting stego-document, including the stego-article and the revision history, looks like a work created by a group of real authors, achieving an effect of camouflage. In contrast, we call the original article with an initially-empty history a cover document in the sequel.

More specifically, the proposed method includes three main phases, as shown in Figure 2: (1) construction of a collaborative-writing database; (2) secret message embedding; and (3) secret message extraction. In the first phase, a large number of articles acquired from Wikipedia are analyzed and useful collaboratively written data about word usages are mined using a natural language processing technique. The mined data are then used to construct a database, called the collaborative writing database and denoted as DBcw subsequently. In the second phase, with the input of a cover document, a secret message, and a secret key, a stego-document with a fake revision history is generated by simulating a real collaborative writing process using DBcw. The revisions in the history are supposed to be made by multiple virtual authors, and the following characteristics of each revision are decided by the secret message: (1) the author of the revision; (2) the number of changed word sequences of the revision; (3) the changed word sequences in the revision; and (4) the word sequences selected from the collaborative writing database DBcw, which replace those of (3), called the replacing word sequences in the sequel. In the third phase, an authorized person who has the secret key can extract the secret message from the stego-document, while those who do not have the key cannot do so. They could not even realize the existence of the secret message because the secret message is disguised as the revision history in the stego-document. Note that the second and third phases can be applied on any collaborative writing platform, not just on Wikipedia; Wikipedia is merely utilized in the first phase to construct the collaborative writing database DBcw in this study.

Fig. 3. Illustration of used terms and notations.

3. DATA HIDING VIA REVISION HISTORY

In this section, the details of the proposed method for using the analyzed characteristics of collaborative writing to hide secret messages are described in three parts: collaborative writing database construction, secret message embedding, and secret message extraction.

3.1 Collaborative Writing Database Construction

To construct the aforementioned collaborative writing database DBcw, we mine the revision data collected from Wikipedia. There were about 4.2 million articles in the English Wikipedia in May 2013, which makes it a very large knowledge repository and a suitable source for constructing the database DBcw desired in this study. Specifically, we first downloaded part of the English Wikipedia XML dump with the complete revision histories of all the articles on August 3, 2011. Then, we mined the useful collaborative writing data from the downloaded data set under the assumption that only word sequence corrections occur during a revision.

As illustrated in Figure 3, each downloaded article P has a set of revisions {D0, D1, . . . , Dn} in its revision history, where a newer revision Di has a smaller index i, with D0 being the latest version of the article. For every two consecutive revisions Di and Di−1, we find all the correction pairs between Di and Di−1, each denoted as <sj, sj′>, where sj is a word sequence in revision Di that was corrected to become another, namely sj′, by the author of revision Di−1. Then, we collect all correction pairs so found to construct the database DBcw. For example, assume Di = “National Chia Tang University” and Di−1 = “National Chiao Tung University.” Then, the correction pair <s1, s1′> = <“Chia Tang”, “Chiao Tung”> is generated and included in DBcw.

Moreover, regarding the properties of correction pairs, it was observed that if the context of a word sequence sj in revision Di is the same as that of a word sequence sj′ in revision Di−1 (i.e., if the preceding word of sj is the same as that of sj′ and the succeeding word of sj is the same as that of sj′ as well), then <sj, sj′> is a correction pair. Accordingly, a novel algorithm is proposed in this study for automatically finding all of the correction pairs between every two consecutive revisions for inclusion in DBcw. The algorithm is an extension of the longest common subsequence (LCS) algorithm [Bergroth et al. 2000]. The details are described in Algorithm 1.


ALGORITHM 1: Finding Correction Pairs

Input: two consecutive revisions Di and Di−1 in the revision history of an article P.

Output: the correction pairs between Di and Di−1.

Stage 1: finding the longest common subsequence.

1) (Splitting revisions into word sets) Split Di and Di−1 into two sets of words, W = {w1, w2, . . . , wn} and W′ = {w1′, w2′, . . . , wm′}, respectively.

2) (Constructing a counting table by dynamic programming) Construct an n × m counting table T to record the lengths of the common subsequences of W and W′ as follows.

a) Initialize all elements in table T to be zero.

b) Compute the values of table T from the upper left and denote the currently-processed entry in T by T(x, y) with x = 1 and y = 1 initially.

c) If the content of wx is identical to that of wy′, then let T(x, y) = T(x − 1, y − 1) + 1; else, let T(x, y) = max(T(x − 1, y), T(x, y − 1)), where any entry with x = 0 or y = 0 is taken to be zero.

d) If x is smaller than n, then let x = x + 1 and go to Step 2c); else, if y is smaller than m, then let x = 1 and y = y + 1 and go to Step 2c); else, regard table T as being filled up and continue.

3) (Finding the longest common subsequence) Apply a backtracking procedure to table T, starting from T(n, m), to find the longest common subsequence L = {l1, l2, . . . , lt}, where each element li in L is a word common to W and W′.

Stage 2: finding the correction pairs.

4) (Finding the correction pairs) Starting from the first element l1 of L with the currently-processed element in L being denoted by lp, find the correction pairs as follows.

a) If the word sequence sj in Di with its preceding and succeeding words being lp and lp+1, respectively, is not empty and if the word sequence sj′ in Di−1 with the same context condition is not empty, either, then take <sj, sj′> as a correction pair.

b) Increment p by 1 and go to Step 4a) until p > t.
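The two stages of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: it assumes that revisions are split on whitespace, it pads the table with a zero row and column so that the boundary entries exist, and the function names `lcs_alignment` and `find_correction_pairs` are hypothetical.

```python
def lcs_alignment(old_words, new_words):
    """Return index pairs (i, j) of one longest common subsequence."""
    n, m = len(old_words), len(new_words)
    # table[x][y] = LCS length of old_words[:x] and new_words[:y];
    # row 0 and column 0 stay zero and act as the boundary.
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for x in range(1, n + 1):
        for y in range(1, m + 1):
            if old_words[x - 1] == new_words[y - 1]:
                table[x][y] = table[x - 1][y - 1] + 1
            else:
                table[x][y] = max(table[x - 1][y], table[x][y - 1])
    # Backtrack from the bottom-right corner to recover the matches.
    pairs = []
    x, y = n, m
    while x > 0 and y > 0:
        if old_words[x - 1] == new_words[y - 1]:
            pairs.append((x - 1, y - 1))
            x -= 1
            y -= 1
        elif table[x - 1][y] >= table[x][y - 1]:
            x -= 1
        else:
            y -= 1
    pairs.reverse()
    return pairs


def find_correction_pairs(older, newer):
    """Collect (original, corrected) word sequences that share a context.

    A pair is reported only when both gaps between two neighbouring
    common words are non-empty, as in Step 4a of Algorithm 1.
    """
    old_words, new_words = older.split(), newer.split()
    anchors = lcs_alignment(old_words, new_words)
    result = []
    for (i0, j0), (i1, j1) in zip(anchors, anchors[1:]):
        old_seq = old_words[i0 + 1:i1]
        new_seq = new_words[j0 + 1:j1]
        if old_seq and new_seq:
            result.append((" ".join(old_seq), " ".join(new_seq)))
    return result


print(find_correction_pairs("National Chia Tang University",
                            "National Chiao Tung University"))
```

As in the algorithm, a change at the very beginning or end of a revision has no preceding or succeeding common word and is therefore not reported, and pure insertions or deletions are skipped because one side of the pair would be empty.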

We run Algorithm 1 for every two consecutive revisions of all the articles downloaded from Wikipedia to obtain a large set of correction pairs and write them into the database DBcw. Furthermore, we count the total number Ncp of times that each correction pair CP is so obtained, and call the number Ncp the correction count of CP. The correction counts are also kept in the database DBcw for use in the proposed data hiding process.

In summary, we use a record in the database DBcw to keep the following information about a correction pair <sj, sj′>: (1) an original word sequence sj; (2) a new word sequence sj′; and (3) the correction count Ncp of the pair. Moreover, we define the chosen set of a word sequence s′ in DBcw to be the set that includes all the correction pairs <s, s′> with s′ as their identical new word sequence. For example, Table III (shown in Section 4) shows a chosen set of the word sequence “such as.”
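The record layout and the chosen set can be illustrated with a small in-memory sketch. It assumes the database is simply a counter keyed by (original, new) pairs; the function names and the sample pairs below are invented for illustration and are not the data of Table III.

```python
from collections import Counter


def build_database(correction_pairs):
    """Map each (original, new) correction pair to its correction count."""
    return Counter(correction_pairs)


def chosen_set(database, new_sequence):
    """Return {original: count} for every pair whose new sequence matches."""
    return {original: count
            for (original, new), count in database.items()
            if new == new_sequence}


# Hypothetical mined pairs, listed once per observed occurrence.
mined = [
    ("like", "such as"),
    ("like", "such as"),
    ("for example", "such as"),
    ("teh", "the"),
]
db = build_database(mined)
print(chosen_set(db, "such as"))
```

A production database would index records by the new word sequence so that the chosen set can be fetched without scanning every record; the linear scan here only keeps the sketch short.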

3.2 Secret Message Embedding

In the phase of message embedding with a cover document D0 as the input, the proposed system is designed to generate a stego-document D′ with consecutive revisions {D0, D1, D2, . . . , Dn} by producing a previous revision Di from the current revision Di−1 repeatedly until the entire message is embedded, as shown in Figure 3, where the direction of revision generation is indicated by the green arrows. The stego-document D′, including the revision history {D0, D1, D2, . . . , Dn}, is then kept on a collaborative writing platform, which may be Wikipedia or another. To simulate a collaborative writing process more realistically, we utilize the four aforementioned characteristics of revisions to hide the message bits in the revisions sequentially: (1) the author of the previous revision Di, (2) the number of changed word sequences in the current revision Di−1, (3) the changed word sequences in the current revision Di−1, and (4) the replacing word sequences in the previous revision Di, as described in the following.


Fig. 4. Illustration of encoding authors of revisions for data hiding.

3.2.1 Encoding the Authors of Revisions for Data Hiding. We encode the authors of revisions to hide message bits in the proposed method. For this, we first select a group of simulated authors, with each author being assigned a unique code a, called author a. Then, if the message bits to be embedded form a code aj, we assign author aj to the previous revision Di as its author, which embeds the message bits aj into Di. For example, assume that four authors are selected and each is assigned a unique code a, as shown in Figure 4. If the message bits aj to be embedded are “01,” then Jessy with author code “01” is selected to be the author of the revision Di. Moreover, every revision of D0 through Dn will be assigned an author according to the corresponding message bits, so an author may be assigned to conduct more than one revision or, conversely, no revision at all in the generated revisions.
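A minimal sketch of this author-encoding step is given below. It assumes fixed-length codes and a number of authors that is a power of two, so that every bit pattern maps to an author. Only the name Jessy and the code “01” come from the text; the other names, the message string, and the function names are placeholders.

```python
def assign_author_codes(authors):
    """Give each author a unique fixed-length binary code."""
    width = (len(authors) - 1).bit_length()  # bits carried per revision
    return {format(index, f"0{width}b"): name
            for index, name in enumerate(authors)}


def pick_author(codes, message_bits, pos):
    """Read the next code from the message and return (author, new position)."""
    width = len(next(iter(codes)))
    code = message_bits[pos:pos + width]
    return codes[code], pos + width


# Placeholder author list; only "Jessy" with code "01" follows the example.
codes = assign_author_codes(["Alice", "Jessy", "Bob", "Carol"])
author, next_pos = pick_author(codes, "0110", 0)
print(author, next_pos)
```

The receiver inverts the mapping: looking up the code of the recorded author of each revision yields the embedded bits. Padding for a message whose length is not a multiple of the code width is not handled in this sketch.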

3.2.2 Using the Number of Changed Word Sequences for Data Hiding. In the process of generating the previous revision Di from the current one Di−1, we select some word sequences in Di−1 and change them into other ones in Di. It is desired to use the number Ng of word sequences changed in this process as a message-bit carrier as well.

To implement this aim, we first place on the magnitude of Ng a limit Nc, taken to be the maximum allowed number of word sequences in Di−1 that can be changed to yield Di. This limitation makes the simulated step of revising Di−1 to become Di look more realistic because usually not very many words are corrected in a single revision. Next, we scan the word sequences in the text of the current revision Di−1 sequentially and search DBcw to find all the correction pairs <sj, sj′> with sj′ in Di−1. Then, we collect all sj′ in these pairs as a set Qr, which we call the candidate set of word sequences for changes in Di−1. Finally, we select Ng word sequences in Qr to form a set Qc such that the binary version of the number Ng is just the current message bits to be embedded.

But for this process of using Ng as a message-bit carrier to be feasible, several problems must be solved beforehand, including (1) the dependency problem, (2) the selection problem, (3) the consecutiveness problem, and (4) the encoding problem, as described in the following.

3.2.2.1 The Dependency Problem. We say that two word sequences in Di−1 are dependent if some identical words appear in both of them, and changing word sequences with this property in Di−1 will cause conflicts, leading to a dependency problem which we explain by an example as follows.

As shown in Figure 5(a), Di−1 = “you are not wrong, who deem that my days have been a dream” and Qr includes 11 word sequences denoted as q1 through q11, respectively. From Figure 5(a) we can see that the word sequences q2, q3, and q5 in Qr are dependent on the word sequence q4 because the intersection of each of the former three with the latter one is nonempty. If we correct q4 = “are not wrong” in Di−1 to be another, say “is right,” then the dependent word sequences q2, q3, and q5 in Di−1 cannot be selected and changed anymore because they include words in q4 which have already been changed and have disappeared. That is, no part of a changed word sequence can be changed again; otherwise, a dependency problem will occur.

Fig. 5. Illustration of the dependency problem. (a) Revision Di−1 and candidate set Qr where the dependent word sequences are surrounded by red squares. (b) Set I that corresponds to the set Qr for solving the dependency problem.

To avoid this problem in creating Di from Di−1, we propose a two-step scheme: (1) decompose Qr into a set of lists, I = {I1, I2, . . . , Iu}, with each list Ii including a group of mutually dependent word sequences (i.e., with every word sequence in each Ii being dependent on another in the same list) and every two word sequences in two different lists in I being independent of each other; and (2) select only word sequences from different lists in set I and change them to construct a new revision. The details of the first step are described in Algorithm 2. After the first step is applied to the set Qr, as shown in Figure 5(b), Qr is transformed into I = {(q1), (q2, q3, q4, q5), (q6), (q7), (q8), (q9, q10), (q11)}, where each pair of parentheses encloses a list of mutually dependent word sequences. With I ready, we can now select word sequences from distinct lists Ii in it, such as q1, q2, q6, and q9, to simulate changes of word sequences in revision Di−1 without causing the dependency problem.
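One way to picture the first step is to treat every candidate word sequence as a span of word positions in Di−1 and to merge spans that overlap, directly or through a chain of other spans. The sketch below follows that idea; it is an assumption about how the decomposition could be done, not the authors' procedure. The span boundaries are hypothetical, chosen only so that the result has the same shape as the set I in the example (the real q1 through q11 appear only in Figure 5).

```python
def dependency_lists(spans):
    """Group word sequences whose word spans overlap, directly or in a chain.

    spans maps a name to (start, end) word indices, with end exclusive.
    """
    ordered = sorted(spans.items(), key=lambda item: item[1])
    lists, current, current_end = [], [], None
    for name, (start, end) in ordered:
        if current and start < current_end:
            # Overlaps something already in the running list.
            current.append(name)
            current_end = max(current_end, end)
        else:
            if current:
                lists.append(current)
            current, current_end = [name], end
    if current:
        lists.append(current)
    return lists


# Hypothetical spans over "you are not wrong, who deem that my days ..."
spans = {
    "q1": (0, 1), "q2": (1, 2), "q3": (1, 3), "q4": (1, 4),
    "q5": (3, 4), "q6": (4, 5), "q7": (5, 6), "q8": (6, 7),
    "q9": (7, 9), "q10": (8, 9), "q11": (12, 13),
}
for group in dependency_lists(spans):
    print(group)
```

Sorting by start position lets a single pass find the groups, because a span can only join the running list if it starts before the furthest end seen so far.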

ALGORITHM 2: Secret Message Embedding

Input: a cover document D0 with an article to be revised collaboratively, a binary message M of length t, a secret key K, and a collaborative writing database DBcw constructed by Algorithm 1.

Output: a stego-document D′ with a revision history {D0, D1, D2, . . . , Dn}.

Stage 1: message preparation and parameter determination.

1) (Message composition) Affix an s-bit binary version of t to the beginning of M to compose a new binary message M′, where the value s is agreed on by the sender and the receiver beforehand.

2) (Message encryption) Randomize M′ to yield a new binary message M″ using K.

3) (Parameter determination) Use K to decide randomly both the number Na of authors and the limit Nc on the number Ng of word sequences to be changed in every revision.

4) (Author encoding) Use K to select Na authors from those who were involved in works conducted on the collaborative writing platform to form an author list Ia, and assign a unique na-bit code to each selected author in Ia.

Stage 2: revision generation and message embedding.

5) (Message embedding and revision generation) Generate the previous revision Di from the current revision Di−1 repeatedly while embedding the binary message M″ by running Algorithm 4, which was designed according to the schemes described in Sections 3.2.1 through 3.2.4 and is shown in the Online Appendix, with the inputs Di−1, M″, K, and Ia until all bits in M″ are embedded, where i = 1 initially.

6) If message M″ is not exhausted, then repeat the previous process to generate more revisions; otherwise, collect the finally-revised article and the history of all the revisions, D0 through Dn, as a stego-document D′; and take D′ as the output for use on the collaborative writing platform.
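Steps 1 and 2 of Algorithm 2 (length prefix and key-based randomization) can be sketched as below. The source does not say how the randomization is performed, so the XOR with a key-seeded pseudorandom bit stream is purely an assumption, as are the function names and the default s = 16. Python's `random` module is not cryptographically secure and stands in here only to keep the sketch self-contained.

```python
import random


def prepare_message(message_bits, key, s=16):
    """Prefix the s-bit message length, then mask the result with the key."""
    t = len(message_bits)
    if t >= 2 ** s:
        raise ValueError("message too long for an s-bit length field")
    prefixed = format(t, f"0{s}b") + message_bits
    rng = random.Random(key)  # assumed stand-in for the keyed randomizer
    return "".join(str(int(bit) ^ rng.getrandbits(1)) for bit in prefixed)


def recover_message(randomized_bits, key, s=16):
    """Undo the mask, read the length field, and cut the message out."""
    rng = random.Random(key)
    prefixed = "".join(str(int(bit) ^ rng.getrandbits(1))
                       for bit in randomized_bits)
    t = int(prefixed[:s], 2)
    return prefixed[s:s + t]


stego_bits = prepare_message("1011001", key=1234)
# Trailing filler bits, as may occur in the last revision, are ignored.
print(recover_message(stego_bits + "000", key=1234))
```

The length prefix is what lets the receiver stop at the right bit even when the final revision carries more bits than the message needs.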

3.2.2.2 The Selection Problem. It is desired to select word sequences for use in the simulated revisions according to their usage frequencies in DBcw, so that a more frequently corrected word sequence has a larger probability of being selected, forging a more realistic revision. For this aim, following [Liu and Tsai 2007; Wayner 1992], we adopt the Huffman coding technique to create Huffman codes uniquely for the word sequences in Qr according to their usage frequencies, and select word sequences with their codes identical to the message bits to be embedded. Specifically, according to a property of Huffman coding, a word sequence with a higher usage frequency receives a Huffman code that is no longer than that of a less frequent one. So a word sequence with a shorter Huffman code will have a larger probability of being selected, which, when the message bits are uniformly random, can be computed as (1/2)^L, where L denotes the number of bits of the code. That is, the use of Huffman coding indeed can achieve the aim of selecting word sequences in favor of those which are more frequently corrected in real cases.

Fig. 6. Illustration of the selection problem. (a) Huffman codes for the word sequences and the message bits that are encountered in the selection problem. (b) Dividing of the word sequences into groups to solve the selection problem.

But a problem arises here: after we select one word sequence qy in this way, qy cannot be used in the revision again for encoding an identical succeeding code in the message because qy has already been changed into another word sequence, causing a problem which we call the selection problem. This problem comes partially from the unique decodability property of Huffman coding. To illustrate this problem with the previous example of Figure 5, the Huffman codes for word sequences q1 through q11 are shown in Figure 6(a), and the message bit sequence to be embedded currently is “100100. . .” with the first six bits being just two repetitions of the code “100.” For this, we first select word sequence q4 and change it into another in the revision because the first three message bits to be embedded, “100,” are just the code for q4 (indicated by red color). After this, the next three message bits to be embedded are again the code “100” (the blue color of message bits in Figure 6(a)); however, the corresponding word sequence q4 cannot be selected any further because it has already been changed in the current revision version, and other word sequences cannot be selected, either, because their codes are not the same as the current message bits “100” to be embedded.

Fig. 7. Illustration of the consecutiveness problem. (a) An example for illustration of the consecutiveness problem. (b) Choosing splitting points randomly to solve the consecutiveness problem.

To solve this selection problem, suppose that, based on the use of a key, we randomly assign the word sequences in Qr consecutively into Ng groups G1 through GNg, each group including multiple, but distinct, word sequences, where Ng is the number of word sequences changed in Di−1. Then, starting from group G1, we apply Huffman coding to assign codes to all word sequences in the currently processed group Gk according to their usage frequencies, and select a word sequence in Gk with its assigned code identical to the leading message bits for use in the revision. We apply this step repetitively until all groups are processed. In this process, Huffman coding is applied to each Gk with word sequences distinct from those in the other groups, so that the selection problem of choosing a word sequence twice to change due to code repetition in the message will not happen any more. For example, as shown in Figure 6(b), Qr is divided into three groups: G1, G2, and G3, represented by red, blue, and green colors, respectively. Starting from G1, we assign Huffman codes to the elements in each group as shown in Figure 6(b). Then, q2 will be selected because the code of q2 is the same as the first three bits “100. . .” of the message to be embedded. Next, in G2, q8 will be selected because the message bits to be embedded are currently “100. . .” Finally, q11 in G3 will be selected because the current message bits to be embedded are “0. . .” In this way, the previous problem of being unable to embed the repetitive code “100” is solved automatically. In short, by decomposing randomly the candidate set Qr of word sequences for changes into groups and encoding the word sequences of each group with its own Huffman code, we can embed message bits sequentially by changing only one word sequence in each group without causing the selection problem.

However, the given process is insufficient; it must be modified so that word sequences which have mutual dependency relations are placed in the same group, in order to avoid the dependency problem discussed in Section 3.2.2.1. For this aim, instead of decomposing the word sequences in Qr directly into random groups as mentioned previously, we randomly divide the mutually independent list elements of I mentioned in Section 3.2.2.1 into Ng groups, where each group is denoted by GIk. Then, we take out all the word sequences in the lists in each GIk to form a new group of word sequences, denoted as Gk, resulting again in Ng groups of word sequences. For instance, for the previous example as shown in Figure 5, let Ng = 2 and suppose that the list elements of I are decomposed randomly into two groups: GI1 = {I1, I2, I3, I4} and GI2 = {I5, I6, I7}. Then, this procedure will yield the two groups G1 = {q1, . . . , q7} and G2 = {q8, q9, q10, q11}.

3.2.2.3 The Consecutiveness Problem. As shown in Figure 7(a), for example, the word sequence “increase in” in revision Di−1 is seen to become “improve themselves” in revision Di. This effect comes from two changes made during message embedding: the word sequence “increase” in Di−1 was changed to “improve” in Di, and the word sequence “in” in Di−1 was changed to “themselves” in Di. However, because the two words “improve” and “themselves” are consecutive in Di, the two changes might be considered as a single one during secret message extraction; that is, the word sequence “increase in” in Di−1 might be regarded as having been changed to “improve themselves” in Di. This ambiguity causes a problem: we cannot know whether a change from a word sequence in Di−1 to another in Di comes from one group or two, or equivalently, we cannot know the true number Ng of changed word sequences in Di−1, so we cannot later extract the embedded message bits correctly. We call this difficulty in message extraction the consecutiveness problem.

Obviously, word sequences in different groups must be made nonconsecutive in order to solve the problem. For this aim, the previously mentioned solution to the selection problem is modified further. Specifically, by the use of a key again, we randomly choose Ng − 1 lists, say Ii1, Ii2, . . ., IiN (with N = Ng − 1), of the set I for use as splitting points to divide I into Ng groups, with Ii1 through IiN not included in any of the Ng groups. For instance, let Ng = 2 for the previous example as shown in Figure 5 and

cause no consecutiveness problem.
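The splitting-point idea can be sketched as follows. This is only an illustration under stated assumptions, not the procedure of Algorithm 2: the function name `split_with_points`, the use of Python's `random.Random` seeded by the key, and the rejection-sampling loop are all hypothetical choices. The sketch also enforces the requirement, stated in Section 3.2.2.4, that every group keeps at least two lists.

```python
import random


def split_with_points(lists, ng, key):
    """Pick ng - 1 splitting points in `lists` and return (points, groups).

    Lists chosen as splitting points belong to no group, so word sequences
    taken from different groups are never adjacent in the text.
    """
    n = len(lists)
    # Feasibility check: after removing ng - 1 splitting points,
    # every one of the ng groups must still hold at least two lists.
    if ng < 1 or n - (ng - 1) < 2 * ng:
        raise ValueError("not enough lists for the requested number of groups")
    rng = random.Random(key)  # assumption: the key simply seeds a standard PRNG
    while True:
        # Draw candidate splitting points and retry until all groups are large enough.
        points = sorted(rng.sample(range(n), ng - 1))
        bounds = [-1] + points + [n]
        groups = [lists[a + 1:b] for a, b in zip(bounds, bounds[1:])]
        if all(len(g) >= 2 for g in groups):
            return points, groups


if __name__ == "__main__":
    # Seven mutually independent lists, as in the Figure 5 example, with Ng = 2.
    points, groups = split_with_points(["I1", "I2", "I3", "I4", "I5", "I6", "I7"], 2, "1234")
    print(points, groups)
```

Because the same key reproduces the same splitting points, the extraction side can rebuild the groups without any extra side information.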

3.2.2.4 The Encoding Problem. The remaining issue is how to determine the aforementioned number Ng of word sequences to be changed in Di−1. Although a limit Nc is set for Ng, the maximum number Nm of word sequences that can be selected in Di−1 may be even smaller than Nc. Therefore, we must compute Nm first before we can embed message bits according to the number Ng. After Nm is decided, Ng may then be taken to be a number not larger than Nm. The actual value of Ng is decided by the leading secret message bits, say nm of them. Consequently, we may assume that Nm satisfies the two constraints of 1) Nm = 2^nm and 2) 1 ≤ Nm ≤ Nc, where nm is a positive integer. In addition, in order to embed message bits by selecting a word sequence from a group Gk, the number of elements in Gk should not be smaller than two so as to embed at least one message bit by Huffman coding; hence, each group GIk mentioned previously should be created to include at least two elements of I. Accordingly, the maximum number Nm of word sequences to be changed in Di−1 can be figured out to satisfy the following formula:

$$\frac{N_I - (N_m - 1)}{N_m} \ge 2, \tag{1}$$

where NI is the number of elements in set I and Nm − 1 represents the aforementioned number of chosen splitting points. The inequality (1) can be reduced to

$$N_m \le \frac{N_I + 1}{3}. \tag{2}$$

Accordingly, we can compute Nm by the following rule:

$$N_m = \begin{cases} N_c, & \text{if } (N_I + 1)/3 > N_c; \\ 2^{\lfloor \log_2 [(N_I + 1)/3] \rfloor}, & \text{if } 1 \le (N_I + 1)/3 \le N_c. \end{cases} \tag{3}$$
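The rule of Equation (3), together with the two-step mapping from leading message bits to Ng used in this subsection, can be sketched as below. The function names `max_changes` and `changes_from_bits` are hypothetical, and the sketch assumes that the limit Nc is itself a power of two, which the text does not state explicitly.

```python
def max_changes(n_i, n_c):
    """Equation (3): the largest power of two not exceeding (n_i + 1) / 3, capped at n_c."""
    if n_i + 1 < 3:
        # (n_i + 1) / 3 < 1: too little content, a longer cover document is needed.
        raise ValueError("cover document too short for Equation (3)")
    if n_i + 1 > 3 * n_c:
        return n_c
    q = (n_i + 1) // 3                # floor((n_i + 1) / 3), at least 1 here
    return 1 << (q.bit_length() - 1)  # 2 ** floor(log2(q))


def changes_from_bits(bits, n_m):
    """Map the leading log2(n_m) message bits to Ng in the range 1..n_m.

    Returns (Ng, number of message bits consumed).
    """
    nm_bits = n_m.bit_length() - 1    # assumes n_m is a power of two
    value = int(bits[:nm_bits], 2) if nm_bits else 0
    return value + 1, nm_bits         # the increment avoids Ng = 0


if __name__ == "__main__":
    n_m = max_changes(7, 4)                  # NI = 7 and Nc = 4, as in the Figure 5 example
    print(n_m, changes_from_bits("101001", n_m))
```

Using integer arithmetic for the floor of the logarithm avoids floating-point rounding near exact powers of two.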

Furthermore, the content of Di−1 might be too little for Nm to be decided by Equation (3). In that case, we abandon the original cover document D0 from which Di−1 is generated, and use another longer cover document as the input. After the value of Nm is computed, we can then use the leading nm bits of the message to decide the number Ng of changed word sequences in Di−1 by two steps: (1) express the first nm message bits as a decimal number; and (2) increment the decimal number by one. The second step is required to handle the case that the first nm message bits are all zeros, which would otherwise lead to the undesired result of no word sequence being changed in the current revision. In this way, Ng becomes a carrier of nm message bits. For example, the number of elements of the set I for the previously mentioned example as shown in Figure 5 is NI = 7. Let Nc = 4. Because (NI + 1)/3 = (7 + 1)/3 ≈ 2.67 ≤ 4 = Nc, Nm is computed to be 2^⌊log2[(7 + 1)/3]⌋ = 2^1 = 2 according to Equation (3). So, nm = log2 Nm = 1. And if the secret message is “101001. . .,” then the number Ng of changed word sequences should be taken, according to these two steps, to be Ng = (1)_2 + (1)_10 = 2 because the first bit of the secret message is “1.”

3.2.3 Encoding the Changed Word Sequences in the Current Revision for Data Hiding. According to the previous discussions, we may assume that we have computed the number Ng of word sequences which should be changed in the current revision Di−1 according to the first nm bits of the secret message, and that we have classified the available word sequences in Qr into Ng groups, where each group Gk includes at least two word sequences and all word sequences in Gk are encoded by Huffman coding according to their usage frequencies. Specifically, the usage frequency of a word sequence sj is taken to be the summation of the correction counts of all the correction pairs in the chosen set of sj, which have sj as their common new word sequence. Then, starting from G1, we may select from each group Gk one word sequence with a Huffman code identical to the leading bits of the message to be embedded, achieving the goal of data hiding via changing word sequences in Di−1.

For example, assume that the usage frequencies of the word sequences in group G2 as shown in Figure 7(b) are: q9 = 100, q10 = 50, and q11 = 150; and the message is “10100. . . .” Then, the Huffman codes assigned to q9, q10, and q11 are “01,” “00,” and “1,” respectively; and so we select q11 to hide the first bit “1” of the message because the code of q11 is “1.”
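The group-wise Huffman encoding and selection can be illustrated with a short sketch. The function names are hypothetical, and the tie-breaking rule (a merged subtree is taken before a leaf of equal frequency, and the first node taken receives bit “0”) is an assumption chosen so that this small example yields the codes quoted above; the paper does not specify how ties are resolved.

```python
import heapq
from itertools import count


def huffman_codes(freqs):
    """Return {word_sequence: bitstring} for a dict of usage frequencies."""
    if len(freqs) < 2:
        raise ValueError("a group needs at least two word sequences")
    tick = count()
    # Heap entries: (frequency, is_leaf, insertion order, {symbol: partial code}).
    heap = [(f, 1, next(tick), {s: ""}) for s, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f0, _, _, low = heapq.heappop(heap)    # first node taken gets bit "0"
        f1, _, _, high = heapq.heappop(heap)   # second node taken gets bit "1"
        merged = {s: "0" + c for s, c in low.items()}
        merged.update({s: "1" + c for s, c in high.items()})
        heapq.heappush(heap, (f0 + f1, 0, next(tick), merged))
    return heap[0][3]


def select_by_prefix(codes, bits):
    """Return (word_sequence, bits_consumed) for the code that prefixes `bits`."""
    for symbol, code in codes.items():
        if bits.startswith(code):
            return symbol, len(code)
    return None, 0  # fewer message bits left than the matching code needs


if __name__ == "__main__":
    codes = huffman_codes({"q9": 100, "q10": 50, "q11": 150})
    print(codes, select_by_prefix(codes, "10100"))
```

Since a Huffman code is prefix-free, at most one word sequence in a group can match the leading message bits, so the selection is unambiguous on both the embedding and the extraction side.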

3.2.4 Encoding the Replacing Word Sequences in the Previous Revision for Data Hiding. Symmetrically, we may also use the replacing word sequences in Di to embed message data, where each replacing word sequence s′j in Di corresponds to a changed word sequence sj in Di−1, forming a correction pair <s′j, sj>. Specifically, recall that for each sj, we can find a chosen set of correction pairs from DBcw. From this set, we can collect all the original word sequences of the correction pairs as another set Qc, with each word sequence in Qc being appropriate for use as the replacing word sequence s′j. Let Qc = {s′1, s′2, . . . , s′w}. Then, to carry out message data hiding, we encode all s′j in Qc by Huffman coding according to their usage frequencies as well, and choose the one with its code identical to the leading message bits for use as the word sequence s′j replacing sj. Here the usage frequency of each s′j is the correction count of the correction pair <s′j, sj>. For example, Table III shows the chosen set of the word sequence “such as” with all included original word sequences already assigned Huffman codes according to their usage frequencies. Based on the table, if the message to be embedded currently is “01001001. . . ,” then we change the word sequence “such as” in the current revision Di−1 to the word sequence “for example” in the previous revision Di because the Huffman code for “for example,” namely, 0100, is the same as the first four bits of the secret message.
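As a small worked illustration of this step, the sketch below takes the Huffman codes listed in Table III for the chosen set of “such as” and shows both directions: choosing a replacing word sequence from the leading message bits, and recovering the bits from the word sequence observed in the previous revision. The names `SUCH_AS_CODES`, `embed`, and `extract` are hypothetical.

```python
# Huffman codes copied from Table III (chosen set of the new word sequence "such as").
SUCH_AS_CODES = {
    "like": "1", "including": "00", "for example": "0100", "of": "01110",
    "notably": "01101", "especially": "01011", "and": "011110",
    "specifically": "011001", "namely": "011000", "particularly": "0111111",
    "like the": "010100", "most notably": "010101", "include": "0111110",
}


def embed(bits, codes):
    """Pick the replacing word sequence whose code prefixes `bits`.

    Returns (word_sequence, remaining_bits); (None, bits) if no code fits.
    """
    for word, code in codes.items():
        if bits.startswith(code):
            return word, bits[len(code):]
    return None, bits


def extract(word, codes):
    """Recover the message bits carried by an observed replacing word sequence."""
    return codes[word]


if __name__ == "__main__":
    word, rest = embed("01001001", SUCH_AS_CODES)
    print(word, rest, extract(word, SUCH_AS_CODES))
```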

3.2.5 Secret Message Embedding Algorithm. In summary, we have demonstrated the usability of the aforementioned four characteristics of revisions for data hiding. Therefore, we can generate a stego-document with a forged revision history which looks like a realistic work written collaboratively by people. The details of the proposed message embedding process are described in Algorithm 2.

3.3 Secret Message Extraction

We can extract the secret message in the stego-document by a reverse version of the message embedding process described by Algorithm 2. The details are described by Algorithm 3 in the following.

4. EXPERIMENTAL RESULTS

A collaborative writing database DBcw was constructed by mining the huge collaborative writing data in Wikipedia using Algorithm 1 described previously. Note that this work is fully automatic and needs to be performed only once to build the database DBcw using Algorithm 1, where 3,446,959 different correction pairs were mined from 2,214,481 pages with 33,377,776 revisions in an English Wikipedia XML dump. The total size of the downloaded Wikipedia data is about 210.3 GB and the size of the mined data is just 888 MB. Moreover, some revisions might suffer from vandalism [Bronner and Monz 2012; Dutrey et al. 2010], and by the method proposed by Bronner and Monz [2012], such revisions were ignored if they had been reverted due to vandalism. Keywords in Wiki markup¹ were ignored as well. Table I shows the top 20 most frequently used correction pairs, where the one in the first place is the pair <“BCE”, “BC”> with a correction count of 19,430. Table II shows some correction pairs, each having more than one word either in its original word sequence or in its new word sequence. One of the correction pairs in this table is <“like”, “such as”> with a correction count of 773.

¹ http://en.wikipedia.org/wiki/Help:Wiki_markup.


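| Original word sequence | New word sequence | Correction count |
|---|---|---|
| color | colour | 15,356 |
| colour | color | 14,852 |
| The | the | 14,232 |
| a | an | 9,792 |
| it’s | its | 9,658 |
| is | was | 9,607 |
| an | a | 8,954 |
| was | is | 7,407 |
| a | the | 6,278 |
| are | is | 5,430 |
| colors | colours | 5,301 |
| colours | colors | 5,078 |
| CE | AD | 4,833 |
| AD | CE | 4,262 |
| image | Image | 4,259 |
| was | were | 3,924 |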

ALGORITHM 3: Secret message extraction

Input: a stego-document D including the revision history {D0, D1, D2, . . . , Dn} of a collaboratively revised article, the secret key K used in Algorithm 2, and the database DBcw constructed by Algorithm 1.

Output: a binary message M.

Stage 1: preparation.

1) (Parameter determination and author encoding) Use K to decide randomly the number Na of authors, the limit Nc on the number Ng of word sequences to be changed in every revision, and the list Ia of Na authors, in the same way as Steps 3 and 4 of Algorithm 2.

Stage 2: encrypted message extraction.

2) (Message bit extraction) For each revision Di−1 with i = 1 initially, extract the binary message m by running Algorithm 5, which is essentially a reverse version of Algorithm 4 and is shown in the Online Appendix, with the inputs Di−1, Di, K, and Ia; and append the result m to the encrypted bitstream, which is set empty initially, until i > n.

Stage 3: message content recovery.

3) (Message decryption) Decrypt the encrypted bitstream using K to get the decrypted bitstream.

4) (Message extraction) Express the first s bits of the decrypted bitstream in decimal form as t, and output the (s + 1)th through (s + t)th bits of the decrypted bitstream as the secret message M.
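Step 4 of Algorithm 3 amounts to reading a length-prefixed payload. A minimal sketch is given below; the value of s is not stated in this section, so it is left as a parameter, and the helper `frame_message` is a hypothetical mirror of Step 4 rather than a transcription of Algorithm 2.

```python
def frame_message(payload_bits, s):
    """Prefix the payload with its length, written as an s-bit binary number."""
    if len(payload_bits) >= 2 ** s:
        raise ValueError("payload too long for an s-bit length field")
    return format(len(payload_bits), "0{}b".format(s)) + payload_bits


def recover_message(bits, s):
    """Step 4: read t from the first s bits, then return the next t bits."""
    if len(bits) < s:
        raise ValueError("bitstream shorter than the length field")
    t = int(bits[:s], 2)
    if len(bits) < s + t:
        raise ValueError("bitstream shorter than the announced payload")
    return bits[s:s + t]  # bits after position s + t are padding and are ignored


if __name__ == "__main__":
    framed = frame_message("10110", 4)
    print(framed, recover_message(framed + "111", 4))
```

The length field lets the extractor discard any surplus bits produced by the last revision, since the revisions rarely carry exactly as many bits as the message needs.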

The constructed database DBcw contains 1,688,732 chosen sets of correction pairs where all the correction pairs in a chosen set have identical new word sequences, meaning that there are 1,688,732 word sequences which can be chosen and changed to other word sequences in the message embedding phase. Figure 8 shows the numbers of entries in the chosen sets with sizes from 2 to 40. Table III shows the content of a chosen set with the new word sequence “such as,” as well as the usage frequency and Huffman code for each original word sequence which may be replaced by “such as” during message embedding. From the table, we can see that the most frequently used original word sequence is “like,” so it has the shortest code “1” and the highest probability of being chosen.

Table II. Some Correction Pairs Each with More Than One Word Either in the Original Word Sequence or in the New Word Sequence

| Original word sequence | New word sequence | Usage frequency |
|---|---|---|
| Irish evil | Evil | 2,367 |
| Evil | Irish evil | 2,253 |
| US | United States | 1,094 |
| It’s | It is | 1,052 |
| due to the fact that | because | 359 |
| have been | were | 348 |
| will be | was | 903 |
| due to | because of | 933 |
| like | such as | 773 |
| didn’t | did not | 665 |
| passed away | died | 374 |
| doesn’t | does not | 489 |
| WWII | World War II | 395 |
| UK | United Kingdom | 599 |

Fig. 8. The number of entries of chosen sets with the size from 2 to 40.

Table III. An Example of a Chosen Set with the New Word Sequence “such as”

| Original word sequence | Usage frequency | Huffman code |
|---|---|---|
| like | 773 | 1 |
| including | 143 | 00 |
| for example | 39 | 0100 |
| of | 29 | 01110 |
| notably | 23 | 01101 |
| especially | 20 | 01011 |
| and | 16 | 011110 |
| specifically | 12 | 011001 |
| namely | 10 | 011000 |
| particularly | 10 | 0111111 |
| like the | 10 | 010100 |
| most notably | 10 | 010101 |
| include | 9 | 0111110 |

After the message embedding phase, the proposed system will generate a stego-document to be kept on a collaborative writing platform, and a user can later extract the embedded message from it using a key. Each generated stego-document including its revision history was kept on a Wiki site which was constructed in this study using the free software MediaWiki.² Note that although the pre-selected collaborative writing platform here is the constructed Wiki site, the proposed method can be used on other collaborative writing platforms as well. As an example, with a cover article as shown in Figure 9(a), the message “Art is long, life is short,” and the key “1234” as inputs into Algorithm 2, a stego-article as shown in Figure 9(c) together with a revision history as shown in Figure 9(b) was generated by the proposed method. We can see from Figure 9(b) that five revisions have been created in order to embed the secret message. Figure 9(d) and Figure 9(e) show extracts of the differences between the two newest revisions, where the words in red in Figure 9(d) were corrected to be those in red in Figure 9(e) by the author “Natalie.” Figure 9(f) and Figure 9(g) show, respectively, the messages extracted by Algorithm 3 using a right key and a wrong one. These results show that when a user uses a wrong key, the system will return a random string as the message extraction result.

² http://www.mediawiki.org/wiki/MediaWiki.


Fig. 9. An example of generated stego-documents on the constructed Wiki site with input secret message “Art is long, life is short.” (a) Cover document. (b) Revision history. (c) Stego-document. (d) Revision preceding that of (e), with the words in red being those corrected to the new words shown in red in (e). (e) Newest revision of the created stego-document. (f) Correct secret message extracted with the right key “1234.” (g) Wrong secret message extracted with a wrong key “123.”

Table IV. The Information of Experimental Documents

| Document | Characters | Words | Sentences |
|---|---|---|---|
| Document 1 | 2,419 | 641 | 80 |
| Document 2 | 4,762 | 956 | 45 |
| Document 3 | 10,128 | 2,211 | 121 |
| Document 4 | 11,215 | 2,617 | 86 |
| Document 5 | 26,591 | 6,180 | 631 |
| Document 6 | 60,349 | 14,306 | 1,603 |

A series of experiments with different parameters has also been conducted to measure quantitatively the data embedding capacity of the proposed method using six cover documents as inputs. Since the data embedding capacity depends on the secret message content, which influences the selections of authors and changed word sequences for each revision, we ran the experiments for each document ten times using different messages as inputs, and recorded the average of the resulting data embedding capacities. The parameters of the six cover documents are shown in Table IV. For example, document 1 has 2,419 characters, 641 words, and 80 sentences; document 3 has 10,128 characters, 2,211 words, and 121 sentences.

In these experiments, we first selected the replacing word sequences for a revision to be the top n most frequently used ones in the database DBcw, where n = 2, 4, 8, 16, 32. Figure 10(a) shows the resulting data embedding capacities, from which we can see that the more replacing word sequences are selected, the more message bits are embedded. This result comes from the fact that when more replacing word sequences are available, the constructed Huffman codes become longer.

We have also conducted experiments using different numbers of revisions (1, 2, 4, 8) in the generated stego-documents to see the resulting data embedding capacities. Figure 10(b) shows the results, which indicate that when the number of revisions in the stego-document is larger, more message bits can be embedded, as expected. This means that if we want to embed a larger secret message, more revisions should be generated. Yet, on a Wiki site, each revision is stored as its original text without any compression. Thus, a larger storage space is required to store more generated revisions when the secret message is longer. However, one can solve this issue by comparing two adjacent revisions and storing only the difference between them, where this comparison function may be provided by other collaborative writing platforms if desired. Furthermore, we can also see from Figures 10(a) and 10(b) that when a cover document has a larger size, the resulting data embedding capacity will be larger as well. Thus, if we want to embed more data, we have to choose a larger cover document.

Fig. 10. The embedding capacities. (a) Embedding capacities of documents with chosen sets of different sizes. (b) Embedding capacities of documents with different numbers of revisions.

Fig. 11. Comparison of embedding capacities yielded by Liu and Tsai [2007] and the proposed method using different numbers of revisions.

Figure 11 shows a comparison of the embedding capacities yielded by the proposed method with those yielded by Liu and Tsai’s method [2007]. We can see from Figure 11 that when the number of revisions of the proposed method is equal to one, the embedding capacity of the proposed method is very close to that yielded by Liu and Tsai [2007]. Note that not every word sequence in the current revision Di−1 can be utilized for data embedding in the proposed method, because we limit the maximum number of corrected word sequences in a revision. Thus, when the number of revisions is just one, the embedding capacity of the proposed method may not be better than that of Liu and Tsai [2007], which allows the use of every word for message embedding. However, when the number of revisions is equal to or greater than two, the embedding capacities of the proposed method are much larger.

Like the methods proposed by Bronner and Monz [2012] and Bronner et al. [2012], which can be utilized for multiple languages, we have applied Algorithm 1 to two adjacent revisions of a Chinese document and obtained the correction pairs for them successfully, where the two revisions are shown in Figure 12. Note that since Chinese has no explicit word segmentation mark, we cannot use spaces to split an article in Chinese into words. Therefore, each Chinese character was treated directly as a word to solve the issue. Figure 12 shows the found correction pairs between the two revisions, in which, for instance, one of the found correction pairs is , , where both word sequences in the pair mean the same as “achieve” in English.
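The tokenization choice described above can be illustrated with a small sketch. It does not reproduce Algorithm 1: Python's standard `difflib.SequenceMatcher` is used here as a stand-in differencing method, the function names are hypothetical, and both example sentences are made up for illustration rather than taken from Figure 12.

```python
from difflib import SequenceMatcher


def tokenize(text, per_character=False):
    """Split on spaces, or treat every non-space character as one word."""
    if per_character:
        return [ch for ch in text if not ch.isspace()]
    return text.split()


def correction_pairs(old, new, per_character=False):
    """Return (original, new) word-sequence pairs that differ between two revisions."""
    a = tokenize(old, per_character)
    b = tokenize(new, per_character)
    joiner = "" if per_character else " "
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return [
        (joiner.join(a[i1:i2]), joiner.join(b[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag == "replace"  # pure insertions and deletions are skipped in this sketch
    ]


if __name__ == "__main__":
    print(correction_pairs("animals like cats and dogs", "animals such as cats and dogs"))
    # Hypothetical Chinese sentences, compared character by character.
    print(correction_pairs("今天天气很好", "今天天气真好", per_character=True))
```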


Fig. 12. An example to show the interoperability of the proposed method, which can be applied on Chinese articles.

Table V. Comparison of Methods for Data Hiding via Texts

| Method | Utilized database | Database construction | Embedding capacity | # of revisions | # of authors | Usage frequencies of word sequences |
|---|---|---|---|---|---|---|
| Chapman et al. [2001] | Synonym dictionary | Manually | Limited | 1 | 1 | − |
| Bolshakov [2005] | Synonym dictionary | Manually | Limited | 1 | 1 | − |
| Shirali-Shahreza and Shirali-Shahreza [2008] | Synonym dictionary | Manually | Limited | 1 | 1 | − |
| Liu and Tsai [2007] | Synonym dictionary + small collaborative writing database | Mainly manually | Limited | 2 | 2 | Simulated |
| Proposed method | Large collaborative writing database | Automatically | Unlimited | Unlimited | Many | Real data |

Moreover, to present the contributions made by the proposed method, we have compared it with several other methods for data hiding via texts [Bolshakov 2005; Chapman et al. 2001; Shirali-Shahreza and Shirali-Shahreza 2008; Liu and Tsai 2007] as shown in Table V. First, the synonym replacement methods [Bolshakov 2005; Chapman et al. 2001; Shirali-Shahreza and Shirali-Shahreza 2008] utilize synonym dictionaries to embed messages, where the synonym dictionaries were usually built manually by language experts. The embedding capacities of these methods are limited, since only those word sequences in the cover document which exist in the synonym dictionary can be utilized for data embedding. Also, since they replace the word sequences in a cover document with their synonyms, the resulting stego-document is usually a worse version of the original cover document due to the possible losses of the original meanings in the replacements. Furthermore, the usage frequencies of the corresponding synonyms of a word sequence are not analyzed in these methods. Secondly, the change tracking method proposed by Liu and Tsai [2007] utilizes synonym dictionaries and a small collaborative writing database with only 7,581 chosen sets to embed messages, where the synonym dictionaries were built manually as well. Also, the embedding capacity of this method is limited, since only two revisions are generated by two authors and only the word sequences in the cover document are degenerated for data embedding. Moreover, the usage frequencies of word sequences in this method are just simulated ones created by using the Google SOAP Search API.

In summary, several merits of the proposed method can now be pointed out: (1) the database of the proposed method is constructed automatically from Wikipedia, which is the largest collaborative writing platform on the Internet; therefore, the resulting stego-document generated by the proposed method is more realistic than that generated by the other four methods [Bolshakov 2005; Chapman et al. 2001; Shirali-Shahreza and Shirali-Shahreza 2008; Liu and Tsai 2007]; (2) the database constructed by the proposed method is much larger than that by Liu and Tsai [2007], with 1,688,732 chosen sets in the former and only 7,581 in the latter; (3) the usage frequency of each correction pair used in the proposed method is a real parameter obtained by mining the collaborative writing data found on Wikipedia, but that of Liu and Tsai [2007] is just a simulated one created by using the Google SOAP Search API; and (4) the proposed method can simulate the collaborative writing process conducted by multiple authors and revisions, but Liu and Tsai [2007] can only generate one pre-draft version of a cover text, simulating the work of two authors. Thus, to the best of our knowledge, this is the first work that can simulate the real collaborative writing process with multiple authors and revisions by mining the revision histories on Wikipedia or similar platforms and using the characteristics in the collaborative writing process effectively for message embedding.

Furthermore, to illustrate the usability of the proposed method in the real world, we point out that one can build a collaborative writing platform, such as a Wiki site, for use by a school, company, or government and then implement the proposed method on this platform. For example, the teachers of a school, especially a large one, may establish a big wiki site with many documents for general teaching, administration, and communication uses, which are accessible by teachers, staff members, students, parents, etc. Sometimes, a teacher might want to communicate with a student’s parents in a secret way. Then, the wiki site may be used as a platform for such covert communication of messages. In addition, the teacher may keep secret records of the students on the wiki site using the data embedding schemes provided by the proposed method. That is, a collaborative writing platform can not only let people work collaboratively but also let people hide messages in the documents existing on this platform for applications of covert communication and secret data keeping.

5. SECURITY CONSIDERATION

5.1 Camouflage

In the proposed method, we collected collaborative writing data in Wikipedia written by real people to construct the database DBcw for use in message embedding. Therefore, the stego-document created using DBcw is more robust to attacks by malicious users since the stego-document looks like a realistic work completed by multiple virtual authors on a collaborative writing platform. These authors do not actually edit these revisions and so are regarded as virtual authors. These virtual authors are created to simulate real-world authors and are used to embed messages, which avoids the problem of involving real authors who might leak the secret. Also, to increase the realism of the created stego-document, the content of the corrected word sequences in a revision and the word sequences replacing the corrected ones are selected according to the real usage frequencies mined from the collaborative writing data in Wikipedia. Thus, the statistical property of simulated corrections in the generated stego-document is close to that of a real one.

Moreover, in order to increase the camouflage effect of the stego-document created by the proposed method, two additional ways can be adopted. The first is to change the time of editing for each generated revision in a stego-document to make it fit the model of revision time in reality, such as the analyzed patterns of revision history mentioned in Viégas et al. [2004]. This can be achieved by using a key to select randomly a time for each revision in a possible time duration between the related pair of adjacent revisions. The second way is about the selection of authors for data embedding. If an author makes more realistic corrections in his/her revision history of creating a stego-document, then inclusion of him/her as one of the collaborative authors will be less conspicuous to adversaries. This idea can be implemented simply by pre-generating some revision data of virtual authors who looks

more types of corrections, such as paraphrases and factual edits, to mislead the adversary, where these extra corrections will be ignored during message extraction.

5.2 Randomness

According to Kerckhoffs’s principle [Kerckhoffs 1883], a system should remain secure even if an adversary understands the system but does not have the secret key; such an adversary should obtain no information about the embedded message. To enhance the security of the proposed technique by use of the key, some randomization measures are adopted in the phases of secret message embedding and secret message extraction: (1) randomization of the bits of the secret message to be embedded by encryption; (2) randomization of the parameters and author encoding, including the number of authors, the maximum allowed number of word sequences changed in the revision, the author list, and the author codes; and (3) randomization of the selections of the splitting points for each revision.

More specifically, in the first measure, the secret message is randomized through encryption by using the key, where the encryption method we adopted is AES-256. The Advanced Encryption Standard (AES) is one of the most popular ciphers and provides very high security; the publicly known attacks up to now have all been shown to be computationally infeasible [Bogdanov et al. 2011; Biryukov and Khovratovich 2009]. In the second measure, the parameters (the number of authors and the maximum allowed number of word sequences changed in the revision) and the author encoding (the author list and the author code for each author) are decided by the key and some pseudo-random number generators. In the third measure, for each revision, Ng − 1 lists of the set I for use as the splitting points are selected randomly by the key and a pseudo-random number generator. Let the resulting stego-document D include the revision history {D0, D1, D2, . . . , Dn} with Na authors, the size of the set I of word sequences for selection in each revision Dk be Ik, and the number of word sequences changed in each revision Dk be Ngk. Then, an adversary who does not have the key needs to execute Algorithm 3 for all possible combinations of splitting points of the revisions and the author codes, and observe the result to check the correctness of the encrypted secret message. The time complexity for this work is of the order of

$$(N_a!) \times \left[ \prod_{k=0}^{n-1} C(I_k, N_{gk} - 1) \right],$$

which is a very big number, where C(a, b) means the combination of a things taken b at a time without repetition. Moreover, it is very hard for an adversary to decide which result yielded by the algorithm is correct because the secret message is encrypted by AES-256 and looks like random noise. Therefore, the proposed method is expected to be secure for secret message hiding.
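To give a feeling for the size of this search space, the count can be evaluated directly. The sketch below assumes the per-revision terms are multiplied, matching the phrase “all possible combinations” across revisions; the function name `brute_force_trials` and the sample numbers are hypothetical and are not taken from the experiments.

```python
from math import comb, factorial, prod


def brute_force_trials(n_authors, list_counts, changed_counts):
    """Evaluate (Na!) * product over revisions of C(I_k, Ng_k - 1).

    list_counts[k] is the size I_k of the set I for revision D_k, and
    changed_counts[k] is the number Ng_k of word sequences changed in it.
    """
    if len(list_counts) != len(changed_counts):
        raise ValueError("one (I_k, Ng_k) pair is needed per revision")
    splits = prod(comb(i_k, ng_k - 1) for i_k, ng_k in zip(list_counts, changed_counts))
    return factorial(n_authors) * splits


if __name__ == "__main__":
    # Hypothetical stego-document: 5 authors, 5 revisions, 40 lists and 4 changes each.
    print(brute_force_trials(5, [40] * 5, [4] * 5))
```

The count grows quickly with both the number of revisions and the size of I, and each trial still yields only an AES-encrypted bitstream that has to be recognized as correct.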

Additionally, the collaborative writing database may be available to adversaries since they can reconstruct it by using the same Wikipedia data and the same algorithms as those proposed in this study. To increase the security against this type of attack, one additional measure is to use the key to decide a subset of a chosen set and select a word sequence from the subset. Then, only authorized users with the key can know the correct subset of the chosen set, and an adversary cannot.

5.3 Possible Extensions for the Proposed Method Using Natural Language Processing Methods

To allow the collaborative writing database to be constructed automatically and the proposed method to be generalized for multi-language uses, the four characteristics of collaborative writing mentioned previously have been analyzed based on the assumption that only word sequence corrections will be made in a revision. However, the real collaborative writing process is much more complicated and language-dependent, so data hiding via collaborative writing is still worth intensive research.

Many possible methods in natural language processing [Bronner and Monz 2012; Bronner et al. 2012; Dutrey et al. 2010] may be applied to extend the proposed method. For example, some original word sequences in an input cover document may be polysemous. Therefore, selecting word sequences from DBcw by the proposed method to replace such polysemous word sequences might produce a meaningless context. One possible way out is to analyze the distributional similarity of word sequences [Madnani and Dorr 2010] to find appropriate replacing word sequences that do not cause this problem, where distributional similarity means the similarity in the meanings of those words that have the same contexts in documents. Moreover, we can also build language models [Bronner and Monz 2012; Bronner et al. 2012; Dutrey et al. 2010], such as dependency trees used in grammatical analysis, to embed messages during revision generations based on the model.

6. CONCLUSIONS

A new data hiding method via the creation of fake collaboratively written documents on collaborative writing platforms has been proposed. An input secret message is embedded in the revision history of the resulting stego-document through a simulated collaborative writing process with multiple virtual authors. With this camouflage, people will take the stego-document as a normal collaborative writing work and are not expected to realize the existence of the hidden message. To generate simulated revisions more realistically, a collaborative writing database was mined from Wikipedia, and the Huffman coding technique was used to encode the mined word sequences in the database according to their usage statistics. Four characteristics of article revisions were identified, including the author of each revision, the number of corrected word sequences, the content of the corrected word sequences, and the word sequences replacing the corrected ones. Related problems arising in utilizing these characteristics for data hiding have been solved, resulting in an effective multiway method for hiding secret messages in the revision history. Moreover, because the word sequences used in the revisions were collected from a great many real people's writings on Wikipedia, and because Huffman coding based on usage frequencies is applied to encode the word sequences, the resulting stego-document is more realistic than those produced by other text steganography methods, such as word-shift methods [Kim et al. 2003], nondisplayed-character-based methods [Lee and Tsai 2010], synonym replacement methods [Bolshakov 2005; Chapman et al. 2001; Shirali-Shahreza and Shirali-Shahreza 2008], etc. The experimental results have shown the feasibility of the proposed method. Future work may be directed to analyzing more characteristics of collaborative writing works or establishing appropriate language models [Bronner and Monz 2012; Bronner et al. 2012; Dutrey et al. 2010] for more effective data hiding or other applications.

ACKNOWLEDGMENT

The authors would like to thank the associate editor, Prof. Pradeep Atrey, and the anonymous reviewers of this article for their helpful comments as well as suggestions for improving the organization of the article and the possible ways for improving and extending the proposed method.

REFERENCES

A. M. Alattar and O. M. Alattar. 2004. Watermarking electronic text documents containing justified paragraphs and irregular line spacing. In Proceedings of the Security, Steganography, and Watermarking of Multimedia Contents VI. E. J. Delp III and P. W. Wong, Eds., SPIE, vol. 5306, 685–695.

K. Bennett. 2004. Linguistic steganography: Survey, analysis, and robustness concerns for hiding information in text, CERIAS Tech. rep. 2004-13, Purdue Univ., West Lafayette, IN.