
Chapter 4 A New Data Hiding Technique via Revision History Records on Collaborative Writing Platforms

4.3 Data Hiding via Revision History

4.3.3 Secret message extraction

We can extract the secret message from the stego-document by a reverse version of the message embedding process described in Algorithm 4.2. The details are described as an algorithm in the following.

Algorithm 4.3. Secret message extraction.

Input: a stego-document D′ including revision history {D0, D1, D2, …, Dn} kept on a pre-selected collaborative writing platform, the secret key K used in Algorithm 4.2, and the database DBcw constructed by Algorithm 4.1.

Output: a binary message M of length t.

Stage 1 – parameter determination.

Step 1. Use key K to decide two parameters Na and Nc, and construct a list Ia of Na uniquely encoded authors, in the same way as described in Step 4 of Algorithm 4.2.
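This section does not restate how K determines Na and Nc; the only requirement is that the derivation be deterministic, so that the extractor of Algorithm 4.3 reproduces exactly the values and the author list used by the embedder of Algorithm 4.2. Below is a minimal sketch of one possible derivation, in which the helper name derive_parameters, the author pool, and the value ranges of Na and Nc are all illustrative assumptions rather than the actual construction of Step 4 of Algorithm 4.2.

    import hashlib
    import random

    def derive_parameters(key, author_pool, max_authors=16, max_changes=8):
        # Hypothetical sketch: seed a PRNG with the shared key so that the
        # embedder (Algorithm 4.2) and the extractor (Algorithm 4.3) compute
        # identical values of Na and Nc and the same author list Ia.
        rng = random.Random(hashlib.sha256(key.encode("utf-8")).digest())
        na = rng.randint(2, min(max_authors, len(author_pool)))  # number of authors Na
        nc = rng.randint(1, max_changes)                         # max corrected word sequences Nc
        ia = rng.sample(author_pool, na)                         # Na uniquely encoded authors
        return na, nc, ia

    # Example: both sides call this with the same key and the same author pool.
    # na, nc, ia = derive_parameters("1234", ["Alice", "Bob", "Carol", "Natalie"])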

Stage 2 – message data extraction.

Step 2. (Message extraction) With the bitstream M′′ set to be empty initially, for each revision Di–1 with i = 1 initially, extract the embedded message bits and append them to M′′ by the following steps, until i > n.

Stage 2.1 – extracting message data from the author.

(a) (Extracting message bits from the author code) Find the author aj of the revision in the list Ia, and append the code of aj to M′′.

(b) (Finding candidate word sequences) Perform the corresponding step of Algorithm 4.2 to get the candidate set Qr of word sequences for changes in Di–1.

(c) (Finding the independent word sequence lists in Qr) Perform Step 5.c of Algorithm 4.2 to get the set I of lists of independent word sequences from Qr.

(d) (Finding the correction pairs between Di and Di–1) Perform Algorithm 4.1 with the previous revision Di and the current revision Di–1 as the input to: (i) find the set CP of correction pairs between them; and (ii) express the number of elements in set CP, which is also the total number Ng of the changed word sequences in Di–1, as an nm-bit binary number and append it to M′′.

(g) (Choosing splitting points) Choose randomly Ng – 1 elements of I as splitting points using the key K, in the same way as described in Step 5.e of Algorithm 4.2.

(h) (Classifying word sequences into independent sets) Perform Step 5.f of Algorithm 4.2 to classify the elements of I into Ng groups.

(i) (Choosing changed word sequences for message extraction) For each group Gk created in the last step with k = 1 initially, encode its word sequences by Huffman coding according to their usage frequencies, and for each word sequence in Gk, check whether it appears as a new word sequence wn in a correction pair of CP; if so, append the Huffman code of that word sequence to M′′.

(j) (Collecting the candidate original word sequences) Find the chosen set whose new word sequence is wn in the database DBcw, and collect all the original word sequences in the correction pairs of the chosen set as a set Qc'.

(k) (Extracting the code of replacing word sequences) Encode the word sequences in Qc' by Huffman coding according to their usage frequencies, and for each word sequence sj in Qc', check whether sj is identical to an original word sequence wo in CP; if so, then append the Huffman code of sj to M′′ and go to Step 2.i with k increased by one until k > Ng; else, continue to check the next word sequence in Qc'.

Stage 3 – message decryption and extraction.

Step 3. (Message decryption) Decrypt the bitstream M′′ to get M' using the key K.

Step 4. (Message extraction) Express the first s bits of M' in decimal form as t and output the (s + 1)th through (s + t)th message bits of M' as the secret message M.
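To make Stage 3 concrete, the following sketch unpacks M′′ in the way just described, under two illustrative assumptions not fixed by this section: the cipher used with K is taken to be a simple hash-based XOR bit stream, and the header length s is treated as a pre-agreed constant; keystream and extract_message are hypothetical helper names.

    import hashlib

    def keystream(key, nbits):
        # Hypothetical key-derived bit stream; the actual cipher used with K
        # in Algorithm 4.2 is not restated in this section.
        bits = ""
        counter = 0
        while len(bits) < nbits:
            block = hashlib.sha256(f"{key}:{counter}".encode("utf-8")).digest()
            bits += "".join(f"{byte:08b}" for byte in block)
            counter += 1
        return bits[:nbits]

    def extract_message(m2, key, s=16):
        # Step 3: decrypt M'' into M'; Step 4: read the length header and payload.
        ks = keystream(key, len(m2))
        m1 = "".join("1" if a != b else "0" for a, b in zip(m2, ks))  # bitwise XOR
        t = int(m1[:s], 2)      # the first s bits give the message length t
        return m1[s:s + t]      # the (s+1)-th through (s+t)-th bits form M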

4.4 Experimental Results

A collaborative writing database DBcw was constructed by mining the huge amount of collaborative writing data in Wikipedia using Algorithm 4.1 described previously.

Note that this is a totally automatic process and needs to be performed only once to build the database DBcw using Algorithm 4.1, by which 3,446,959 different correction pairs were mined from 2,214,481 pages with 33,377,776 revisions in an English Wikipedia XML dump. The total size of the downloaded Wikipedia data is about 210.3 GB, while the size of the mined data is just 888 MB. Moreover, some revisions might suffer from vandalism [57], [59]; following the method proposed by Bronner and Monz [57], such revisions were ignored if they had been reverted due to vandalism.

Keywords in Wiki markup1 were ignored as well. Table 4.1 shows the top 20 most frequently used correction pairs, where the one in the first place is the pair <“BCE”, “BC”> with a correction count of 19,430. Table 4.2 shows some correction pairs, each having more than one word either in its original word sequence or in its new word sequence. One of the correction pairs in this table is <“like”, “such as”> with a correction count of 773.

1 http://en.wikipedia.org/wiki/Help:Wiki_markup

Table 4.1. Top twenty most frequently used correction pairs.

The mined correction pairs were further organized into 1,688,732 chosen sets, where all the correction pairs in a chosen set have identical new word sequences, meaning that there are 1,688,732 word sequences which can be chosen and changed to other word sequences in the message embedding phase. Figure 4.11 shows an illustration of the numbers of entries in the chosen sets with sizes from 2 to 40. Table 4.3 shows the content of a chosen set with the new word sequence “such as,” as well as the usage frequency and Huffman code for each original word sequence which may be replaced by “such as” during message embedding. From the table, we can see that the most frequently used original word sequence is “like,” so it has the shortest code “1” and the largest probability of being chosen.
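The following sketch illustrates how such Huffman codes can be built from the usage frequencies of the original word sequences in a chosen set; except for the count 773 of <“like”, “such as”> quoted above, the frequencies below are made-up placeholders rather than values mined from Wikipedia, and huffman_codes is an illustrative helper, not the exact procedure of Algorithm 4.2.

    import heapq
    from itertools import count

    def huffman_codes(freqs):
        # Build Huffman codes for a chosen set: higher frequency -> shorter code.
        tiebreak = count()                       # keeps heap entries comparable
        heap = [(f, next(tiebreak), {w: ""}) for w, f in freqs.items()]
        heapq.heapify(heap)
        while len(heap) > 1:
            f0, _, c0 = heapq.heappop(heap)      # two least frequent subtrees
            f1, _, c1 = heapq.heappop(heap)
            merged = {w: "0" + code for w, code in c0.items()}
            merged.update({w: "1" + code for w, code in c1.items()})
            heapq.heappush(heap, (f0 + f1, next(tiebreak), merged))
        return heap[0][2]

    # Chosen set for the new word sequence "such as"; only the frequency of
    # "like" (773) comes from the text, the others are placeholders.
    chosen_set = {"like": 773, "for example": 120, "e.g.": 95, "including": 60}
    print(huffman_codes(chosen_set))   # "like" receives the shortest code "1"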

Table 4.2. Some correction pairs each with more than one word either in the original word sequence or in the new word sequence.


Figure 4.11. The numbers of entries of the chosen sets with sizes from 2 to 40.

Next, experiments were conducted to generate stego-documents by the proposed method and to extract the embedded message from each of them using a key. Each generated stego-document, including its revision history, was kept on a Wiki site which was constructed in this study using the free software MediaWiki2. Note that though the pre-selected collaborative writing platform here is the constructed Wiki site, the proposed method can be used on other collaborative writing platforms as well. As an example, with a cover article as shown in Figure 4.12(a), the message “Art is long, life is short,” and the key “1234” as inputs to Algorithm 4.2, a stego-article as shown in Figure 4.12(c) together with a revision history as shown in Figure 4.12(b) was generated by the proposed method. We can see from Figure 4.12(b) that five revisions were created in order to embed the secret message. Figures 4.12(d) and 4.12(e) show the extracts of the differences between the two newest revisions, where the words in red in Figure 4.12(d) were corrected to those in red in Figure 4.12(e) by the author “Natalie.” Figures 4.12(f) and 4.12(g) show, respectively, the messages extracted by Algorithm 4.3 using a right key and a wrong one. These results show that when a user uses a wrong key, the system returns a random string as the message extraction result.

2 http://www.mediawiki.org/wiki/MediaWiki.

Table 4.3. An example of a chosen set with the new word sequence “such as”.

To evaluate the data embedding capacity of the proposed method, further experiments were conducted using a number of cover documents as inputs. Since the data embedding capacity depends on the secret message content, which influences the selections of authors and changed word sequences for each revision, we ran the experiments for each document ten times using different messages as inputs and recorded the average of the resulting data embedding capacities. The parameters of six different cover documents are shown in Table 4.4. For example, document 1 has 2,419 characters, 641 words, and 80 sentences; document 3 has 10,128 characters, 2,211 words, and so on.


Figure 4.12. An example of a generated stego-document on the constructed Wiki site with the input secret message “Art is long, life is short.” (a) Cover document. (b) Revision history. (c) Stego-document. (d) The revision previous to (e), with the words in red being those corrected to become the red words in (e). (e) Newest revision of the created stego-document. (f) Correct secret message extracted with the right key “1234.” (g) Wrong secret message extracted with a wrong key “123.”


Table 4.4. Information of the experimental cover documents.

Document 1: 2,419 characters, 641 words, 80 sentences.
Document 2: 4,762 characters, 956 words, 45 sentences.
Document 3: 10,128 characters, 2,211 words, 121 sentences.
Document 4: 11,215 characters, 2,617 words, 86 sentences.
Document 5: 26,591 characters, 6,180 words, 631 sentences.
Document 6: 60,349 characters, 14,306 words, 1,603 sentences.

In these experiments, we first selected the replacing word sequences for a revision to be the top n most frequently used ones in the database DBcw, where n = 2, 4, 8, 16, 32. Figure 4.13(a) shows the resulting data embedding capacities, from which we can see that the more replacing word sequences are selected, the more message bits can be embedded. This result comes from the fact that when more replacing word sequences are available, the constructed Huffman codes become longer.
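As a rough check on this trend, if the n selected replacing word sequences were equally likely (a simplifying assumption, not the real skewed frequencies), each corrected word sequence would carry about log2 n bits, i.e., roughly 1, 2, 3, 4, and 5 bits for n = 2, 4, 8, 16, and 32; with the real frequencies the average Huffman code length is somewhat shorter, but it still grows with n, which is consistent with the capacities shown in Figure 4.13(a).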

We also conducted experiments using different numbers of revisions (1, 2, 4, 8) in the generated stego-documents to see the resulting data embedding capacities. Figure 4.13(b) shows the results, which indicate that when the number of revisions in the stego-document is larger, more message bits can be embedded, as expected. This means that if we want to embed a larger secret message, more revisions should be generated. Yet, on a Wiki site, each revision is stored as its original text without any compression, so a larger storage space is required to store the additional revisions generated when the secret message is longer. However, one can solve this issue simply by comparing two adjacent revisions and storing only the difference between them, where such a comparison function may also be provided by other collaborative writing platforms if desired. Furthermore, we can see from Figures 4.13(a) and 4.13(b) that when a cover document has a larger size, the resulting data embedding capacity is larger as well; thus, if we want to embed more data, we have to choose a larger cover document.
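A minimal sketch of such difference-based storage is given below, assuming word-level tokenization; make_delta and apply_delta are illustrative helpers and not part of the proposed algorithms or of the MediaWiki software.

    import difflib

    def make_delta(prev_words, curr_words):
        # Keep only the operations needed to rebuild curr_words from prev_words.
        sm = difflib.SequenceMatcher(a=prev_words, b=curr_words, autojunk=False)
        return [(i1, i2, curr_words[j1:j2])
                for tag, i1, i2, j1, j2 in sm.get_opcodes() if tag != "equal"]

    def apply_delta(prev_words, delta):
        # Rebuild the newer revision from the older one plus the stored delta.
        out, pos = [], 0
        for i1, i2, new_segment in delta:
            out.extend(prev_words[pos:i1])   # unchanged words
            out.extend(new_segment)          # replaced or inserted words
            pos = i2
        out.extend(prev_words[pos:])
        return out

    older = "the results are like the ones reported earlier".split()
    newer = "the results are such as the ones reported earlier".split()
    delta = make_delta(older, newer)          # [(3, 4, ['such', 'as'])]
    assert apply_delta(older, delta) == newer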


Figure 4.13. The embedding capacities. (a) Embedding capacities of the documents with chosen sets of different sizes. (b) Embedding capacities of the documents with different numbers of revisions.

Figure 4.14 shows a comparison of the embedding capacities yielded by the proposed method with those yielded by Liu and Tsai’s method [64]. We can see from Figure 4.14 that when the number of revisions used by the proposed method is equal to one, the embedding capacity of the proposed method is very close to that yielded by Liu and Tsai [64]. Note that not every word sequence in the current revision Di–1 can be utilized for data embedding in the proposed method, because we limit the maximum number of corrected word sequences in a revision. Thus, when the number of revisions is just one, the embedding capacity of the proposed method may not be better than that of Liu and Tsai [64], which allows the use of every word for message embedding. However, when the number of revisions is equal to or greater than two, the embedding capacities of the proposed method are much larger.

Figure 4.14. Comparison of the embedding capacities yielded by the method of Liu and Tsai [64] and by the proposed method using different numbers of revisions.

Like the methods proposed in [57], [58], which can be utilized for multiple languages, we have tried to apply Algorithm 4.1 to two adjacent revisions of a Chinese document and obtained the correction pairs for them successfully, where the two revisions are shown in Figure 4.15. Note that since Chinese has no explicit word segmentation mark, spaces cannot be used to split a Chinese article into words; therefore, each Chinese character was treated as a word directly to solve this issue.

Figure 4.15 shows the found correction pairs between the two revisions; one of them, for example, is <做到, 達成>, where both word sequences in the pair mean the same as “achieve” in English.
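A simplified stand-in for this correction-pair finding step (not the full Algorithm 4.1, which additionally filters vandalized revisions and Wiki markup) can be written with a standard sequence matcher, treating every character of a Chinese revision as a token; correction_pairs is an illustrative helper and the two example sentences below are hypothetical.

    import difflib

    def correction_pairs(old_rev, new_rev, char_level=True):
        # Return <original, new> word-sequence pairs between two adjacent
        # revisions; for Chinese, every character is a token, for English,
        # tokens are space-separated words.
        old_tokens = list(old_rev) if char_level else old_rev.split()
        new_tokens = list(new_rev) if char_level else new_rev.split()
        joiner = "" if char_level else " "
        sm = difflib.SequenceMatcher(a=old_tokens, b=new_tokens, autojunk=False)
        return [(joiner.join(old_tokens[i1:i2]), joiner.join(new_tokens[j1:j2]))
                for tag, i1, i2, j1, j2 in sm.get_opcodes() if tag == "replace"]

    # Hypothetical example revisions (not those in Figure 4.15):
    print(correction_pairs("我們可以做到目標", "我們可以達成目標"))
    # -> [('做到', '達成')], the same kind of pair mentioned above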

Figure 4.15. An example showing that the proposed method can also be applied to Chinese articles.

Moreover, to present the contributions made by the proposed method, we have compared it with several other methods for data hiding via texts [35]-[37], [64], as shown in Table 4.5. Firstly, the synonym replacement methods [35]-[37] utilize synonym dictionaries to embed messages, where the synonym dictionaries are usually built manually by language experts. The embedding capacities of these methods are limited, since only those word sequences in the cover document which exist in the synonym dictionary can be utilized for data embedding.

Also, since they replace the word sequences in a cover document with their synonyms, the resulting stego-document is usually a worse version of the original cover document due to the possible loss of the original meanings in the replacements.

Furthermore, the usage frequencies of the corresponding synonyms of a word sequence are not analyzed in these methods. Secondly, the change tracking method proposed by Liu and Tsai [64] utilizes synonym dictionaries and a small collaborative writing database with only 7,581 chosen sets to embed messages, where the synonym dictionaries were built manually as well. Also, the embedding capacity of this method is limited, since only two revisions are generated by two authors and only the word sequences in the cover document are degenerated for data embedding. Moreover, the usage frequencies of word sequences in this method are just simulated ones created by using the Google SOAP Search API.

Table 4.5. Comparison of methods for data hiding via texts.


As a summary, several merits of the proposed method can now be pointed out: (1) the database of the proposed method is constructed automatically from Wikipedia, which is the largest collaborative writing platform on the Internet; therefore, the resulting stego-document generated by the proposed method is more realistic than those generated by the other four methods [35]-[37], [64]; (2) the database constructed by the proposed method is much larger than that of Liu and Tsai [64], with 1,688,732 chosen sets in the former and only 7,581 in the latter; (3) the usage frequency of each correction pair used in the proposed method is a real parameter obtained by mining the collaborative writing data found on Wikipedia, while that of Liu and Tsai [64] is just a simulated one created by using the Google SOAP Search API; and (4) the proposed method can simulate the collaborative writing process conducted by multiple authors and revisions, whereas Liu and Tsai [64] can only generate one pre-draft version of a cover text, simulating the work of two authors.

Thus, to the best of our knowledge, this is the first work that can simulate the real collaborative writing process with multiple authors and revisions by mining the revision histories on Wikipedia or similar platforms and using the characteristics of the collaborative writing process effectively for message embedding.

Furthermore, to illustrate the usability of the proposed method in the real world, it is pointed out that one can build a collaborative writing platform, such as a Wiki site, for use by a school, company, or government, and then implement the proposed method on this platform. For example, in a school, especially a large one, the teachers may establish a big wiki site with many documents for general teaching, administration, and communication uses, which are accessible by teachers, staff members, students, parents, etc. Sometimes, a teacher might want to communicate with a student’s parents in a secret way; the wiki site may then be used as a platform for such covert communication of messages. In addition, the teacher may keep secret records about the students on the wiki site using the data embedding scheme provided by the proposed method. That is, a collaborative writing platform can not only let people work collaboratively but also let them hide messages in the documents existing on the platform for applications of covert communication and secret data keeping.