
3.3.2 Message extraction and image recovery

To extract the message embedded in the encrypted image I′′, firstly the key Ke is used to decrypt I′′ to obtain another image, denoted by I′′′, whose pixels' four MSBs are all identical to those of the pixels of the original cover image I. Next, the four LSBs of each block Bm,n′ in I′′′ are decrypted under the two hypotheses of the embedded bit being 0 or 1, yielding two candidate blocks Hm,n,0 and Hm,n,1, respectively. If the embedded bit is 0, the candidate Hm,n,0 becomes the original cover block, which is usually smoother than the scrambled version Hm,n,1; and if the embedded bit is 1, then since the original cover block is encrypted only once by the key Ke, the decrypted block Hm,n,1 using the same key Ke will become the original block, which is usually smoother than the scrambled version Hm,n,0 as well. Moreover, to compute the block smoothness, we adopt the measure used in [56] described by (9); but, unlike the side-match scheme used in [56], which utilizes only the recovered blocks to compute the smoothness, a new side-match scheme using both unrecovered and recovered blocks is proposed.

In more detail, for each block Bx,y′ of the four blocks adjacent to each block Bm,n′ in I′′′, if Bx,y′ is not recovered yet, then the values of fm,n,0′ and fm,n,1′ computed according to (9) are augmented in the following way:

set fm,n,0′ = fm,n,0′ + min(|Hm,n,0 − Hx,y,0|, |Hm,n,0 − Hx,y,1|);
set fm,n,1′ = fm,n,1′ + min(|Hm,n,1 − Hx,y,0|, |Hm,n,1 − Hx,y,1|);

otherwise, if Bx,y′ has already been recovered, with its recovered version denoted by Hx,y,r, then fm,n,0′ and fm,n,1′ are augmented in the following way:

set fm,n,0′ = fm,n,0′ + |Hm,n,0 − Hx,y,r|;
set fm,n,1′ = fm,n,1′ + |Hm,n,1 − Hx,y,r|;

where each term |H − H′| denotes the smoothness measure of (9) evaluated across the common border of the two adjacent blocks.
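To make the augmentation rule concrete, the following is a minimal Python sketch of the proposed side-match computation. The border_diff helper and the edge-based data layout are illustrative assumptions standing in for the smoothness measure of (9) evaluated over the common borders of adjacent blocks.

    def border_diff(edge_a, edge_b):
        # Sum of absolute gray-value differences between two touching
        # pixel lines; a stand-in for the block-smoothness measure of (9).
        return sum(abs(a - b) for a, b in zip(edge_a, edge_b))

    def smoothness_scores(cand_edges, neighbors):
        # Compute the augmented values fm,n,0' and fm,n,1' for the two
        # candidate decryptions of the current block Bm,n'.
        #
        # cand_edges[b][k] is the pixel line of candidate Hm,n,b touching
        # the k-th adjacent block.  neighbors[k] describes the k-th
        # adjacent block Bx,y': either
        #   ('unrecovered', edge_of_Hxy0, edge_of_Hxy1) or
        #   ('recovered', edge_of_Hxyr).
        f = [0, 0]
        for b in (0, 1):                      # candidate decryption index
            for k, nb in enumerate(neighbors):
                if nb[0] == 'unrecovered':    # better of the neighbor's two candidates
                    f[b] += min(border_diff(cand_edges[b][k], nb[1]),
                                border_diff(cand_edges[b][k], nb[2]))
                else:                         # use the recovered version Hx,y,r
                    f[b] += border_diff(cand_edges[b][k], nb[1])
        return f

The extracted bit is then taken as 0 if f[0] < f[1] and as 1 otherwise, and the winning candidate serves as the recovered block in subsequent side-match computations.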

Since the blocks adjacent to Bm,n′ are all used to compute the smoothness, no matter whether they are recovered or not, the resulting bit-extraction error rate is smaller than those of schemes that do not do so. Figure 3.3 shows some results illustrating this effect for 8×8 blocks, together with a comparison with the result yielded by [56].


Figure 3.3. Recovery results showing effects of using both recovered and unrecovered blocks for measuring smoothness of 8×8 blocks, with incorrectly-recovered blocks marked as white, and error rate denoted by err. (a) Result with err = 2.66% yielded by [56]. (b) Result with err = 0.46% yielded by proposed method without using side-match. (c) Result with err = 0.27% yielded by proposed method using only recovered blocks in side-match scheme. (d) Result with err = 0.22% yielded by proposed method using both recovered and unrecovered blocks in side-match scheme.

3.4 Experimental Results

Four 512512 test images, Figures 3.4(a) through 3.4(d), were used in the experiments, and the results of the proposed method are compared with those yielded by [55] and [56], as illustrated in Figure 3.5 which includes plots of the trends of bit-extraction error rates vesus different block sizes ss. It is seen from Figures 3.5(a) through 3.5(d) that the error rates yielded by the proposed method are much smaller than those yielded by [55] and [56]. For example, for the cover image Figure 3.4(a) with block size 88, Figure 3.5(a) shows that the bit-extraction error rates using [55]

and [56] are 12.87% and 10.21%, respectively; and that yielded by the proposed method is 0.07%. Moreover, Figure 3.5(a) shows that the error rate yielded by the proposed method is zero when s is larger than 12, but those yielded by both [55] and [56] are still larger than zero when s = 32.


Figure 3.4. Four test images of size 512×512.

Additionally, we compare the execution time of the proposed method with those of [55] and [56] in message embedding. As mentioned previously in Sections 3.2 and 3.3.1, to embed a message bit, three LSBs of a portion of the pixels of each image block are flipped in [55] and [56], whereas in the proposed method the four LSBs of every pixel of each block are encrypted further. Therefore, the execution times of the proposed method and of [55] and [56] for message embedding are all very short, since the operations involve only encryptions and flippings, respectively. For example, for the cover image of Figure 3.4(a), Figure 3.6 shows a comparison of the execution time for message embedding required by the proposed method with those required by [55] and [56] versus different block sizes, where the execution times are all very short (about 0.1 to 0.15 s).


Figure 3.5. Comparisons of bit-extraction error rates yielded by proposed method with those yielded by [55] and [56] versus different block sizes. (a) Error rates with cover image Figure 3.4(a). (b) Error rates with cover image Figure 3.4(b). (c) Error rates with cover image Figure 3.4(c). (d) Error rates with cover image Figure 3.4(d).


Figure 3.6. Comparison of execution time for message embedding required by proposed method with those required by [55] and [56] versus different block sizes.

Also, experiments comparing the effects of using three or four LSBs for data embedding have been conducted. Results for 8×8 blocks are shown in Figure 3.7, where the number of used LSBs is denoted by NL. Specifically, the methods of [55] and [56] do not perform better when NL = 4 for the image of Figure 3.4(a), as can be seen from Figures 3.7(e) through 3.7(h). The same conclusion can be drawn for the other images. Contrarily, Figures 3.7(c) and 3.7(d) show that the proposed method performs better as NL is enlarged from 3 to 4.


Figure 3.7. Recovery results showing effects of using different numbers NL of LSBs for 8×8 blocks, with incorrectly-recovered blocks marked as white and error rate denoted by r. (a) Cover image. (b) Decrypted image with message embedded. (c) Result with r = 0.90% yielded by proposed method for NL = 3. (d) Result with r = 0.07% yielded by proposed method for NL = 4. (e) Result with r = 12.87% yielded by [55] for NL = 3. (f) Result with r = 29.27% yielded by [55] for NL = 4. (g) Result with r = 10.21% yielded by [56] for NL = 3. (h) Result with r = 27.76% yielded by [56] for NL = 4.

Furthermore, the average distortion of the decrypted image with respect to the original image yielded by the proposed method can be computed as follows. Firstly, a decrypted pixel in the decrypted image has two possibilities: (1) correct decryption, in which the decrypted pixel is the same as the original one; or (2) incorrect decryption, in which the decrypted gray value differs from the original one by some amount δi, where δi represents the difference between the decrypted gray value and the original one. Averaging over all such differences, the PSNR of the decrypted image with respect to the original image yielded by the proposed method is about 32.25 dB, which is not as good as those of [55] and [56] with the average PSNR value of 37.9 dB. However, the proposed method significantly reduces the bit-extraction error rate and solves the flat image problem without keeping the spatial similarity of the LSBs of the pixels in each block, as mentioned previously.
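As a rough sanity check of this figure, the following minimal Python sketch estimates the PSNR under the simplifying assumption that every incorrectly decrypted pixel has its four LSBs replaced by a uniformly random 4-bit value; this is not necessarily the exact distortion model used above, but the estimate lands near the reported 32.25 dB.

    import math

    # Expected squared error when a pixel's four LSBs are replaced by a
    # uniformly random 4-bit value, averaged over all original LSB values:
    # E[(a - b)^2] with a, b independent and uniform on {0, ..., 15},
    # which equals (16^2 - 1)/6 = 42.5.
    n = 16
    mse = sum((a - b) ** 2 for a in range(n) for b in range(n)) / n ** 2

    psnr = 10 * math.log10(255 ** 2 / mse)
    print(f"MSE = {mse}, estimated PSNR = {psnr:.2f} dB")  # about 31.9 dB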

3.5 Summary

A new data hiding method via encrypted images based on double image encryption and refined spatial-correlation comparison has been proposed, which does not have the weakness of the two existing methods [55] and [56] in handling flat cover images. The weakness comes from the way of flipping the three LSBs of each pixel in part of each block in an encrypted image to embed a message bit. The proposed method improves on this by encrypting the four LSBs of each pixel of every block, instead of flipping three of them, to embed a bit. Also, a refined side-match scheme utilizing the spatial correlations of both recovered and unrecovered blocks has been proposed to decrease the bit-extraction error rate, in contrast with Hong et al. [56], which utilizes only those of recovered blocks. Experimental results show the feasibility of the proposed method. Future studies may be directed to applying the proposed method to various information hiding purposes.

Chapter 4

A New Data Hiding Technique via Revision History Records on Collaborative Writing Platforms

4.1 Introduction

Recently, more and more collaborative writing platforms have become available, such as Google Drive, Office Web Apps, Wikipedia, etc. On these platforms, a huge number of revisions generated during the collaborative writing process are recorded.

Furthermore, many people work collaboratively on these platforms. Thus, these platforms are very suitable for data hiding applications, such as covert communication, secret data keeping, etc. It is desired to propose a new method which is useful for covert communication or secure keeping of secret messages on collaborative writing platforms. However, the above-mentioned data hiding methods via text [29]-[38] in Section 1.3.3 can only be applied to documents with single authors and single revision versions, meaning that they are not suitable for hiding data on collaborative writing platforms. Therefore, the goal of this study is to propose a new data hiding method which can hide data in documents created on collaborative writing platforms. In more detail, a new data hiding method is proposed which simulates a collaborative writing process to generate a fake document, consisting of an article and its revision history, as a camouflage for message bit embedding. As shown in Figure 4.1, with the input of an article and a secret message, the proposed method utilizes multiple virtual authors to collaboratively revise the article, artificially generating a history of earlier revisions of the article according to the secret message. An ordinary reader will regard the resulting stego-document as a normal collaborative writing output, and cannot realize the existence of the secret message hidden in the document.

Figure 4.1. Simulation of a real collaborative writing process on a cover document for message embedding.

Moreover, the previously-mentioned linguistic methods [35]-[38] in Section 1.3.3 use written natural languages to generate stego-documents and can produce more innocuous stego-texts than other data hiding methods, but an issue common to them is how to find a natural way to simulate the writing process and how to obtain large-volume written data automatically. Hence, another goal of this study is to find a natural way to generate the revision history and to obtain large-volume collaborative writing data automatically. In recent years, studies have been conducted to analyze the revision history data of Wikipedia articles for various natural language processing applications [57]-[63], such as spelling correction, reformulation, text summarization, user-edit classification, multilingual content synchronization, etc. In addition to being useful for these applications, the collaboratively written data in Wikipedia are also very suitable, as found in this study, for simulating the collaborative writing process for the purpose of data hiding, since Wikipedia is the largest collaborative writing platform nowadays.

In [64], Liu and Tsai proposed a data hiding method via Microsoft Word documents using the change tracking function, which embeds a secret message by mimicking a pre-draft document written by an author with an inferior writing skill and encoding the secret message by choices of degenerations in the writing. Although they used three databases of degenerations, their sizes are quite small compared with that of the database constructed from Wikipedia which we make use of for data embedding in this study. It is noted, by the way, that a data hiding method can, as is well known, embed more bits by making use of a larger database. Furthermore, in [64] a stego-document is generated by only two virtual persons, and the change tracking data are made by the one with the better writing skill. This scenario is insufficient for simulating a normal collaborative writing process. Therefore, in this study we propose a new framework that uses the revision-history data from Wikipedia and simulates real collaborative writing processes to hide secret messages. Four characteristics of collaborative writing processes are analyzed and utilized for message hiding, including the author of each revision, the number of corrected word sequences, the content of the corrected word sequences, and the word sequences replacing the corrected ones. The proposed method is useful for covert communication or secure keeping of secret messages on collaborative writing platforms.

4.2 Basic Idea of Proposed Method

Collaborative writing means an activity in which more than one author creates an article cooperatively on a common platform. The purposes of establishing a collaborative writing platform include knowledge sharing, project management, data keeping, etc. Many collaborative writing platforms are available, such as Google Drive, Office Web Apps, Wikipedia, etc., which record revisions generated during the collaborative writing process. In general, the recorded information of a revision includes: 1) the author of the revision, 2) the time the revision was made, and 3) the content of the revision. For example, Figure 4.2 shows a screenshot of the revision history of an article about computer vision on Wikipedia.

Figure 4.2. A screenshot of the revision history of an article about computer vision on Wikipedia.

To achieve the goal of creating camouflage revisions in collaborative writing for message hiding in this study, we analyze the existing revision-history data of articles on Wikipedia, currently the largest collaborative writing platform on the Internet. The aim is to get real and large-volume collaborative writing data contributed by people all over the world and to use them to create more realistic revision histories, enhancing the resulting effect of data embedding. However, since the collaborative writing process is very complicated, it is hard to find a unified model to simulate it. Many different types of modifications may be made during the collaborative writing process [57], [59], such as error corrections, paraphrasing, factual edits, etc. Moreover, different languages usually require different models due to their distinct grammatical structures. Therefore, in order to get useful collaborative writing data automatically from the revision history data on Wikipedia without building models manually, and to obtain a method that can be applied to multiple languages, we assume that only word sequence corrections occur during a revision. Some characteristics of collaborative writing based on this assumption for data embedding are identified, which will be discussed in the following. It is noted that various text articles, not only in English but also in other languages, can be utilized as cover media in this study.

The revision history of each article in Wikipedia is stored in a database, and one can recover any previous revision of the article through an interface provided on the site. As an illustration, Figure 4.3 shows a screenshot of two consecutive revisions of an article about computer vision on Wikipedia. For this study, we have collected a large set of revision-history data from Wikipedia, and in the proposed method we mine this set to get useful information about word usages in the revisions. Then, we use the acquired information to simulate a collaborative writing process, starting from a cover article, and generate a stego-article with a sequence of revisions according to the secret message and a secret key. The resulting stego-document, including the stego-article and the revision history, looks like a work created by a group of real authors, achieving an effect of camouflage. In contrast, we call the original article with an initially-empty history a cover document in the sequel.

More specifically, the proposed method includes three main phases as shown in Figure 4.4: 1) construction of a collaborative writing database; 2) secret message embedding; and 3) secret message extraction. In the first phase, a large number of articles acquired from Wikipedia are analyzed, and useful collaboratively written data about word usages are mined using a natural language processing technique. The mined data are then used to construct a database, called the collaborative writing database and denoted as DBcw subsequently. In the second phase, with the input of a cover document, a secret message, and a secret key, a stego-document with a fake revision history is generated by simulating a real collaborative writing process using DBcw. The revisions in the history are supposed to be made by multiple virtual authors, and the following characteristics of each revision are decided by the secret message: 1) the author of the revision; 2) the number of changed word sequences of the revision; 3) the changed word sequences in the revision; and 4) the word sequences selected from the collaborative writing database DBcw which replace those of 3), called the replacing word sequences in the sequel (a simplified sketch of how message bits can drive these choices is given below). And in the third phase, an authorized person who has the secret key can extract the secret message from the stego-document, while those who do not have the key cannot do so; they cannot even realize the existence of the secret message, because it is disguised as the revision history in the stego-document. Note that the second and third phases can be applied on any collaborative writing platform, not just on Wikipedia; Wikipedia is merely utilized in the first phase to construct the collaborative writing database DBcw in this study.
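As an illustration of how successive message bits can drive these four choices, the following is a minimal Python sketch; the indexing scheme, the virtual-author list, and the candidate corrections shown are hypothetical, and the actual method also involves the secret key and the data mined into DBcw.

    def take_bits(bits, k):
        # Consume k message bits and return them as an integer.
        return int(bits[:k], 2), bits[k:]

    def choose(options, bits):
        # Select one of the options using floor(log2(len(options))) bits;
        # a hypothetical encoding, keyed by the secret key in the real method.
        k = max(1, len(options).bit_length() - 1)
        idx, rest = take_bits(bits, k)
        return options[idx % len(options)], rest

    # Hypothetical inputs for one simulated revision.
    virtual_authors = ["author_A", "author_B", "author_C", "author_D"]
    candidate_corrections = [("Chia Tang", "Chiao Tung"), ("recieve", "receive"),
                             ("colour", "color"), ("is consist of", "consists of")]
    message = "1101001011"

    author, message = choose(virtual_authors, message)    # characteristic 1
    count, message = choose([1, 2, 3, 4], message)        # characteristic 2
    revision = []
    for _ in range(count):                                # characteristics 3 and 4
        pair, message = choose(candidate_corrections, message)
        revision.append(pair)

    print(author, revision)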

Figure 4.3. A screenshot of two consecutive revisions of an article about computer vision on Wikipedia.

Figure 4.4. Flow diagram of the proposed method: 1) collaborative writing database construction, mining collaborative writing data from articles in Wikipedia to build the collaborative writing database; 2) secret message embedding, simulating a real collaborative writing process on the cover document to produce a stego-document with a revision history; 3) secret message extraction, extracting the secret message from the revision history information.

4.3 Data Hiding via Revision History

In this section, the details of the proposed method for using the analyzed characteristics of collaborative writing to hide secret messages are described in three parts: collaborative writing database construction, secret message embedding, and secret message extraction.

4.3.1 Collaborative writing database construction

To construct the aforementioned collaborative writing database DBcw, we mine the revision data collected from Wikipedia. There were about 4.2 million articles in the English Wikipedia in May 2013, forming a very large knowledge repository; therefore, it is suitable as a source for constructing the database DBcw desired in this study. Specifically, at first we downloaded part of the English Wikipedia XML dump of August 3, 2011, with the complete revision histories of all the articles. Then, we mined the useful collaborative writing data from the downloaded data set under the assumption that only word sequence corrections occur during a revision (a sketch of streaming revisions out of the dump is given below).
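As a concrete illustration of this step, the following is a minimal sketch of streaming consecutive revision texts out of such a dump with Python's standard library, assuming the usual MediaWiki export schema in which each <page> element contains its <revision> elements in chronological (oldest-first) order; the file name is illustrative, and the actual mining pipeline additionally tokenizes and filters the texts.

    import xml.etree.ElementTree as ET

    def iter_revision_pairs(dump_path):
        # Yield (D_i, D_{i-1}) pairs, i.e., (older_text, newer_text), for
        # every two consecutive revisions of every page, streaming the file
        # so the full dump never has to fit in memory.
        revisions = []
        for event, elem in ET.iterparse(dump_path, events=("end",)):
            tag = elem.tag.rsplit("}", 1)[-1]   # strip the XML namespace
            if tag == "text":
                revisions.append(elem.text or "")
            elif tag == "page":
                for older, newer in zip(revisions, revisions[1:]):
                    yield older, newer
                revisions = []
                elem.clear()                    # free the processed page

    # Each yielded pair can be fed to Algorithm 4.1 below to collect the
    # correction pairs for the database DBcw.
    for d_i, d_i_minus_1 in iter_revision_pairs("enwiki-history-sample.xml"):
        pass  # find correction pairs between d_i and d_i_minus_1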

As illustrated in Figure 4.5, each downloaded article P has a set of revisions {D0, D1, …, Dn} in its revision history, where a newer revision Di has a smaller index i, with D0 being the latest version of the article. For every two consecutive revisions Di and Di–1, we find all the correction pairs between Di and Di–1, each denoted as <sj, sj′>, where sj is a word sequence in revision Di that was corrected to become another word sequence, namely sj′, by the author of revision Di–1. Then, we collect all correction pairs so found to construct the database DBcw. For example, assume Di = "National Chia Tang University" and Di–1 = "National Chiao Tung University", as shown in Figure 4.6.

Figure 4.5. Illustration of used terms and notations.

Figure 4.6. An example of found correction pairs between Di and Di–1.

Moreover, regarding the properties of correction pairs, it was observed that if the context of a word sequence sj in revision Di is the same as that of a word sequence sj′ in revision Di–1 (that is, if the preceding word of sj is the same as that of sj′ and the succeeding word of sj is the same as that of sj′ as well), then <sj, sj′> is a correction pair. Accordingly, a novel algorithm is proposed in this study for automatically finding all of the correction pairs between every two consecutive revisions for inclusion in DBcw. The algorithm is an extension of the longest common subsequence (LCS) algorithm [65]. The details are described in Algorithm 4.1.

Algorithm 4.1. Finding correction pairs.

Input: two consecutive revisions Di and Di–1 in the revision history of an article P.

Output: the correction pairs between Di and Di–1.

Steps:

Stage 1: finding the longest common subsequence.

Step 1. (Splitting revisions into word sequences) Split Di and Di–1 into two sequences of words, W = {w1, w2, …, wn} and W′ = {w1′, w2′, …, wm′}, respectively.

Step 2. (Constructing a counting table by dynamic programming) Construct an n×m counting table T to record the lengths of the common subsequences of W and W′ as follows.

(a) Initialize all elements in table T to be zero.

(b) Compute the values of table T from the upper left and denote the currently-processed entry in T by T(x, y) with x = 1 and y = 1 initially.

(c) If the content of wx is identical to that of wy′, then let T(x, y) = T(x – 1, y – 1) + 1; else, let T(x, y) = max(T(x – 1, y), T(x, y – 1)).

(d) If x is smaller than n, then let x = x + 1 and go to Step 2(c); else, if y is smaller than m, then let x = 1 and y = y + 1 and go to Step 2(c); else, regard table T as filled up and continue.

Step 3. (Finding the longest common subsequence) Apply a backtracking procedure to table T, starting from T(n, m), to find the longest common subsequence L = {l1, l2, …, lt}, where each element li in L is a word common to W and W′.

Stage 2: finding the correction pairs.

Step 4. (Finding the correction pairs) Starting from the first element l1 of L, with the currently-processed element in L being denoted by lp, find the correction pair <sj, sj′> between every two consecutive elements lp and lp+1 of L, where sj is the word sequence in Di lying between lp and lp+1 and sj′ is the word sequence in Di–1 lying between the same two elements; collect all such pairs with sj ≠ sj′ as the desired correction pairs.
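A compact Python rendition of Algorithm 4.1 is given below as a sketch: Stage 1 is the standard LCS dynamic program over the two word sequences, and Stage 2 pairs the differing word runs lying between consecutive common words. Whitespace tokenization is an assumption made for illustration.

    def find_correction_pairs(rev_i, rev_i_minus_1):
        # Find the correction pairs <s_j, s_j'> between two consecutive
        # revisions D_i and D_{i-1}, following Algorithm 4.1.
        W = rev_i.split()             # words of D_i
        Wp = rev_i_minus_1.split()    # words of D_{i-1}
        n, m = len(W), len(Wp)

        # Stage 1, Step 2: fill the counting table by dynamic programming.
        T = [[0] * (m + 1) for _ in range(n + 1)]
        for x in range(1, n + 1):
            for y in range(1, m + 1):
                if W[x - 1] == Wp[y - 1]:
                    T[x][y] = T[x - 1][y - 1] + 1
                else:
                    T[x][y] = max(T[x - 1][y], T[x][y - 1])

        # Stage 1, Step 3: backtrack from T(n, m) to recover the LCS as
        # index pairs (position in W, position in Wp) of common words.
        common, x, y = [], n, m
        while x > 0 and y > 0:
            if W[x - 1] == Wp[y - 1]:
                common.append((x - 1, y - 1))
                x, y = x - 1, y - 1
            elif T[x - 1][y] >= T[x][y - 1]:
                x -= 1
            else:
                y -= 1
        common.reverse()

        # Stage 2, Step 4: the word runs between consecutive common words
        # (and before the first / after the last) form the correction pairs.
        pairs = []
        anchors = [(-1, -1)] + common + [(n, m)]
        for (a, b), (c, d) in zip(anchors, anchors[1:]):
            s = " ".join(W[a + 1:c])
            sp = " ".join(Wp[b + 1:d])
            if s != sp:
                pairs.append((s, sp))
        return pairs

Run on the example of Figure 4.6, find_correction_pairs("National Chia Tang University", "National Chiao Tung University") returns [('Chia Tang', 'Chiao Tung')], i.e., the single correction pair <"Chia Tang", "Chiao Tung">.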