
Chapter 4 A New Data Hiding Technique via Revision History Records on Collaborative Writing Platforms

4.3 Data Hiding via Revision History

4.3.2 Secret message embedding

In the phase of message embedding with a cover document D0 as the input, the proposed system generates a stego-document D′ with consecutive revisions {D0, D1, D2, …, Dn} by producing the previous revision Di from the current revision Di–1 repeatedly until the entire message is embedded, as shown in Figure 4.5, where the direction of revision generation is indicated by the green arrows. The stego-document D′ including the revision history {D0, D1, D2, …, Dn} is then kept on a collaborative writing platform, which may be Wikipedia or another such site. To simulate a collaborative writing process more realistically, we utilize the four aforementioned characteristics of revisions to “hide” the message bits in the revisions sequentially: 1) the author of the previous revision Di; 2) the number of changed word sequences in the current revision Di–1; 3) the changed word sequences in the current revision Di–1; and 4) the replacing word sequences in the previous revision Di, as described in the following.

(1) Encoding the authors of revisions for data hiding.

We encode the authors of revisions to hide message bits in the proposed method.

For this, at first we select a group of simulated authors, with each author being assigned a unique code a; such an author is called author a. Then, if the message bits to be embedded form a code aj, we assign author aj to the previous revision Di as its author, thereby embedding the message bits aj into Di. For example, assume that four authors are selected and each is assigned a unique code a as shown in Figure 4.7.

If the message bits aj to be embedded are “01,” then Jessy with author code “01” is selected to be the author of the revision Di. Moreover, every revision of D0 through Dn is assigned an author according to the corresponding message bits, so an author may conduct more than one revision or, conversely, none of the generated revisions.

Figure 4.7. Illustration of encoding authors of revisions for data hiding.
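To make the encoding concrete, the following minimal Python sketch assigns authors to revisions by consuming na = 2 bits at a time. Only Jessy with code “01” comes from the example above; the other names and codes are hypothetical stand-ins, and in the full scheme the remaining message bits would go to the other three carriers.

```python
# Author-code embedding sketch.  Only "Jessy" with code "01" comes from
# the text; the other three names and codes are hypothetical.
AUTHOR_CODES = {"00": "Tom", "01": "Jessy", "10": "Mary", "11": "John"}

def authors_for_revisions(message_bits: str) -> list[str]:
    """Consume na = 2 bits per generated revision and return the author
    assigned to each of D1, D2, ... in order."""
    authors = []
    for i in range(0, len(message_bits) - 1, 2):
        authors.append(AUTHOR_CODES[message_bits[i:i + 2]])
    return authors

print(authors_for_revisions("0110"))  # ['Jessy', 'Mary']
```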

(2) Using the number of changed word sequences for data hiding.

In the process of generating the previous revision Di from the current one Di–1, we select some word sequences in Di–1 and change them into other ones in Di. We also wish to use the number Ng of word sequences changed in this process as a message-bit carrier.

To implement this aim, we first impose a limit Nc on the magnitude of Ng, taken to be the maximum allowed number of word sequences in Di–1 that can be changed to yield Di. This limitation makes the simulated step of revising Di–1 into Di look more realistic, because usually not very many words are corrected in a single revision.

Next, we scan the word sequences in the text of the current revision Di–1 sequentially and search DBcw to find all the correction pairs <sj, sj'> with sj' in Di–1. Then, we collect all sj' in these pairs as a set Qr, which we call the candidate set of word sequences for changes in Di–1. Finally, we select Ng word sequences in Qr to form a set Qc such that the binary version of the number Ng is just the current message bits to be embedded.
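A simplified sketch of this scan is given below, assuming DBcw is available as a dictionary keyed by the new word sequences sj'. The real system matches word sequences incrementally as in Step 5(b) of Algorithm 4.2, whereas this version simply tests every n-gram of Di–1 up to a fixed length.

```python
# Building the candidate set Qr.  DBcw is assumed to be a dict mapping
# each new word sequence sj' to its chosen set of correction pairs.
def build_candidate_set(di_minus_1: str, dbcw: dict, max_len: int = 4):
    words = di_minus_1.split()
    qr = []
    for start in range(len(words)):
        for length in range(1, max_len + 1):
            if start + length > len(words):
                break
            seq = " ".join(words[start:start + length])
            if seq in dbcw:
                qr.append((start, length, seq))  # positions kept for the
    return qr                                    # dependency check later

# Hypothetical database entries and revision text, for illustration.
dbcw = {"such as": [("for example", 1500)], "increase in": [("improve", 40)]}
print(build_candidate_set("topics such as data hiding", dbcw))
# -> [(1, 2, 'such as')]
```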

(2.1) The dependency problem.

We say that two word sequences in Di–1 are dependent if some identical words appear in both of them, and changing word sequences with this property in Di–1 will cause conflicts, leading to a dependency problem which we explain by an example as follows. As illustrated in Figure 4.8(a), once the word sequence q4 is selected and changed, the dependent word sequences q2, q3, and q5 in Di–1 cannot be selected and changed anymore, because they include words in q4 which have already been changed and disappeared. That is, any part of a changed word sequence cannot be changed again; otherwise, a dependency problem will occur.

To avoid this problem in creating Di from Di–1, we propose a two-step scheme: 1) decompose Qr into a set of lists, I = {I1, I2, …, Iu}, with each list Ii including a group of mutually dependent word sequences (i.e., every word sequence in each Ii is dependent on another in the same list) and every two word sequences taken from two different lists in I being independent of each other; and 2) select only word sequences from different lists in I and change them to construct a new revision. The details of the first step are described in Algorithm 4.2. After applying the first step to the set Qr shown in Figure 4.8(b), it is transformed into I = {(q1), (q2, q3, q4, q5), (q6), (q7), (q8), (q9, q10), (q11)}, where each pair of parentheses encloses a list of mutually dependent word sequences. With I ready, we can now select word sequences from distinct lists Ii in it, such as q1, q2, q6, and q9, to simulate changes of word sequences in revision Di–1 without causing the dependency problem.
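The first step can be sketched as follows. Since all candidates are substrings of the same text Di–1, two word sequences share identical words exactly when their word spans overlap, so the grouping reduces to merging overlapping spans; input items are the (start, length, text) triples produced by the candidate-set sketch above.

```python
# Decomposing Qr into the lists of I by merging overlapping word spans.
def decompose_into_lists(qr):
    lists = []            # each entry: [sequences_in_list, end_of_span]
    for start, length, seq in sorted(qr):
        if lists and start < lists[-1][1]:   # overlaps the current span:
            lists[-1][0].append(seq)         # same list (dependent)
            lists[-1][1] = max(lists[-1][1], start + length)
        else:                                # independent: open a new list
            lists.append([[seq], start + length])
    return [group for group, _ in lists]

# With overlapping spans for q2..q5 and for q9..q10, this reproduces
# I = [(q1), (q2, q3, q4, q5), (q6), (q7), (q8), (q9, q10), (q11)].
```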

Figure 4.8. Illustration of the dependency problem. (a) Revision Di–1 and the candidate set Qr, where dependent word sequences are surrounded by red squares. (b) The set I corresponding to Qr for solving the dependency problem.

(2.2) The selection problem.

It is desired to select word sequences for use in the simulated revisions according to their usage frequencies in DBcw, so that a more frequently corrected word sequence has a larger probability of being selected, forging a more realistic revision.

For this aim, following [32], [64], we adopt the Huffman coding technique to create a unique Huffman code for each word sequence in Qr according to its usage frequency, and select the word sequence whose code is identical to the message bits to be embedded. Specifically, by a property of Huffman coding, the lengths of the resulting codes are inversely related to the usage frequencies of the word sequences. So a word sequence with a shorter Huffman code has a larger probability of being selected, which can be computed as (1/2)^L, where L denotes the number of bits of the code. That is, the use of Huffman coding indeed achieves the aim of selecting word sequences in favor of those which are more frequently corrected in real cases.

But a problem arises here: after we select one word sequence qy in this way, qy cannot be used in the revision again for encoding an identical succeeding code in the message, because qy has already been changed into another word sequence. We call this the selection problem; it comes partially from the unique decodability property of Huffman coding. To illustrate this problem with the previous example shown in Figure 4.8, suppose the Huffman codes for word sequences q1 through q11 are as shown in Figure 4.9(a), and the message bit sequence to be embedded currently is “100100…” with the first six bits being just two repetitions of the code “100.” At first we select word sequence q4 and change it into another in the revision, because the first three message bits to be embedded, “100,” are just the code for q4 (indicated in red in Figure 4.9(a)). After this, the next three message bits to be embedded are again the code “100” (indicated in blue in Figure 4.9(a)); however, the corresponding word sequence q4 cannot be selected any further because it has already been changed in the current revision, and no other word sequence can be selected either, because their codes differ from the current message bits “100” to be embedded.

To solve this problem, we propose to decompose the candidate set Qr randomly into groups, assign Huffman codes to the word sequences in each group Gk according to their usage frequencies, and select a word sequence in Gk with its assigned code identical to the leading message bits for use in the revision. We apply this step repetitively until all groups are processed. In this process, Huffman coding is applied to each Gk with word sequences distinct from those in the other groups, so that the selection problem of choosing a word sequence twice due to code repetition in the message cannot happen anymore. For example, as shown in Figure 4.9(b), Qr is divided into three groups, G1, G2, and G3, represented by red, blue, and green colors, respectively, and Huffman codes are assigned to the elements in each group. Starting from G1, q2 will be selected because the code of q2 is the same as the first three bits “100…” of the message to be embedded. Next, in G2, q8 will be selected because the message bits to be embedded are currently “100…” Finally, q11 in G3 will be selected because the current message bits to be embedded are “0…” In this way, the previous problem of being unable to embed the repetitive code “100” is solved automatically. In short, by decomposing the candidate set Qr of word sequences for changes randomly into groups and representing each group by its own Huffman codes, we can embed message bits sequentially by changing only one word sequence in each group without causing the selection problem.
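The following sketch illustrates this group-wise selection. The per-group code tables are hypothetical stand-ins for the Huffman codes of Figure 4.9(b); only q2 → “100,” q8 → “100,” and q11 → “0” are stated in the text.

```python
# Group-wise selection with one Huffman table per group.
GROUP_CODES = [
    {"q5": "0", "q1": "11", "q2": "100", "q3": "101"},   # G1 (partly hypothetical)
    {"q4": "0", "q6": "11", "q8": "100", "q7": "101"},   # G2 (partly hypothetical)
    {"q11": "0", "q9": "10", "q10": "11"},               # G3 (partly hypothetical)
]

def select_one_per_group(message: str, group_codes):
    picks = []
    for table in group_codes:
        # each table is prefix-free, so at most one code can match
        for seq, code in table.items():
            if message.startswith(code):
                picks.append(seq)
                message = message[len(code):]
                break
    return picks, message

picks, rest = select_one_per_group("1001000", GROUP_CODES)
print(picks)  # ['q2', 'q8', 'q11'] -- the repeated code "100" now embeds
```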

However, the above process is insufficient; it must be modified in such a way that word sequences which have mutual dependency relations are put into an identical group, in order to avoid the dependency problem discussed previously. For this aim, instead of decomposing the word sequences in Qr directly into random groups as mentioned previously, we divide randomly the mutually-independent list elements of I into Ng groups, where each group is denoted by GIk. Then, we take out all the word sequences in the lists in each GIk to form a new group of word sequences, denoted as Gk, resulting again in Ng groups of word sequences. For instance, for the previous example shown in Figure 4.8, let Ng = 2 and suppose that the list elements of I are decomposed randomly into two groups: GI1 = {I1, I2, I3, I4} and GI2 = {I5, I6, I7}.

(2.3) The consecutiveness problem.

A further difficulty arises when word sequences changed for two different groups happen to be adjacent in the generated revision, as illustrated in Figure 4.10(a): suppose that the word “increase” in Di–1 was changed to be “improve” in Di for one group, while the word “in” in Di–1 was changed to be “themselves” in Di for another. Because of the consecutiveness of the two words “improve” and “themselves” in Di, the two changes might be considered as a single one during secret message extraction, i.e., the word sequence “increase in” in Di–1 might be regarded to have been changed to be “improve themselves” in Di. This ambiguity causes a problem, namely, we cannot know whether a change from a word sequence in Di–1 into another in Di comes from one group or two, or equivalently, we cannot know the true number Ng of changed word sequences in Di–1, so that we cannot extract the embedded message bits correctly later. We call this difficulty in message extraction the consecutiveness problem.

Figure 4.10. (a) Illustration of the consecutiveness problem. (b) Choosing splitting points randomly to solve the consecutiveness problem.

Obviously, word sequences in different groups must be made non-consecutive in order to solve this problem. For this aim, the previously-mentioned grouping scheme is modified as follows: instead of dividing all the list elements of I into the Ng groups directly, we choose Ng – 1 of them randomly as splitting points, exclude them from use, and take the runs of list elements between the splitting points as the groups. For the previous example with Ng = 2, if I5 = (q8) is chosen as the single splitting point, the two groups of word sequences then become G1 = {q1, q2, q3, q4, q5, q6, q7} and G2 = {q9, q10, q11}. Because of the existence of the splitting point I5 = (q8), groups G1 and G2 are non-consecutive, and accordingly using them to create word sequence changes in revisions will cause no consecutiveness problem.
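A sketch of the splitting-point scheme follows. The key-seeded randomness and the exclusion of the splitting points mirror step (e) of Algorithm 4.2, while the check that every group keeps at least two lists (needed later for Eq. (11)) is omitted for brevity.

```python
import random

# Splitting-point grouping: Ng - 1 lists of I are drawn with key-seeded
# randomness, excluded from use, and the runs between them become the
# groups, so no two groups are consecutive.
def split_into_groups(I, ng, key):
    rng = random.Random(key)                     # key-derived randomness
    points = sorted(rng.sample(range(len(I)), ng - 1))
    groups, prev = [], 0
    for p in points + [len(I)]:
        run = I[prev:p]
        if run:
            groups.append([seq for lst in run for seq in lst])
        prev = p + 1
    return groups

I = [["q1"], ["q2", "q3", "q4", "q5"], ["q6"], ["q7"],
     ["q8"], ["q9", "q10"], ["q11"]]
# If index 4, i.e. the list (q8), is drawn as the splitting point, the
# result is G1 = q1..q7 and G2 = q9, q10, q11, as in the example above.
```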

(2.4) The encoding problem.

The remaining issue is how to determine the aforementioned number Ng of word sequences to be changed in Di–1. Although a limit Nc is set for Ng, the maximum number Nm of word sequences that can actually be selected in Di–1 may be even smaller than Nc. Therefore, we must compute Nm first before we can embed message bits via the number Ng. After Nm is decided, Ng may then be taken to be a number not larger than Nm, with the actual value of Ng decided by the leading secret message bits, say nm of them. Consequently, we may assume that Nm satisfies the two constraints of 1) Nm = 2^nm and 2) 1 ≤ Nm ≤ Nc, where nm is a positive integer. In addition, in order to embed message bits by selecting a word sequence from a group Gk, the number of elements in Gk should not be smaller than two, so that at least one message bit can be embedded by Huffman coding; hence, each group GIk mentioned previously should be created to include at least two elements of I. Accordingly, the maximum number Nm of word sequences to be changed in Di–1 must satisfy the following formula:

[NI − (Nm − 1)]/Nm ≥ 2, (11)

where NI is the number of elements in set I and Nm – 1 represents the aforementioned number of chosen splitting points. The inequality (11) can be reduced to

Nm ≤ (NI + 1)/3. (12)

Accordingly, we can compute Nm by the following rule:

if (NI + 1)/3 > Nc, set Nm = Nc;

if 1 ≤ (NI + 1)/3 ≤ Nc, set Nm = 2^⌊log2((NI + 1)/3)⌋. (13)

Furthermore, the content of Di–1 might be too short for Nm to be decided by Eq. (13).

In that case, we abandon the original cover document D0 from which Di–1 is generated, and use another, longer cover document as the input. After the value of Nm is computed, we can then use the leading nm bits of the message to decide the number Ng of changed word sequences in Di–1 by two steps: 1) express the first nm message bits as a decimal number; and 2) increment the decimal number by one. The second step is required to handle the case that the first nm message bits are all zeros, which would lead to the undesired result of no word sequence being changed in the current revision. In this way, Ng truly becomes a carrier of nm message bits. For example, the number of elements of the set I for the previously-mentioned example shown in Figure 4.8 is NI = 7. Let Nc = 4. Because (NI + 1)/3 = (7 + 1)/3 ≈ 2.67 ≤ 4 = Nc, Nm is computed to be 2^⌊log2 2.67⌋ = 2 by Eq. (13), so that nm = log2 Nm = 1.
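This computation can be sketched as below; it assumes Nc is itself a power of two and that the bound (NI + 1)/3 yields Nm ≥ 2, since nm must be a positive integer.

```python
import math

# Parameter computation of Eqs. (11)-(13), assuming Nc is a power of
# two and that (NI + 1)/3 yields Nm >= 2 (nm must be positive).
def compute_ng(n_i: int, n_c: int, message: str):
    bound = (n_i + 1) / 3
    if bound > n_c:
        n_m = n_c                            # Eq. (13), first case
    elif bound >= 2:
        n_m = 2 ** int(math.log2(bound))     # largest power of 2 <= bound
    else:
        return None      # Di-1 too short: switch to a longer cover document
    nm_bits = int(math.log2(n_m))            # nm
    n_g = int(message[:nm_bits], 2) + 1      # +1 so at least one change occurs
    return n_m, nm_bits, n_g, message[nm_bits:]

# NI = 7, Nc = 4: bound = 8/3, so Nm = 2, nm = 1, and a leading bit "1"
# gives Ng = 1 + 1 = 2.
print(compute_ng(7, 4, "10110"))  # (2, 1, 2, '0110')
```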

(3) Encoding the changed word sequences in the current revision for data hiding.

According to the previous discussions, we may assume that we have computed the number Ng of word sequences which should be changed in the current revision Di–1 according to the first nm bits of the secret message, and that we have classified the available word sequences in Qr into Ng groups, where each group Gk includes at least two word sequences and all word sequences in Gk are encoded by Huffman coding according to their usage frequencies. Specifically, the usage frequency of a word sequence sj' is taken to be the summation of the correction counts of all the correction pairs in the chosen set of sj', which have sj' as their common new word sequence.

Then, starting from G1, we may select from each group Gk one word sequence with a Huffman code identical to the leading bits of the message to be embedded, achieving the goal of data hiding via changing word sequences in Di–1.

For example, assume that the usage frequencies of the word sequences in group G2 as shown in Figure 4.10(b) are: q9 = 100, q10 = 50, and q11 = 150; and the message is “10100….” Then, the Huffman codes assigned to q9, q10, and q11 are “01,” “00,” and “1,” respectively; and so we select q11 to hide the first bit “1” of the message because the code of q11 is “1.”
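A minimal Huffman construction reproducing this example is sketched below; note that the exact 0/1 labels depend on tie-breaking, so the sender and receiver must agree on one convention.

```python
import heapq
from itertools import count

# Minimal Huffman construction over one group's usage frequencies.
def huffman_codes(freqs):
    order = count()   # tie-breaker so heapq never compares the dicts
    heap = [(f, next(order), {s: ""}) for s, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)      # merge the two rarest
        f2, _, c2 = heapq.heappop(heap)      # subtrees, prefixing 0/1
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (f1 + f2, next(order), merged))
    return heap[0][2]

print(huffman_codes({"q9": 100, "q10": 50, "q11": 150}))
# {'q11': '0', 'q10': '10', 'q9': '11'}: the same code lengths as the
# text's "1", "00", "01", with the 0/1 labels flipped by tie-breaking.
```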

(4) Encoding the replacing word sequences in the previous revision for data hiding.

Symmetrically, we may also use the replacing word sequences in Di to embed message data, where each replacing word sequence sj in Di corresponds to a changed word sequence sj' in Di–1, forming a correction pair <sj, sj′>. Specifically, recall that for each sj', we can find a chosen set of correction pairs from DBcw. From this set, we can collect all the original word sequences of the correction pairs as another set Qc', with each word sequence in Qc' being appropriate for use as the replacing word sequence sj. Let Qc' = {s1, s2, …, sw}. Then, to carry out message data hiding, we encode all sj in Qc' by Huffman coding according to their usage frequencies as well, and choose the one with its code identical to the leading message bits for use as the word sequence sj replacing sj'. Here the usage frequency of each sj is the correction count of the correction pair <sj, sj'>. For example, Table 4.3 shows the chosen set of the word sequence “such as” with all included original word sequences already assigned Huffman codes according to their usage frequencies. Based on the table, if the message to be embedded currently is “01001001…,” then we change the word sequence “such as” in the current revision Di–1 into the word sequence “for example” in the previous revision Di, because the Huffman code for “for example,” namely 0100, is the same as the first four bits of the secret message.
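The selection can be sketched as below; only the pair “for example” → 0100 comes from the text, while the other entries and their codes imitate a Table 4.3-style chosen set and are hypothetical.

```python
# Choosing the replacing word sequence for "such as".  Only
# "for example" -> 0100 comes from the text; the other entries imitate
# a Table 4.3-style chosen set and are hypothetical.  The five codes
# form a complete prefix-free code, so exactly one matches any message.
CHOSEN_SET_CODES = {
    "like": "1",
    "for instance": "00",
    "e.g.": "011",
    "for example": "0100",
    "including": "0101",
}

def pick_replacement(message: str, code_table: dict):
    for seq, code in code_table.items():
        if message.startswith(code):
            return seq, message[len(code):]
    raise ValueError("code table is not prefix-complete")

seq, rest = pick_replacement("01001001", CHOSEN_SET_CODES)
print(seq, rest)  # for example 1001 -- "such as" becomes "for example"
```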

(5) Secret message embedding algorithm.

As a summary, we have demonstrated the usability of the aforementioned four characteristics of revisions for data hiding. Therefore, we can generate a stego-document with a forged revision history which looks like a realistic work written by people collaboratively. The details of the proposed message embedding process are described in Algorithm 4.2 below.

Algorithm 4.2. Secret message embedding.

Input: a cover document D0, a binary message M of length t, a secret key K, and the collaborative writing database DBcw constructed by Algorithm 4.1.

Output: a stego-document D′ with a revision history {D0, D1, D2, …, Dn}.

Steps:

Stage 1: message preparation and parameter determination.

Step 1. (Message composition) Affix an s-bit binary version of the message length t to the beginning of M to compose a new binary message M′, where the number s of bits representing t is agreed upon by the sender and the receiver beforehand.

Step 2. (Message encryption) Randomize M' to yield a new binary message M′′ using the key K.

Step 3. (Parameter determination) Use K again to decide randomly both an integer Na for use as the number of authors and another integer Nc for use as the limit on the number Ng of word sequences to be changed in every revision.

Step 4. (Author encoding) Create Na authors to form an author list Ia, and assign a unique na-bit code to each author in Ia.

Stage 2: message embedding and revision generation.

Step 5. (Message embedding and revision generation) Generate the previous revision Di from the current revision Di–1 repeatedly by the following procedure until all bits in M'' are embedded, where i = 1 initially.

Stage 2.1: embedding data via author encoding.

(a) (Embedding bits by an author code) Choose for Di the author aj from the author list Ia with the na-bit code assigned to author aj being identical to the leading na bits of M′′; and remove these na bits from M''.

Stage 2.2: embedding data using the number of changed word sequences in the current revision.

(b) (Finding the candidate word sequences for changes in Di–1) Create the candidate set Qr of word sequences for changes in Di–1 by the following steps.

(i) Take in order an unprocessed word w in revision Di–1, and set the currently processed word sequence q as w initially.

(ii) Check if q matches some leading words or all of the words in the new word sequence of any correction pair in DBcw; if so, do the following two steps:

(A) if q is identical to the entire new word sequence, then add q to Qr and continue;

(B) create a new word sequence, still denoted as q, by concatenating the old q with the word qr to the right of q in Di–1, and go to Step 5.b.ii.

(c) (Decomposing Qr into lists of dependent word sequences) Decompose Qr into a set of lists I = {I1, I2, …, Iu}, where each Ii is a list of mutually dependent word sequences and every two lists are independent, by the following steps.

(i) Take each word sequence in Qr as a list initially.

(ii) Check the ordered word sequences in Qr one by one sequentially: if the currently-checked word sequence qs and its previous one qt are dependent, i.e., share some identical words, then merge the list containing qs into the list containing qt.

(d) (Deciding the number of changed word sequences) Decide the number Ng of word sequences to be changed in Di–1 by the following steps.

(i) Compute the maximum number Nm of word sequences to be changed in revision Di–1 by Eq. (13) described previously, and compute nm as log2 Nm.

(ii) Decide the number Ng as the decimal value of the first nm bits of message M'' plus one, and remove these nm bits from M''.

Stage 2.3: embedding data via the word sequences changed in the current revision.

(e) (Choosing splitting points) Choose randomly Ng – 1 elements of I as splitting points using the key K.

(f) (Classifying the word sequences into independent sets) Divide the elements of I into Ng groups, GI1 through GINg, by the splitting points, and take out all the word sequences in the lists in each GIk to form a new group of word sequences, denoted as Gk, resulting in Ng groups of word sequences, G1 through GNg.

(g) (Choosing word sequences to change for message-bit embedding) For each group Gk with k = 1 initially, encode its word sequences by Huffman coding according to their usage frequencies, choose the word sequence sj' in Gk with its code matching the leading message bits of M'', mark sj' for change, and remove the matched leading bits from M''; repeat this for k = 1 through Ng.

Stage 2.4: embedding data via the replacing word sequences in the previous revision.

(h) (Finding the chosen set) Find the chosen set of the previously marked word sequence sj' from the correction pairs kept in DBcw, and collect all the original word sequences in the chosen set as a set Qc'.

(i) (Choosing the original word sequence for use in replacement) Encode the word sequences in Qc' by Huffman coding according to their usage frequencies, choose the word sequence sj in Qc' with its code matching the leading message bits of M'', and remove the matched leading bits from M''.

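To illustrate Stage 1, the following sketch implements Steps 1 and 2; the s-bit length prefix follows the algorithm, while the SHA-256 keystream is only an assumed stand-in for whatever keyed randomization of M′ the sender and the receiver agree upon.

```python
import hashlib

# Stage 1 of Algorithm 4.2 (Steps 1-2).  The s-bit length prefix
# follows the algorithm; the SHA-256 keystream is an assumed stand-in
# for the keyed randomization agreed upon by sender and receiver.
def prepare_message(m: str, key: str, s: int = 16) -> str:
    m_prime = format(len(m), f"0{s}b") + m       # Step 1: affix length t
    stream, counter = "", 0
    while len(stream) < len(m_prime):            # key-derived bit stream
        digest = hashlib.sha256(f"{key}:{counter}".encode()).digest()
        stream += "".join(format(b, "08b") for b in digest)
        counter += 1
    # Step 2: randomize M' bitwise against the keystream to obtain M''
    return "".join(str(int(a) ^ int(b)) for a, b in zip(m_prime, stream))

m_pp = prepare_message("1011001", key="K")       # M'' for Stage 2
```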