
Chapter 3 A New Steganographic Method for Data Hiding in Microsoft Word Documents by a

3.6 Security Considerations

In the proposed method, it is assumed that the degeneration sets and the key used are agreed upon by the sending and receiving parties beforehand. The degenerations in the degeneration database should model realistic errors to counter visual steganalysis. Our way of using existing collaborative data as done in the experiments described previously can achieve this purpose. Specifically, an adversary inspecting a stego-document yielded by our experiments could not tell whether it is really an actual author making the mistakes, or whether the mistakes are introduced for the steganographic purpose by using the proposed method.

Next, the proposed approach is robust against statistical steganalysis because the degenerations are chosen according to their occurrence probabilities, so that the occurrence frequencies of degenerations in a stego-document are in line with those of normal documents. To ensure that the statistical properties of the degenerations of a stego-document are close to those of a normal document, we can encrypt the message before embedding so that the bits of the encrypted message are randomly distributed. Nevertheless, there is still a chance for the statistics of the degenerations to stray from those of a normal document. To ensure maximal statistical coherence, we can alter the occurrence probabilities of the degenerations appropriately during message embedding. For example, we can halve the occurrence probability of a degeneration option after it has been chosen so that it is less likely to be chosen again later, thus achieving the desired statistical coherence with a normal document.
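The probability adjustment mentioned above can be illustrated by the following minimal Python sketch. The function name damp_chosen_option, the example options, and their probabilities are hypothetical, and the renormalization step is an assumption added so that the values remain a probability distribution; the text only states that the chosen option's probability is halved.

```python
def damp_chosen_option(probabilities, chosen, factor=0.5):
    """Scale the chosen option's probability by `factor`, then renormalize.

    `probabilities` maps each degeneration option to its occurrence
    probability. Renormalizing is an assumption of this sketch.
    """
    if chosen not in probabilities:
        raise KeyError(chosen)
    updated = dict(probabilities)
    updated[chosen] *= factor
    total = sum(updated.values())
    return {option: p / total for option, p in updated.items()}


# Hypothetical degeneration options for one word, with made-up probabilities.
options = {"recieve": 0.5, "receeve": 0.25, "receve": 0.25}
after_one_use = damp_chosen_option(options, "recieve")
```

Since the degeneration choices are decoded from these probabilities, the receiver would presumably have to apply the same update after each extracted choice in order to stay synchronized with the sender.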

In addition, to ensure that the scheme is as inconspicuous as possible to adversaries, the degeneration database used should be known only to the communicating parties. This can be achieved, for example, by using an evolving degeneration database that is modified dynamically whenever normal collaborative documents are transmitted between the parties. One way to implement this idea is for the communicating parties to add into the degeneration database a key-dependent subset of the degenerations derived from collaboratively working on a normal document. In this way, an adversary cannot determine the exact contents of the degeneration database being used.

Another technique that can be employed is for a sender to use the proposed method to embed the intended payload into a stego-document and then manipulate the unused portions (essentially the padding parts of M') of the stego-document to include degenerations outside their agreed degeneration database to mislead adversaries. The extra degenerations so introduced are assumed to be ignored by the receiver.

3.7 Summary

A new steganographic method for data hiding in Microsoft Word documents has been presented in this chapter. The data embedding is disguised such that the stego-document appears to be the product of a collaborative writing effort. Information is embedded in the degeneration stage of document transformation with steganographic effects. The degeneration stage introduces different degenerations mimicking an author with inferior writing skill, with the secret message embedded in the choices of degenerations using Huffman coding. The proposed message embedding and extraction methods have been implemented, demonstrating the feasibility of the proposed method.

The proposed change-tracking technique for steganography is special in that the modifications made during embedding are essential information that cannot be tampered with carelessly. In contrast, methods that are based on redundant or unused information, imperceptibility, or alternative representations are more vulnerable to attacks by an active warden, who can process all suspect documents without affecting normal cases of usage. The degeneration database used during degeneration does not need to be private to the sender and the receiver if the degenerations are realistic, as the warden cannot distinguish between legitimate collaboration cases and covert communication cases.

Figure 3.4. Extracts of stego-documents produced using the proposed method with databases 1 and 3.

Figure 3.5. Extracts of stego-documents produced using the proposed method with databases 1 and 2.

Chapter 4

Quotation Authentication: A New Approach and Efficient Solutions by Data Hiding and Cascade Hashing Techniques

4.1 Introduction

The advancement of digital technology and the Internet has greatly eased the publication of information to the masses, whether in the form of web pages, e-mails, PDF files, or office documents. The abundance of content made available is mostly beneficial to the general public, but significant amounts of annoying or even harmful information are also being distributed on the Internet.

There are many ways to help identify useful information. Search engines, such as Google and Yahoo, analyze the inter-linkage of documents on the web and deduce the relative degrees of importance of the documents by using their page ranking technology [88], with the result that documents ranked high in their search results tend to be more trustworthy. Also, people tend to trust information published by credible sources, for example, announcements by federal or local governments; publications from the IEEE, ACM, or other famous publishers with stringent peer review processes; news reported by CNN, ABC, or other well-known news agencies; and so on.

There are however numerous information publishers and distributors which are less known to the public and yet provide useful information. In particular, they sometimes organize valuable contents published by credible sources by quoting, aggregating, or disseminating them to make the edited data more relevant or accessible to readers. For example, it is common for an article to quote significant findings from technical journals or government reports that would otherwise never be read by the general public.

Unfortunately, there is no easy way for a reader to verify that the quotations contained in these documents are really from the claimed sources. This vulnerability is frequently exploited in Internet hoaxes, such as virus warnings and sympathy letters. One mobile phone virus hoax, for example, claims that if a user answers a call without caller identity, the user’s phone can be infected and become unusable. The hoax lures the reader by claiming that the information has been confirmed by credible mobile phone companies (like Nokia and Motorola), and that the news was reported on CNN. Due to the credibility of the claimed sources, most people forwarded the message without performing any further verification. The multiplicative effect of such blind message passing puts unnecessary burdens on the network and servers, and costs readers valuable time to process the messages. Another example is nutrition and health recommendations that point out unhealthy food, products, or lifestyles. The contents are mixed with myths, true pieces of information, forged allegations against competitor companies, and so on. Such contents come from a variety of sources, some of which may not be accessible online or require special subscriptions, making it difficult for a reader to verify the truthfulness of the information.

In this study, we investigate this problem of quotation authentication, where a new approach is proposed to allow efficient authentication of quotations from trusted sources but embedded in documents created by unknown authors. The three parties involved in quotation authentication are respectively the source author A_S of a source document S, the document author A_D of a document D that contains quotations from S, and the document reader R_D of D. We assume that R_D recognizes A_S as a credible source but does not know A_D. The goal of quotation authentication is to allow R_D to efficiently and undisputedly verify the fidelity and source of the quotations contained in D.

Methods to authenticate a message as a whole are relatively well studied. Typically, two communicating parties are assumed to share a secret key that is used to generate an appropriate message authentication code (MAC) for verifying the fidelity of contents transmitted over an insecure channel. Bellare et al. [90], [91] proposed the keyed-hash message authentication code (HMAC) that can generate provably secure message authentication codes based on standard hash functions such as MD5 and SHA1. Cipher-based MAC (CMAC) techniques such as XMAC and OMAC use standard block ciphers such as DES and AES to construct message authentication codes [92]. In the asymmetric case, a sender creates, using its private key, an appropriate digital signature for the transmitted content, such that the receiver can verify the fidelity of the content using the public key of the sender. A straightforward way to generate a digital signature is to generate a hash value for the content and encrypt the value using asymmetric encryption such as the RSA algorithm, as described in PKCS#1 [93]. A related research area is that of digital certificates, which provide means for associating the human-recognizable identity of an entity with that entity’s public key. HTTPS is an example of their use, where websites register and use X.509 certificates that are issued by trusted certificate authorities [94], allowing users to authenticate the websites they connect to.
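As a minimal illustration of the symmetric case described above, the following Python sketch computes and checks an HMAC tag with the standard hmac and hashlib modules. The key, the message, the helper names make_tag and check_tag, and the choice of SHA-256 (rather than the MD5 or SHA1 functions mentioned above) are assumptions made for illustration only.

```python
import hashlib
import hmac


def make_tag(key: bytes, message: bytes) -> bytes:
    """Compute an HMAC-SHA256 tag for `message` under the shared `key`."""
    return hmac.new(key, message, hashlib.sha256).digest()


def check_tag(key: bytes, message: bytes, tag: bytes) -> bool:
    """Recompute the tag and compare it in constant time."""
    return hmac.compare_digest(make_tag(key, message), tag)


shared_key = b"example shared secret"  # hypothetical key known to both parties
message = b"content sent over an insecure channel"
tag = make_tag(shared_key, message)
```

Only a party holding the shared key can produce or check such a tag, which is why the asymmetric, signature-based setting is the one relevant to a document reader who shares no secret with the source author.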

We will describe in the subsequent sections how to efficiently generate quotation signatures using the above-mentioned primitives. Specifically, in the proposed method we assume that signatures contain (or contain a reference to) an appropriate digital certificate of the source author so that a document reader can obtain a source author’s public key and verify its fidelity. The public key can then be used to authenticate quotations that were previously signed using the source author’s private key.

There is a branch of research in stream authentication [95] that bears some resemblance to the problem of quotation authentication, where receivers need to verify whether the received stream of packets is from the intended source. The problem becomes complex when multicast traffic has to be authenticated in the face of packet losses. Efficient schemes for multicast authentication have been proposed by using the temporal relationships of network packets, that is, later packets can be used to verify the authenticity of prior packets [96]. These techniques, however, cannot be applied to the problem of quotation authentication studied here since there is no notion of time in documents and quotations.

The importance of verifying quotations within a document was recognized by Fernstrom [89]. However, the solutions proposed there assume either that the document reader has access to the original source document or that there is a trusted party to which document authors can send arbitrary quotations and source documents in order to get endorsed quotations that are verifiable by the document reader.

4.2 Overview of Proposed Method

A more practical approach is proposed in this dissertation to solve the quotation authentication problem, where each of the three parties only needs to perform certain simple and deterministic processes, as described by the following scenario:

1. A_S processes the source S to create a certain source signature G_S, and then publishes S along with G_S to the general public;

2. A_D cites a text segment q of interest within S, uses G_S to create an appropriate quotation signature G_q for q, applies data hiding techniques to create an integrated authenticable-quotation q', and puts q' into D;

3. finally R_D, when reading D, verifies that the quotation q, which is contained in q', is indeed authored by A_S.

The benefit of using data hiding techniques in Stage 2 above to create an integrated authenticable-quotation q' is that the quotation can be distributed multiple times using standard copy-and-paste operations by several document authors without hindering the ability of the final document reader to identify such authenticable-quotations and to verify that the quotation is indeed authored by A_S. This is illustrated in Figure 4.1 below.

It is assumed in the above proposed method that an appropriate quotation signature G_q can be generated using the source signature G_S, thus removing the need for a third party to endorse quotations. Also, G_q alone is sufficient for R_D to verify that the quotation q is indeed from A_S, so that R_D does not have to access the original source S. The digital signatures G_S and G_q, attached to S and q, respectively, are the overhead required to achieve the goal of quotation authentication; we therefore propose techniques for generating source and quotation signatures that minimize the size of this overhead, as described in the sequel. In more detail, assume that texts are quoted from S by P document authors to generate P documents, which are consumed by Q document readers. The size of the total overhead of all G_S and G_q for such quotation authentication, called the total overhead size, is P|G_S| + P|G_q| + Q|G_q|, since each of the P document authors needs to receive S as well as G_S from the source author and include q as well as G_q in his/her document, and each document reader needs to receive q and G_q from a document author.

Figure 4.1. Illustration of processes performed by and information passed between a source author, one or more document authors, and a document reader.

4.3 Basic Signature Generation Techniques for Quotation Authentication

In this section, we describe two basic techniques to generate appropriate source and quotation signatures for the purpose of quotation authentication, and point out their shortcomings. The proposed more efficient techniques are then described subsequently.

A basic quotation authentication technique is for the source author to generate a digital signature for every possible quotation in a source document and to publish the signatures along with the source document on a certain website. A document author then only needs to quote the desired text and bundle it with the associated published digital signature to generate a valid authenticable-quotation. This technique is however inefficient: since a quotation may start and end at any position of a document, the number of possible quotations for a document of length L is of the order O(L^2). This means in turn that the number of digital signatures is also of the order O(L^2). This scheme is referred to as the enumerate-all-quotations technique in the sequel.
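The quadratic count can be checked directly. Assuming that a quotation is identified by a start position a and an end position b with 1 ≤ a ≤ b ≤ L (the index names a and b are introduced here only for this count), the number of possible quotations is:

```latex
\[
\#\{(a,b) : 1 \le a \le b \le L\}
  \;=\; \sum_{a=1}^{L} (L - a + 1)
  \;=\; \frac{L(L+1)}{2}
  \;=\; O(L^2).
\]
```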

Another basic technique is to include the entire source document S, together with an appropriate digital signature, into the quotation signature G_q of a quotation q. A document reader can then easily extract S from G_q and verify that the quotation really came from S. However, including the whole article will bloat the size of G_q to be of the order O(L), and the act of attaching the entire article S might violate copyright law. This scheme is referred to as the quote-the-whole technique in the sequel.

The total overhead size of the enumerate-all-quotations technique is O(PL^2 + P + Q), or equivalently, O(PL^2 + Q), since G_S contains O(L^2) digital signatures and G_q contains a single digital signature. On the other hand, the total overhead size of the quote-the-whole technique is O(P + PL + QL), or equivalently, O(PL + QL), since G_S is a simple digital signature, but the size of G_q is of the order O(L).

If Q is significantly larger than P, that is, if there are many more document readers than document authors, then the enumerate-all-quotations technique is more efficient than the quote-the-whole one in view of the total overhead size. However, if the magnitudes of P and Q are comparable, then the quote-the-whole technique is more efficient. Since the two techniques have their respective merits for different application cases, we propose in this study two better techniques which improve them respectively with the same goal of minimizing the total overhead size, as described in the following.
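The trade-off between the two basic techniques can be made concrete with a small Python sketch that evaluates the total overhead size P|G_S| + P|G_q| + Q|G_q| for each of them. The unit sizes are assumptions chosen only for illustration: every digital signature and every unit of source text counts as one unit, the enumerate-all-quotations source signature holds L(L+1)/2 signatures, and the quote-the-whole quotation signature holds the L units of S plus one signature. The function names are hypothetical.

```python
def total_overhead(source_sig_size, quote_sig_size, authors, readers):
    """P|G_S| + P|G_q| + Q|G_q| in abstract size units."""
    return authors * source_sig_size + (authors + readers) * quote_sig_size


def enumerate_all_quotations(length, authors, readers):
    # |G_S|: one signature per (start, end) pair; |G_q|: a single signature.
    return total_overhead(length * (length + 1) // 2, 1, authors, readers)


def quote_the_whole(length, authors, readers):
    # |G_S|: a single signature; |G_q|: the whole source plus one signature.
    return total_overhead(1, length + 1, authors, readers)


# Many more readers than authors: enumerate-all-quotations is cheaper.
many_readers = (enumerate_all_quotations(10, 1, 1000), quote_the_whole(10, 1, 1000))

# Comparable numbers of authors and readers: quote-the-whole is cheaper.
comparable = (enumerate_all_quotations(10, 10, 10), quote_the_whole(10, 10, 10))
```

Under these assumed unit sizes, the first pair evaluates to 1056 versus 11012 units and the second to 570 versus 230 units, in line with the qualitative comparison above.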

4.4 Multi-Use Signatures Technique (MUST)

The basic idea of the enumerate-all-quotations technique is to include in the source signature G_S the digital signatures for all possible quotations so that the size of the quotation signature G_q is small for any quotation. This is, however, overly inefficient for large S, and we describe in the following a multi-use signatures technique (MUST) in which the size of G_q remains of the order O(1) for all quotations while the required size of G_S is reduced to the order O(L) instead of O(L^2).

Assume that S consists of L concatenated sentences denoted as s_1 || s_2 || s_3 || … || s_L, where “||” specifies the concatenation operator, and that the length of each sentence is bounded, meaning that each sentence s_i is of the order O(1) in size. It is also assumed that quotations consist of complete sentences, that is, q is a sequence of consecutive sentences in S such that q = s_a || s_{a+1} || … || s_b, where 1 ≤ a ≤ b ≤ L. This assumption is reasonable because quoting only a portion of a sentence may change its meaning significantly. For cases where this assumption does not hold, a remedy technique has also been proposed in this study, as described later in Section 4.6.2.

4.4.1 Generation of MUST Source and Quotation Signatures

The idea of the proposed MUST is for the source author to generate L multi-use signatures g_j, where 1 ≤ j ≤ L, such that the signature g_j can be used as part of the quotation signature for any of the quotations s_a || s_{a+1} || … || s_j, where 1 ≤ a ≤ j. In addition to the multi-use signatures, we also generate L cascaded hash values h_j, where 1 ≤ j ≤ L, and include the hash value h_j in the quotation signature for a quotation s_j || s_{j+1} || … || s_b, where j ≤ b ≤ L.

The generation of the multi-use signatures and cascaded hash values is described in detail in the following algorithm.

Algorithm 4.1: generation of a MUST source signature.

Input: a source document S consisting of L sentences s_j where 1 ≤ j ≤ L.

Output: a source signature G_S, which contains L cascaded hash values h_j and L multi-use signatures g_j where 1 ≤ j ≤ L.

Steps:

1. Set the first cascaded hash value to be the hash value of the entire source, that is, set h_1 = H(S), where H(·) is some hash function.

2. For each j from 2 to L, compute the cascaded hash value h_j as h_j = H(h_{j-1} || H(s_{j-1})).

3. For 1 ≤ j ≤ L, calculate the multi-use signature g_j as g_j = Sign(h_j || H(s_j)), where Sign(·) is a signing function of some digital signature algorithm.

As a simple example, the cascaded hash values generated by the above algorithm for a source document with three sentences are as follows: h_1 = H(S), h_2 = H(h_1 || H(s_1)), and h_3 = H(h_2 || H(s_2)). The quotation signature for the quotation q = s_3 contains h_3 and g_3, where g_3 = Sign(h_3 || H(s_3)). Note that in detail h_3 equals H(h_2 || H(s_2)) = H(H(h_1 || H(s_1)) || H(s_2)) = H(H(H(S) || H(s_1)) || H(s_2)), which is a value derived by a cascade of concatenation and hashing operations, hence the name cascaded hash value.
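The steps of Algorithm 4.1 can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions: SHA-256 is used for H, sentences are given as byte strings, and list indices are 0-based, so h[0] corresponds to the first cascaded hash value. Since the Python standard library offers no public-key signature scheme, Sign is replaced by an HMAC stand-in built by the hypothetical helper make_demo_signer; this stand-in is not a real digital signature, and an actual deployment would use a scheme such as RSA.

```python
import hashlib
import hmac


def H(data: bytes) -> bytes:
    """Hash function H; SHA-256 is an assumption of this sketch."""
    return hashlib.sha256(data).digest()


def make_demo_signer(key: bytes):
    """Stand-in for Sign: HMAC-SHA256, NOT a real digital signature."""
    def sign(data: bytes) -> bytes:
        return hmac.new(key, data, hashlib.sha256).digest()
    return sign


def must_source_signature(sentences, sign):
    """Return (h, g): cascaded hash values and multi-use signatures."""
    if not sentences:
        raise ValueError("the source must contain at least one sentence")
    # Step 1: the first cascaded hash value is the hash of the whole source.
    h = [H(b"".join(sentences))]
    # Step 2: each later value chains the previous value with the hash of
    # the previous sentence.
    for j in range(1, len(sentences)):
        h.append(H(h[j - 1] + H(sentences[j - 1])))
    # Step 3: one multi-use signature per sentence.
    g = [sign(h[j] + H(sentences[j])) for j in range(len(sentences))]
    return h, g


sentences = [b"First sentence. ", b"Second sentence. ", b"Third sentence."]
sign = make_demo_signer(b"hypothetical signing key")
h, g = must_source_signature(sentences, sign)
```

For the three-sentence source used here, h[2] reproduces the nested expression given for the third cascaded hash value in the example above.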

4.4.2 Verification of MUST Quotation Signatures

When a document reader receives a quotation q = s_a || s_{a+1} || … || s_b and a quotation signature G_q that includes h_a and g_b, the following algorithm is proposed to verify the quotation.

Algorithm 4.2: quotation verification using MUST.

Input: a quotation q = s_a || s_{a+1} || … || s_b, where a ≤ b, a cascaded hash value h_a, and a multi-use signature g_b.

Output: result of quotation verification.

Steps:

1. Perform the following steps to compute h_b while a is smaller than b:

a. compute the cascaded hash value h_{a+1} to be H(h_a || H(s_a));

b. increment the value of a by 1.

2. Verify the quotation by Verify(h_b || H(s_b), g_b) and output the result, where Verify is the reciprocal digital signature verification function of the Sign function used by the source author.

The above verification works for any quotation q = s_a || s_{a+1} || … || s_b, where 1 ≤ a ≤ b ≤ L, because the cascaded hash value h_b calculated by Algorithm 4.2 using h_a and q is the same as that calculated in Algorithm 4.1. This is easily seen since the computations performed by Step 1a of Algorithm 4.2 are the same as those by Step 2 of Algorithm 4.1.
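The verification procedure of Algorithm 4.2 can be sketched in Python as follows, together with a condensed source-side computation that is included only to obtain test values. As assumptions of this sketch, SHA-256 is used for H, indices are 0-based, and Sign and Verify are replaced by an HMAC stand-in with a hypothetical key. Unlike a real digital signature, this stand-in requires the verifier to hold the secret key, so it illustrates only the data flow.

```python
import hashlib
import hmac

KEY = b"hypothetical signing key"


def H(data: bytes) -> bytes:
    """Hash function H; SHA-256 is an assumption of this sketch."""
    return hashlib.sha256(data).digest()


def sign(data: bytes) -> bytes:
    """Stand-in for Sign: HMAC-SHA256, NOT a real digital signature."""
    return hmac.new(KEY, data, hashlib.sha256).digest()


def verify(data: bytes, signature: bytes) -> bool:
    """Stand-in for Verify, matching the stand-in sign() above."""
    return hmac.compare_digest(sign(data), signature)


def must_verify(quoted_sentences, h_a, g_b, verify_fn):
    """Check a quotation given its first cascaded hash and last signature."""
    if not quoted_sentences:
        return False
    h = h_a
    # Step 1: advance the cascaded hash over all but the last sentence.
    for sentence in quoted_sentences[:-1]:
        h = H(h + H(sentence))
    # Step 2: check the multi-use signature of the last quoted sentence.
    return verify_fn(h + H(quoted_sentences[-1]), g_b)


# Source side (Algorithm 4.1, condensed) to produce values for the demo.
source = [b"Alpha. ", b"Beta. ", b"Gamma. ", b"Delta."]
h = [H(b"".join(source))]
for j in range(1, len(source)):
    h.append(H(h[-1] + H(source[j - 1])))
g = [sign(h[j] + H(source[j])) for j in range(len(source))]

# Quote the second and third sentences: the reader needs h[1] and g[2].
genuine = must_verify(source[1:3], h[1], g[2], verify)
forged = must_verify([source[1], b"Forged. "], h[1], g[2], verify)
```

Here genuine is expected to hold, whereas the altered second sentence in the forged quotation changes the signed input and causes verification to fail.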

Since the cascaded hash value h_b is constructed by using cascaded operations of hashing and concatenation, an attacker needs to find exact collisions of hash values in order to craft a forged quotation as well as a matching quotation signature that can ensure the same h_b to be constructed using Algorithm 4.2. The proposed technique is thus attack-resilient if the functions H and Sign are chosen properly, such as using the hash function SHA1 or SHA2 and the signing function RSA with sufficiently long keys [92], [97].

4.4.3 Total Overhead Size of MUST

A MUST source signature G_S for a source document S with L sentences always contains exactly L cascaded hash values and L multi-use signatures, and so the size of G_S is of the order O(L). A MUST quotation signature G_q always contains one cascaded hash value and one multi-use signature, and thus the size of G_q is of the order O(1). The total overhead size of the MUST with P document authors and Q document readers is thus O(PL + P + Q), or equivalently, O(PL + Q), in contrast with the total overhead size of O(PL^2 + Q) for the basic enumerate-all-quotations technique.
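As a worked illustration with hypothetical values L = 100, P = 10, and Q = 1000, and under the simplifying assumption that each hash value and each digital signature counts as one unit, the total overhead sizes of the MUST and of the enumerate-all-quotations technique (with L(L+1)/2 signatures in its source signature) compare as follows:

```latex
\begin{align*}
\text{MUST:} \quad
  & P \cdot 2L + (P + Q) \cdot 2
    = 10 \cdot 200 + 1010 \cdot 2 = 4020, \\
\text{enumerate-all-quotations:} \quad
  & P \cdot \frac{L(L+1)}{2} + (P + Q) \cdot 1
    = 10 \cdot 5050 + 1010 = 51510.
\end{align*}
```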

4.5 Tree Root Uni-Signature Technique (TRUST)

The previously described quote-the-whole technique requires the complete source document S to be included in the G_q of a certain q. This is overly inefficient for large S, and the proposed improvement, called the tree root uni-signature technique (TRUST), is described in this section. The TRUST uses a tree-like construction of hash values such that the size of G_q can be reduced to the order O(log_2 L). Only the root of the tree of hash values needs to be signed by the source author, thus keeping the size of G_S at O(1).
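To illustrate why a tree of hash values leads to logarithmic-size quotation signatures, the following Python sketch builds a generic binary hash tree over the sentences and produces, for a single sentence, the list of sibling hash values needed to recompute the root. This is a generic construction given under stated assumptions (SHA-256 for H, duplication of the last node on levels of odd size, single-sentence quotations, hypothetical function names); it is not necessarily the exact TRUST construction.

```python
import hashlib


def H(data: bytes) -> bytes:
    """Hash function H; SHA-256 is an assumption of this sketch."""
    return hashlib.sha256(data).digest()


def build_levels(leaves):
    """Return all tree levels, from the leaf hashes up to the single root."""
    level = [H(leaf) for leaf in leaves]
    levels = [level]
    while len(level) > 1:
        if len(level) % 2:  # duplicate the last node on odd-sized levels
            level = level + [level[-1]]
        level = [H(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels


def proof_for(levels, index):
    """Sibling hashes needed to recompute the root from leaf `index`."""
    proof = []
    for level in levels[:-1]:
        if len(level) % 2:
            level = level + [level[-1]]
        sibling = index ^ 1  # the neighbor within the same pair
        proof.append((level[sibling], sibling < index))
        index //= 2
    return proof


def root_from_proof(leaf, proof):
    """Recompute the root from one leaf and its sibling hashes."""
    node = H(leaf)
    for sibling, sibling_is_left in proof:
        node = H(sibling + node) if sibling_is_left else H(node + sibling)
    return node


sentences = [b"sentence %d. " % i for i in range(8)]
levels = build_levels(sentences)
root = levels[-1][0]          # the only value the source author would sign
proof = proof_for(levels, 5)  # three sibling hashes for eight sentences
```

With eight sentences the proof contains three hash values, that is, log_2 8, and a reader who trusts a signature on the root can check a sentence by comparing root_from_proof(sentence, proof) with the signed root.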
