Dissertation Organization - New Studies on Information Hiding via Microsoft Office Documents

Chapter 1 Introduction

1.4 Dissertation Organization

In the remainder of this dissertation, the six regions for data hiding in office documents are explored in Chapter 2, along with surveys of related studies. In Chapter 3, the proposed method for data hiding via change-tracking information and Huffman coding is described. In Chapter 4, the new proposed approach to text quotation authentication is described, while the two-dimensional case is presented in Chapter 5. In Chapter 6, the proposed method for embedding invisible watermarks in slides of a presentation is described. In Chapter 7, the proposed method for hiding data in the structure of drawing object groupings is described. In Chapter 8, the proposed new approach for embedding removable visible watermarks into images is described. Finally, in the last chapter, conclusions of this study and some suggestions for future research are included.

Chapter 2 Six Areas for Researches of Data Hiding via Office Documents and Surveys of Related Studies

Office documents are very versatile and contain many types of contents, including rich text, images, drawings, videos, or even other office documents. We describe below six research directions using office documents for data hiding applications, and describe related works that can be used for hiding in the different regions.

2.1 Data Hiding via Texts

Most office documents contain texts, and so data hiding techniques such as linguistic steganography [11] that apply to the text itself can be used for data hiding via office documents. One approach to data hiding via texts is to generate the text content directly based on the data to be embedded, which is sometimes called text mimicking. By storing the text generated in such a way in an office document, the document can be used for covert communication because the intended receiver can easily extract the text contained within the office document and decode it to recover the secret message contained therein.

A number of methods have been proposed in the past for text mimicking, such as using probabilistic context-free grammars [12], [13] for generating grammatically correct (though sometimes illogical) texts; or using several predefined sentence structures with swappable verbs, adverbs, adjectives, and other parts of speech [14]-[15] for embedding information.
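The idea of predefined sentence structures with swappable words can be illustrated with a minimal sketch. The template, the word lists, and the function names below are hypothetical and chosen only for illustration; they are not taken from the cited methods. Each slot offers two interchangeable words, so each slot carries one bit.

```python
# Hypothetical word lists: each slot has two choices, so it encodes one bit.
SLOTS = [
    ("Alice", "Bob"),        # slot 0: subject
    ("quickly", "quietly"),  # slot 1: adverb
    ("reads", "writes"),     # slot 2: verb
    ("long", "short"),       # slot 3: adjective
]
TEMPLATE = "{} {} {} a {} report."
# Token positions of the four slots inside the rendered sentence.
SLOT_POSITIONS = [0, 1, 2, 4]


def embed(bits):
    """Render one sentence whose word choices encode the given bits."""
    if len(bits) != len(SLOTS):
        raise ValueError("this toy template carries exactly 4 bits")
    words = [choices[bit] for choices, bit in zip(SLOTS, bits)]
    return TEMPLATE.format(*words)


def extract(sentence):
    """Recover the bits by looking up which word fills each slot."""
    tokens = sentence.rstrip(".").split()
    return [SLOTS[i].index(tokens[pos]) for i, pos in enumerate(SLOT_POSITIONS)]


if __name__ == "__main__":
    stego = embed([1, 0, 1, 1])
    print(stego)
    print(extract(stego))
```

A practical scheme would need many templates and much larger word lists so that the generated text does not repeat itself conspicuously.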

Figure 2.1 shows an example of a message generated by spammimic [16], a web-based steganography tool that uses context-free grammars to generate spam-like texts, with the secret message “Hello NCTU!” embedded.

Another approach to data hiding via texts is to apply semantically equivalent transformations of the text based on the embedded message. Examples include replacing words with their synonyms [15], [17]; performing syntactic transformations [18] like passivization (rendering a sentence into the passive form) and clefting (changing a simple sentence into a complex sentence with a main clause and a dependent clause)³ [19] on a sentence’s structure with little effect on its meaning; or performing one-way or multi-way⁴ machine translations on a text [20]-[21].

Dear Friend , Especially for you - this breath-taking news . If you are not interested in our publications and wish to be removed from our lists, simply do NOT respond and ignore this mail . This mail is being sent in compliance with Senate bill 1916 ; Title 7 ; Section 302 . THIS IS NOT MULTI-LEVEL MARKETING ! Why work for somebody else when you can become rich as few as 33 days ! Have you ever noticed society seems to be moving faster and faster & nobody is getting any younger ! Well, now is your chance to capitalize on this ! WE will help YOU increase customer response by 160% plus process your orders within seconds ! You can begin at absolutely no cost to you ! But don't believe us ! Ms Jones who resides in New Hampshire tried us and says "My only problem now is where to park all my cars" ! We are licensed to operate in all states ! We beseech you - act now . Sign up a friend and your friend will be rich too ! Thank-you for your serious consideration of our offer !

Figure 2.1. A spam-like message generated with spammimic [16] with the secret message “Hello NCTU!” embedded.

A third approach to data hiding via texts is to use invisible characters or make small-scale modifications to the text so that the change is not noticeable. Examples of this approach include adding or removing spaces before or after punctuation marks and symbols; using two spaces instead of one and vice versa; introducing occasional typos or misspellings [22]; inserting non-visible special characters such as unused ASCII codes [23]-[24], directional formatting codes, or Unicode joiner characters [25]; or replacing characters by identical-looking or similar-looking alternatives, such as replacing a space by its non-breaking version [26] or using alternative character sequences that produce identical rendering [27].
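As a rough illustration of the "two spaces instead of one" idea, the sketch below encodes one bit in each gap between words: a single space stands for 0 and a double space for 1. The function names and the assumption that the cover text contains only single spaces are ours, not part of any cited method.

```python
import re


def embed(cover, bits):
    """Encode bits in inter-word gaps: one space = 0, two spaces = 1.

    Assumes the cover text separates words with single spaces only.
    """
    words = cover.split(" ")
    if len(bits) > len(words) - 1:
        raise ValueError("cover text has too few word gaps for the payload")
    out = [words[0]]
    for i, word in enumerate(words[1:]):
        # Gaps beyond the payload length are left as single spaces.
        sep = "  " if i < len(bits) and bits[i] else " "
        out.append(sep + word)
    return "".join(out)


def extract(stego, n_bits):
    """Read the first n_bits gaps back as bits."""
    gaps = re.findall(r" +", stego)
    return [1 if len(gap) == 2 else 0 for gap in gaps[:n_bits]]


if __name__ == "__main__":
    stego = embed("the quick brown fox jumps", [1, 0, 1])
    print(repr(stego))
    print(extract(stego, 3))
```

The receiver must know the payload length (here `n_bits`) in advance, or a length field or terminator has to be embedded as well.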

As mentioned previously, it is possible to apply the aforementioned techniques for data hiding via office documents. The feasibility of this approach is demonstrated in Chapter 3, where the text in a Microsoft Word document is modified in a certain way using some of the techniques described above for the steganography application. In addition, since techniques of data hiding via texts sometimes produce illogical texts that are susceptible to human inspection, it is proposed to leverage the collaborative nature of office document editing to make covert communication more effective in the face of steganalysis [28]-[30] and active warden attacks [31], which are discussed in more detail in Chapter 3.

3 An example of passivization is to change the sentence “Renee gave a speech” into “A speech was given by Renee,” and an example of clefting is to change the sentence “We are looking for Biwi” into “It is Biwi whom we are looking for.”

4 For example, translating a text from English to Chinese and then back to English, or from English to Japanese, to Korean, and then to French.

2.2 Data Hiding via Text Formatting and Layout

Office documents such as Microsoft Word documents allow very flexible formatting and layout of texts, including the precise controllability of text font sizes, colors, cases, styles, and effects; selection of various list and numbering options; flexible adjustability of inter-word, inter-line, and inter-paragraph spacings as well as line, tab, and paragraph indentations; setting of page and section margins; and so on.

It is possible to embed information into an office document by making small adjustments to the above-mentioned attributes in ways similar to those proposed for other media types. For example, Maxemchuk et al. [32]-[33] proposed to shift word and line spacings slightly (such as by 1/150 or 1/300 inch) in a document image to embed information; Zhong et al. [34] modified the spacing between characters within a line in a PDF to embed data; Villán et al. [35] proposed to use color quantization to store data in electronic or printed documents; and Walton [36] described a technique of replacing the least-significant bits (LSBs) of the pixels of a cover image to embed information. Specifically, instead of shifting word or line spacings in a text image, we can modify the word or line spacing attributes in an office document slightly to embed information. Likewise, instead of changing the LSBs of pixel values, we can change the LSBs of the text color values in an office document.
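The LSB replacement idea applied to a text color can be sketched as follows. This is only an illustration under our own assumptions: a color is taken to be a 24-bit RGB triple, one bit is stored per channel, and the helper names are hypothetical. How the color is actually read from or written to a particular office document format is not shown.

```python
def embed_bits_in_color(rgb, bits):
    """Replace the least-significant bit of each RGB channel with a payload bit."""
    return tuple((channel & ~1) | bit for channel, bit in zip(rgb, bits))


def extract_bits_from_color(rgb):
    """Read the least-significant bit of each RGB channel."""
    return tuple(channel & 1 for channel in rgb)


def to_hex(rgb):
    """Format an RGB triple as a six-digit hexadecimal color string."""
    return "{:02X}{:02X}{:02X}".format(*rgb)


if __name__ == "__main__":
    black = (0, 0, 0)
    stego = embed_bits_in_color(black, (1, 0, 1))
    print(to_hex(black), "->", to_hex(stego))
    print(extract_bits_from_color(stego))
```

Each channel changes by at most one intensity level out of 256, which is why pure black and the resulting very dark color are hard to tell apart by eye on a white background.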

Figure 2.2 shows an example of applying the technique of LSB replacement on text colors in an office document, where the word “Partial” in the first bullet-point in a slide is changed from completely black to a very dark gray. Such a modification is imperceptible, as seen in the left slide in the figure. The right slide in Figure 2.2 shows the result of applying automatic style formatting to the left slide, where the white background is changed into a dark blue one, and the black text color is changed into white. In this case, the data previously embedded using LSB replacement is still intact since the color remains unchanged as dark gray. However, the color modification is no longer imperceptible.

The challenges of using LSB replacement for data hiding via office documents in the presence of automatic style formatting, as well as other attacks such as copying-and-pasting of contents, are discussed in more detail in Chapter 6, where a novel technique is proposed for effective data hiding in slide presentations.

(a) Slide “Sign Extension” with two bullet points: “Partial products of equal weight are added together before being summed to next higher partial product weight” and “Create look-up table of summed partial products.”

(b) The same slide with a design template applied.

Figure 2.2. Illustration of slide designs. (a) A slide from a tutorial from Xilinx, Inc. with black texts on white background; (b) the slide in (a) with a slide design template of bluish background applied.

2.3 Data Hiding via Multimedia Contents

Office documents can contain an assortment of multimedia contents such as drawings, images, videos, and audios. Office software suites that are in common use today typically cannot manipulate audio or video contents, so these media are often stored as standalone files and an office document simply stores a reference to the external file. On the other hand, techniques proposed for covert communication via text images [37]-[39] can be applied to images embedded in an office document for the purpose of conveying a secret message via office documents. Such an approach is desirable as the sending of a text image such as a hand-written signature by itself is relatively improbable compared to the case of embedding it in an office document. Also, steganalysis of a text image inside an office document is computationally more expensive than processing a stand-alone text image.

Compared to the cases of embedding text images, it is more common to embed various color images into an office document such as a slide presentation to illustrate or emphasize the key points mentioned in the document. One can thus use techniques proposed for embedding secret information into color images [40], [41] for the steganography application using a similar method as that mentioned previously. One may also use techniques proposed for embedding watermarks into images [42]-[48] for the copyright protection application by embedding watermarked images into an office document.

Digital watermarking methods for images are usually categorized into two types: invisible and visible⁵. The first type aims to embed copyright information imperceptibly into host media such that in cases of copyright infringements, the hidden information can be retrieved to identify the ownership of the protected host. It is important for the watermarked image to be resistant to common image operations to ensure that the hidden information is still retrievable after such alterations. Methods of the second type, on the other hand, yield visible watermarks which are generally clearly visible after common image operations are applied. In addition, visible watermarks convey ownership information directly on the media and can deter attempts of copyright violations.

Embedding of watermarks, either visible or invisible, degrades the quality of the host media in general. A group of techniques, named reversible watermarking [49]-[59], allows legitimate users to remove the embedded watermark and restore the original content as needed. However, not all reversible watermarking techniques guarantee lossless image recovery, which means that the recovered image is identical to the original, pixel by pixel. Lossless recovery is important in many applications where serious concerns about image quality arise. Some examples include forensics, medical image analysis, historical art imaging, and military applications.

Lossless visible watermarking is mentioned relatively rarely in the literature compared with its invisible counterpart. Several lossless invisible watermarking techniques have been proposed in the past. The most common approach is to compress a portion of the original host and then embed the compressed data together with the intended payload into the host [52]-[54]. Another approach is to superimpose the spread-spectrum signal of the payload on the host so that the signal is detectable and removable [42]. A third approach is to manipulate a group of pixels as a unit to embed a bit of information [55]-[57]. Although one may use lossless invisible techniques to embed removable visible watermarks [51], [58], the low embedding capacities of these techniques hinder the possibility of implanting large-sized visible watermarks into host media.

5 There is also the “cocktail” watermarking scheme [48] that embeds both types of watermarks simultaneously into an image, which makes it harder for an attacker to remove both types of watermarks.

As to lossless visible watermarking, the most common approach is to embed a monochrome watermark using deterministic and reversible mappings of pixel values or DCT coefficients in the watermark region [50], [59]. Another approach is to rotate consecutive watermark pixels to embed a visible watermark [59]. One advantage of these approaches is that watermarks of arbitrary sizes can be embedded into any host image. However, only binary visible watermarks can be embedded using these approaches, which is too restrictive since most company logos are colorful.

In Chapter 8, we describe a new method for lossless visible watermarking which allows the embedding of different types of visible watermarks into cover images, including the embedding of non-uniformly translucent full-color ones such as that illustrated in Figure 2.3 below. Such watermarks provide significantly better advertising effects than traditional monochrome ones when the images are embedded within office documents.

Figure 2.3. An image of Lena with a translucent full-color watermark “Globe” superimposed.

2.4 Data Hiding via Multimedia Formatting and Layout

In addition to hiding data inside the multimedia contents themselves, it is also possible to leverage the formatting or the layout of the multimedia contents embedded in an office document for data hiding applications. For example, images are often created in external programs and then embedded in office documents. For convenience, office application suites often allow these images to be adjusted, including their brightness and contrast values, displayed sizes, amounts of cropping for the four edges, and positioning properties. Many of these formatting or layout properties may be used for various data hiding applications.

On the other hand, drawings are often created inside an office document using the office application software. Such drawings are usually vector drawings that contain objects of different shapes and sizes with uniform or gradient fills. Data hiding in vector drawings is less studied than data hiding in images, due to the relatively small amount of information in such media that can be manipulated.

Data hiding in a vector drawing is most commonly achieved by altering the geometry or positioning of the shapes in the drawing to embed data, the manipulation of which can be done in the spatial domain or in one of the transform domains such as DFT, DWT, and DCT [60]-[66]. Kwon et al. [61] embedded invisible watermark signals into lines, arcs, and circles in a CAD drawing by modifying their lengths, angles, and radii, respectively. Detection of the watermark, however, requires the use of the original drawing. Solachidis and Pitas [62] achieved blind watermark detection by modifying the coordinates of the vertices in a polygonal line using Fourier descriptors. The embedded watermark is resilient to scaling, rotation, and translation attacks, but vulnerable to distortion attacks. The method was later enhanced by Doncel et al. [63]. Im et al. [64] proposed the use of wavelet descriptors for embedding watermarks that are robust against global and local geometrical distortions.

It is noted that techniques that manipulate the internal coordinates of a shape itself cannot be applied to drawings such as flowcharts, network topologies, floor plans, and circuit diagrams, because objects in these diagrams come from stencils. Figure 2.4 shows an example of a floor plan drawing created in Microsoft Visio, where the shapes representing desks, chairs, servers, walls, doors, etc. all come from standard stencils, and cannot be individually altered. In Chapter 7, we describe how to manipulate the way that drawing objects are embedded in a Microsoft Visio drawing for data hiding applications.

Another technique for data hiding in multimedia formatting is that proposed by Yang and Chen [67], where the animation effects of objects in a Microsoft PowerPoint presentation are modified according to an animation codebook to embed a secret message for the steganography application. The work was later extended by Jing et al. [68] by further leveraging the animation timing effect variations for message embedding. One advantage of these techniques and that proposed in Chapter 7 is that the main content in the document is not distorted during message embedding. Another advantage is that these techniques can in general be used in conjunction with each other to extend the data hiding capacity as well as increase the complexity of steganalysis.

Figure 2.4. A floor plan diagram of an office composed of different objects from stencils.

2.5 Data Hiding via Auxiliary Data

Another approach to data hiding via an office document is simply to store information inside document metadata [69] such as the author, organization, description, and keyword fields that generally allow arbitrary information to be entered and stored. Liu et al. [70] proposed to store a secret message inside the notes pages of a Microsoft PowerPoint document. The embedding is made innocuous by generating the notes based on the sentences contained in the slides.

An interesting type of auxiliary data that can be embedded into an office document is program code, or macros [71]. Normal uses of macros can make document processing easier and more efficient [72], but macros can also be used for new approaches to active data hiding [73]. However, since malicious code such as viruses and worms can easily be embedded into macros, their use is being limited by anti-virus software applications as well as the office applications themselves.

The technique of embedding information inside document auxiliary data is suitable for data hiding applications such as data association or media authentication (the technique is used in Chapter 4 and Chapter 5 for exactly these purposes), but is in general undesirable for applications such as copyright protection. This is because document metadata can usually be modified or removed easily without affecting the main content of the document. In fact, Microsoft has provided detailed how-to documents as well as tools [74], [75] for removing information embedded in the metadata of an office document.

2.6 Data Hiding via Physical File Formats

Data hiding via the physical file format of office documents has gained research traction recently, thanks to Microsoft’s adoption of standardized file formats and opening-up of previously proprietary binary formats. One approach to data hiding via physical document files is to utilize unused spaces such as slack spaces at the end of data streams in a file [76] or redundant data that are created during consecutive file updates to a document [77], [78].

Another approach to data hiding via physical file formats is to exploit the forward-compatible nature of the document format, that is, application software will typically silently ignore unknown data blocks encountered while reading a file. Park et al. [79] described how unknown parts and unknown relationships in Office Open XML documents (each of which is a zipped file containing XML documents and other supporting files⁶) can be used for steganography applications.
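The container-level mechanics of adding an unknown part can be sketched with Python's standard `zipfile` module. This is only an illustration on a toy ZIP archive: the helper names, the part name `customData/note.bin`, and the demo package are hypothetical, the demo is not a valid Office Open XML package, and the specific procedure of [79] is not reproduced here. Whether a given office application tolerates such an extra part is not verified by this sketch.

```python
import io
import zipfile


def make_demo_package():
    """Build a toy ZIP archive standing in for a document package (not valid OOXML)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("word/document.xml", "<document>cover text</document>")
    return buf.getvalue()


def add_unknown_part(package_bytes, part_name, payload):
    """Copy every existing entry and append one extra entry holding the payload."""
    out_buf = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as src, \
            zipfile.ZipFile(out_buf, "w", zipfile.ZIP_DEFLATED) as dst:
        for item in src.infolist():
            dst.writestr(item, src.read(item.filename))
        dst.writestr(part_name, payload)
    return out_buf.getvalue()


def read_part(package_bytes, part_name):
    """Return the raw bytes of one entry in the archive."""
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as zf:
        return zf.read(part_name)


if __name__ == "__main__":
    cover = make_demo_package()
    stego = add_unknown_part(cover, "customData/note.bin", b"secret")
    print(read_part(stego, "customData/note.bin"))
```

The existing entries are copied unchanged, so the visible content of the package is untouched; only the list of archive entries reveals the extra part.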

Finally, since the standard-based office document formats such as Office Open XML and OpenDocument are (compressed) XML files, one may use data hiding techniques proposed for XML files on such documents. For example, the five techniques proposed by Inoue et al. [80] for embedding data into XML documents may be applied to office documents for data hiding applications: 1) alternate representation of empty elements; 2) use of white spaces in tags; 3) utilizing the order of appearance of elements; 4) utilizing the order of appearance of attributes; and 5) alternate representation of elements that can contain other elements.
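The first of the five techniques, alternate representation of empty elements, can be illustrated with a minimal sketch. The bit assignment (`<e/>` for 0, `<e></e>` for 1), the function names, and the restriction to attribute-free empty elements are our own simplifying assumptions and are not necessarily those of [80]. The text is processed as a string because the two forms are equivalent in XML and an XML parser would not preserve the difference.

```python
import re

# Matches an attribute-free empty element in either form: <e/> or <e></e>.
EMPTY = re.compile(r"<(\w+)\s*/>|<(\w+)></\2>")


def embed(xml, bits):
    """Rewrite successive empty elements so that their form encodes the bits."""
    remaining = iter(bits)

    def rewrite(match):
        name = match.group(1) or match.group(2)
        bit = next(remaining, None)
        if bit is None:
            return match.group(0)  # payload exhausted: leave the element as is
        return f"<{name}></{name}>" if bit else f"<{name}/>"

    return EMPTY.sub(rewrite, xml)


def extract(xml, n_bits):
    """Read the form of the first n_bits empty elements back as bits."""
    bits = [0 if match.group(1) else 1 for match in EMPTY.finditer(xml)]
    return bits[:n_bits]


if __name__ == "__main__":
    stego = embed("<r><b/><i/><u/></r>", [1, 0, 1])
    print(stego)
    print(extract(stego, 3))
```

The capacity is one bit per empty element, so the usable payload depends on how many such elements the cover document happens to contain.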

6 This is also true for the OpenDocument format.

2.7 Summary

In this chapter we presented six areas for data hiding via office documents and point out
