A Study on New Techniques for Data Hiding via Microsoft Office Documents

Student: Tsung-Yuan Liu (劉宗原)

Advisor: Dr. Wen-Hsiang Tsai (蔡文祥)

A Dissertation Submitted to

Institute of Computer Science and Engineering

College of Computer Science

National Chiao Tung University

in Partial Fulfillment of the Requirements for the Degree of

Doctor of Philosophy

in Computer and Information Science

July 2010

Hsinchu, Taiwan, 300

Republic of China

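Abstract (Chinese Version)

The rapid development of digital information processing and Internet technologies has made data hiding techniques increasingly important and their applications more diverse. Current research focuses on hiding information in image, audio, and video files, while Microsoft Office documents, which are routinely created, used, and exchanged in industry, government, and academia, have rarely been studied. The formats and characteristics of such files differ greatly from those of image, audio, and video files, so novel methods are needed to achieve goals such as copyright protection, data authentication, and secret transmission, which makes the topic well worth studying. This dissertation examines the characteristics of Microsoft Office documents and proposes six research areas for hiding information in Office documents: hiding information in the text of a document, in text formatting and layout, in embedded multimedia objects, in the formatting and layout of embedded objects, in the auxiliary data of a document, and in the physical file format. This dissertation also proposes six specific new data hiding methods and applications, which apply to the common Microsoft Word, Microsoft Excel, Microsoft PowerPoint, and Microsoft Visio file types.

First, exploiting the fact that Microsoft Office documents can be edited by multiple people, this dissertation proposes a new method that hides secrets in Microsoft Word documents by using change-tracking information and the Huffman coding technique. For applications in which document contents are frequently quoted, we propose two hash value and signature processing techniques, the Multi-Use Signatures Technique (MUST) and the Tree-Root Uni-Signature Technique (TRUST), and combine them with data hiding techniques to verify efficiently the source of quoted information in Word documents. We also propose two two-dimensional hash value and signature processing techniques, 2D-MUST and 2D-TRUST, to verify the source of quoted tables from two-dimensional Microsoft Excel spreadsheets, and we apply 2D-MUST to detect possible tampering with spreadsheet contents. To address the fact that document contents are often cut and pasted, copied, and collected, this dissertation proposes a new method that embeds invisible watermarks into PowerPoint and similar documents for purposes such as source tracking, using transparent character colors and a pseudo-random embedding order with weighted tallying. In addition, Microsoft Office documents often contain various images and drawings. This dissertation proposes a new method that hides information in the nested grouping relationships of objects, and a new method that embeds reversible visible watermarks into images by using a novel theory of compound one-to-one mappings. The proposed reversible visible watermarking method can embed several kinds of watermarks, such as opaque monochrome watermarks and translucent full-color watermarks. All six methods are original, and the experimental results show that the proposed methods are feasible and practical.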


Abstract

With the advancement of digital information processing and Internet technologies, the field of data hiding has become increasingly important, and its applications have become increasingly diverse. Many techniques have been proposed for hiding data in images, video, and audio, but relatively little research has been devoted to data hiding in the popular Microsoft Office documents. Microsoft Office documents come in very different formats and have unique characteristics compared to images, video, and audio, so new techniques are needed for embedding data in such media for purposes such as copyright protection, covert communication, and authentication. In this study, we investigate the characteristics of Microsoft Office documents pertaining to data hiding applications and identify six areas for research on data hiding via such documents: data hiding via texts; data hiding via text formatting and layout; data hiding via multimedia contents; data hiding via multimedia formatting and layout; data hiding via auxiliary data; and data hiding via physical file formats. We also propose six specific new methods and applications for hiding data in Microsoft Word, Excel, PowerPoint, and Visio documents.

First, exploiting the characteristic that documents can be written by multiple authors, a new method is proposed for embedding data in Microsoft Word documents for the purpose of covert communication by using change-tracking information and the Huffman coding technique. Then, to tackle the problem that contents in a document are often cited and included in another document, and that there is a need to authenticate the fidelity and source of the cited content, a method is proposed which combines data hiding techniques with two different hash value processing techniques, MUST and TRUST, that can efficiently verify the fidelity of cited contents in a Word document. Furthermore, two two-dimensional hash value processing techniques, 2D-MUST and 2D-TRUST, are proposed that allow quotations in the form of a two-dimensional table from a Microsoft Excel spreadsheet to be authenticated. The 2D-MUST is also demonstrated to allow effective fidelity authentication and modification detection of spreadsheet contents. To address the characteristic that contents within Microsoft Office documents are often moved, copied, and collected together, a new method is proposed for embedding invisible watermarks into slide presentations for the purpose of source tracking by using blank space coloring and weighted voting techniques. Finally, for rich media such as drawings and images contained in Microsoft Office documents, two data hiding methods are proposed. The first uses the different nested grouping relationships of objects to embed information in Microsoft Visio drawings, and the second uses a new generic approach of compound one-to-one mappings to embed completely removable visible watermarks into images. The latter method is shown to be able to embed opaque monochrome watermarks as well as translucent full-color visible watermarks, which, to the best of the author’s knowledge, is the first such result in the literature. Experimental results are included to demonstrate the feasibility of all the proposed methods.


Acknowledgements

I would like to express my sincere appreciation to my advisor, Professor Wen-Hsiang Tsai, for his patience and kind guidance throughout the course of this dissertation study. I would also like to acknowledge the very helpful comments and suggestions from the members of the oral defense committee and also those from the reviewers for parts of this dissertation that were submitted for journal publication. Thanks are also extended to the colleagues in the Computer Vision Laboratory at National Chiao Tung University for their valuable help and comments during this study.

I would also like to acknowledge the financial support received from the National Science Council and the ZyXEL Scholarship, as well as the support from my current employer, Google, during the course of this dissertation study.

Finally, I am so grateful to my parents, brothers, wife, and children for their love, support, and endurance. This dissertation is dedicated to them.


List of Tables

Table 1.1. A summary of requirements for different data hiding applications.

Table 3.1. Occurrence probabilities and Huffman codes for the entries in an example degeneration set of “travel.”

Table 3.2. Summary of common English errors database used as Rd.

Table 3.3. Occurrence probabilities and Huffman codes of “study” and its synonyms.

Table 3.4. Experimental results of message embedding capacity and increase in file size.

Table 4.1. TRUST tree of hash values for five sentences.

Table 4.2. Summary of total overhead sizes and signature sizes of the proposed techniques.

Table 5.1. Hash values selected in the 2D-TRUST complementary hash set when quoting a cell s_{3,2} from a 5×5 spreadsheet.

Table 5.2. Summary of signature sizes and total overhead sizes of the proposed techniques.

Table 6.1. Characteristics of presentations used in the experiments.

Table 7.1. Distances between all pairs of objects in Figure 7.3.

Table 7.2. Experimental results of embedding capacity for different drawings.

Table 8.1. Characteristics of watermarks A through G used in experiments.

Table 8.2. Comparison of reversible visible watermarking techniques.


List of Figures

Figure 2.1. A spam-like message generated with spammimic [16] with the secret message “Hello NCTU!” embedded.

Figure 2.2. Illustration of slide designs. (a) A slide from a tutorial from Xilinx, Inc. with black texts on white background; (b) the slide in (a) with a slide design template of bluish background applied.

Figure 2.3. An image of Lena with a translucent full-color watermark “Globe” superimposed.

Figure 2.4. A floor plan diagram of an office composed of different objects from stencils.

Figure 3.1. Screenshot of Microsoft Word in a case of collaborative document authoring.

Figure 3.2. Author A sends a stego-document S with embedded message M to a recipient B after embedding M into a cover document D to form S that appears to be the collaborative product of multiple authors A and A'.

Figure 3.3. Huffman tree constructed by Algorithm 3.2 for the entries listed in Table 3.1.

Figure 3.4. Extracts of stego-documents produced using the proposed method with databases 1 and 3.

Figure 3.5. Extracts of stego-documents produced using the proposed method with databases 1 and 2.

Figure 4.1. Illustration of processes performed by and information passed between a source author, one or more document authors, and a document reader.

Figure 4.2. A screenshot of Microsoft Word with the prototype add-in installed, which has added buttons in the toolbar for the purpose of quotation authentication.

Figure 5.1. Two-dimensional quotation in a spreadsheet document. (Source: Google Investor Relations)

Figure 5.2. Illustration of cascaded hash value calculation for a cell s_{x,y} in 2D-MUST.

Figure 5.3. Experimental result of spreadsheet authentication using an add-in that implements the proposed 2D-MUST.

Figure 6.1. Illustration of slide designs. (a) A slide from a tutorial from Xilinx, Inc. with black texts on white background; (b) the slide in (a) with a slide design template of bluish background applied.

Figure 6.2. Illustration of watermark image embedding using blank space coloring.

Figure 6.3. Two series of watermark logos with different percentages of blocks reconstructed.

Figure 6.4. Illustration of watermark reconstruction coverage. (a) Three watermarks each with

Figure 6.5. An experimental result of file format conversion. (a) Two slides in Microsoft PowerPoint. (b) The two slides after file format conversion from PPT to ODP and back.

Figure 6.6. Plot of average correct watermark pixel extractions from presentations constructed from randomly drawn slides.

Figure 6.7. An experimental result of the three extracted watermarks with N ranging from 3 to 10.

Figure 6.8. Normalized plot of average correct watermark pixel extractions from presentations constructed from randomly drawn slides.

Figure 7.1. A floor plan diagram of an office composed of different objects from stencils.

Figure 7.2. Illustration of object groupings for data embedding in a drawing.

Figure 7.3. A simple drawing used as an example for embedding by object grouping.

Figure 7.4. Resulting structure of object groupings of Figure 7.3 after embedding 1010010011.

Figure 7.5. A network layout diagram used in the experiments (source: UCF).

Figure 8.1. An illustration of mapping the center pixel of a 3×3 image using Algorithm 8.1. Only the mapping of the center pixel is shown for clarity; the east and south pixels are depicted as TBD (to be determined) in W.

Figure 8.2. An illustration of pixels in a watermark. (a) A monochrome watermark. (b) Area of P (yellow pixels). (c) Area of P' (yellow pixels).

Figure 8.3. Experimental results of monochrome watermark embedding and removal. (a) Image Lena. (e) Image Sailboat. (b) and (f) Watermarked images of (a) and (e), respectively. (c) and (g) Images losslessly recovered from (b) and (f), respectively, with correct keys. (d) and (h) Images recovered from (b) and (f) with incorrect keys.

Figure 8.4. Watermarked image of Lena with a translucent image of “Globe” superimposed using alpha blending.

Figure 8.5. Illustration of pixel processing order in watermark embedding and removal. (a)-(d) Intermediate results of image watermarking when 25%, 50%, 75%, and 100% of the watermark pixels have been processed, respectively. (e)-(h) Intermediate results of image recovery when 25%, 50%, 75%, and 100% of the watermark pixels have been recovered, respectively.

Figure 8.6. Test images used in experiments: (a) Lena; (b) Baboon; (c) Jet; (d) Sailboat; (e) A satellite image of NCTU campus; and (f) Pepper.

Figure 8.8. Average values of PSNR_W obtained after watermark embedding and average values of PSNR_R and PSNR_Q obtained after illicit image recoveries. (a) Results yielded by parameter randomization. (b) Results yielded by mapping randomization.

Figure 8.9. Watermarked images, licitly recovered images, and illicitly recovered images. (a)-(c) Watermarked images. (d)-(f) Licitly recovered images from images (a)-(c), respectively. (g)-(i) Illicitly recovered images from images (a)-(c), respectively.

Figure A.1. Code excerpt demonstrating the manipulation of a Microsoft Word document

Chapter 1 Introduction

1.1 Scope of Data Hiding Research

Data hiding is the study of embedding data into various media such that the information is accessible for later use. The media into which data are embedded are called cover media¹, such as cover documents, cover images, and cover videos, and the resulting media with the data embedded are usually called stego-media², such as stego-documents, stego-images, and stego-videos. The embedded data can serve various applications, including covert communication, copyright protection, data association, and media authentication [1]. These applications are described in the following.

¹ Cover media can also be called host media or carrier media in the literature.

²

1. Covert communication – data is hidden imperceptibly in a cover medium for the purpose of secret communication. For example, a sender who wishes to transmit a secret message to a receiver employs appropriate data hiding techniques to embed the message into a cover medium, producing a stego-medium. In this way, only the intended receiver can identify and retrieve the data hidden in the stego-medium. Data hiding for the purpose of covert communication is sometimes called steganography. In contrast to data protection techniques such as encryption, which prevent observers from knowing the secret being transmitted, the goal of steganography is to conceal the very act of secret message transmission from outside observers.

2. Copyright protection – the data embedded into a cover medium is used to identify the copyright holder of the medium. The data embedded is usually called a watermark, and can be visible or invisible. A visible watermark advertises the copyright holder directly on a stego-medium and can therefore deter attempts at copyright infringement, though at the cost of degrading the quality of the cover medium. On the other hand, a cover medium embedded with an invisible watermark is usually of higher quality, but the copyright holder will need to prove the presence of the invisible watermark after an act of copyright infringement. In both the visible and invisible cases, the embedded watermark should be robust against removal attacks.

3. Data association – the data embedded into a cover medium is the related information, such as the metadata, origin, and change history of the medium. Data hidden in this way facilitate the transmission and storage of the stego-medium along with the related information. It is desirable for the embedded data to be resilient against modifications to the stego-medium by programs unaware of the data association application.

4. Media authentication – the data embedded in a cover medium is used to verify the fidelity or the integrity of the stego-medium. This is important when a stego-medium needs to be transmitted over an insecure channel, where an attacker can alter the contents of the stego-medium.

A summary of the requirements for the above-mentioned data hiding applications is shown in Table 1.1 below. It is noted that even if an attacker is not confident of the presence of hidden data in the covert communication application, a cautious attacker (often called an active warden in the literature) can nonetheless choose to perform small-scale transformations on all passing media. If the media carry no hidden information, the transformations are harmless to the communicating parties in normal circumstances; if they do carry hidden information, the transformations may destroy the embedded data.
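The effect of an active warden can be illustrated with a minimal Python sketch. It assumes a deliberately fragile toy scheme that is not one of the methods proposed in this study: each message bit is stored as a trailing space (bit 0) or a trailing tab (bit 1) at the end of a line of text. The function names `embed`, `extract`, and `warden` are hypothetical and chosen only for this illustration.

```python
def embed(lines, bits):
    """Append a space (bit 0) or a tab (bit 1) to the first len(bits) lines."""
    if len(bits) > len(lines):
        raise ValueError("cover text has too few lines")
    marks = {"0": " ", "1": "\t"}
    stego = [line + marks[b] for line, b in zip(lines, bits)]
    return stego + lines[len(bits):]


def extract(lines):
    """Read bits back from trailing whitespace; unmarked lines carry no bit."""
    bits = ""
    for line in lines:
        if line.endswith(" "):
            bits += "0"
        elif line.endswith("\t"):
            bits += "1"
    return bits


def warden(lines):
    """Active warden: strip trailing whitespace from every passing line."""
    return [line.rstrip() for line in lines]


# Toy cover text; none of its lines ends with whitespace.
cover = ["Dear colleague,", "the report is attached.", "Best regards,", "A."]
stego = embed(cover, "101")
```

Here `extract(stego)` recovers the three embedded bits, while `warden(stego)` returns text identical to the cover, so an innocent sender loses nothing and the hidden message is gone. A scheme intended for covert communication therefore has to survive, or at least discourage, such blanket transformations.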

Table 1.1. A summary of requirements for different data hiding applications.

Columns: Data hiding application; Imperceptibility; Removal robustness (Deliberate; Non-deliberate)

Covert communication: Y *

Copyright protection: Y

Data association: Y

Media authentication: N

In this study we investigate data hiding techniques for office documents, which are digital files used by the collection of programs in office software suites such as Microsoft Office and OpenOffice, with more emphasis on Microsoft Office documents. A number of office document types are in popular use today. The popularity of office documents is largely due to the versatile nature of the documents and the widespread installation of office software suites. Some of the most common office document types are described in the following.

1. Word processing documents – these are very generic documents that can be used for all sorts of printable materials. They can be used, for example, in businesses for preparing letters, contracts, and reports, or in colleges for homework and publications. Such documents are most commonly processed using the applications “Microsoft Word” and “OpenOffice Writer”, and are most commonly saved in the proprietary DOC format, the Office Open XML format (with the file extension “.docx”), or the OpenDocument text format (with the file extension “.odt”). Today’s word processors allow text to be freely styled, and offer many productivity-enhancing tools such as automatic detection or correction of typographical, spelling, or grammatical errors; automatic generation of tables of contents and lists of tables and figures; style management for format consistency; change tracking and commenting for multi-author collaboration; and so on.

2. Spreadsheet documents – these contain sheets of two-dimensional arrays of cells that simulate accounting worksheets. Such documents are most commonly used for numeric calculations or visualization of numerical values, and are most commonly processed using the applications “Microsoft Excel” (most commonly saved with the file extensions “.xls” or “.xlsx”) and “OpenOffice Calc” (saved with the file extension “.ods”).

3. Slide presentation documents – these contain printable or projectable slides that can be used to aid a presentation. Such documents usually contain rich colors, animations, and various multimedia such as videos, images, drawings, or audio clips to emphasize the key points delivered via short sentences or phrases in the slides. The most commonly used presentation editing and viewing applications are “Microsoft PowerPoint” (which typically saves presentations with the file extensions “.ppt”, “.pptx”, or “.pps”) and “OpenOffice Impress” (which typically saves presentations with the file extension “.odp”).

4. Vector graphics documents – these are created using applications such as “Microsoft Visio” (most commonly saved with the file extension “.vsd”) and “OpenOffice Draw” (saved with the file extension “.odg”). Drawing applications typically supply templates containing various graphical objects as well as intelligent connectors connecting the objects to allow efficient drawing of diagrams such as flowcharts and system architectures.

The motivation of this study is given in Section 1.2, followed by the contributions of this study in Section 1.3. Finally, the organization of this dissertation is described in Section 1.4.

1.2 Motivation of Study

Microsoft Office documents and other office documents are widely used in industry, government, and academia. A Google search reveals that over one hundred million such documents are accessible online. Despite the popularity of office documents, few data hiding studies currently address such documents compared to other media such as images, videos, and audio [2]-[6]. One reason is that earlier office documents were in proprietary binary formats. The only mention of data hiding in office documents known to the author that predates our study [7] did not attempt to understand the document format. Instead, it either utilized the slack space at the end of the file or scanned the binary file for consecutive bytes of 00’s or FF’s and replaced them with the intended payload for the purpose of covert communication.
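The format-agnostic byte-scanning idea attributed to [7] can be sketched in a few lines of Python. This is an illustration of the general idea only, not the procedure of [7]: the minimum run length of 8 bytes, the function names `find_filler_runs` and `overwrite_runs`, and the toy byte string are all assumptions made for the example.

```python
import re


def find_filler_runs(data: bytes, min_run: int = 8):
    """Return (offset, length) for each run of >= min_run bytes of 0x00 or 0xFF."""
    pattern = re.compile(rb"\x00{%d,}|\xff{%d,}" % (min_run, min_run))
    return [(m.start(), m.end() - m.start()) for m in pattern.finditer(data)]


def overwrite_runs(data: bytes, payload: bytes, min_run: int = 8) -> bytes:
    """Write the payload into the filler runs in order, keeping the file size."""
    out = bytearray(data)
    remaining = payload
    for offset, length in find_filler_runs(data, min_run):
        chunk, remaining = remaining[:length], remaining[length:]
        out[offset:offset + len(chunk)] = chunk  # same-length overwrite
    if remaining:
        raise ValueError("not enough filler bytes for the payload")
    return bytes(out)


# Toy "file": two filler runs separated by ordinary content.
blob = b"HEAD" + b"\x00" * 10 + b"BODY" + b"\xff" * 9 + b"TAIL"
runs = find_filler_runs(blob)
stego = overwrite_runs(blob, b"secret message")
```

Because such a scheme does not interpret the file format, it cannot tell padding from meaningful zero or 0xFF bytes, and a document that is opened and saved again by its application is likely to lose the payload. This is one motivation for working at the logical level of the document instead.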

Another reason why there is relatively little research in data hiding via office documents is that office documents can be extremely complex and can contain an assortment of heterogeneous, rich contents. The Office Open XML file format, for example, is specified by the ISO/IEC 29500:2008 standard [8], whose four parts contain 5560, 129, 40, and 1464 pages, respectively.

In this study, we focus on techniques that manipulate office documents at the logical level instead of modifying the underlying physical file formats. Such an approach is more generic and simpler, so the proposed techniques can be applied to various office document formats. Also, data embedded into an office document in this way can often survive common editing operations as well as file format conversions. Details of accessing and manipulating office documents at the logical level can be found in Appendix A.

For data hiding to be effective, the characteristics of a cover medium must be taken into account so that the embedded data can be suitably blended with the cover medium for various desired purposes. For example, embedding data into an image would require different data hiding techniques than embedding data into an audio clip. Also, data hiding for different purposes has different considerations and hence requires different data hiding techniques. An invisible watermark embedded in a stego-image for the copyright protection purpose, for example, should be robust against common image operations such as resizing, cropping, and format conversion [9], [10]. On the other hand, data embedded for the purpose of media authentication do not need to be resilient against such modifications.

Office documents have several characteristics that are unique compared to other media such as images or videos, and require new techniques or approaches for effective data hiding. One characteristic of office documents is that they are frequently processed and manipulated by multiple parties. Examples of multi-party collaborations include workflows with office documents as attachments; collaborative authoring of journal manuscripts; and filling-in of forms in the office document formats. The collaborative nature of such use cases facilitates the application of steganography since it is natural for office documents to be transmitted between the collaborating parties. Data hiding applications such as data association and content authentication are also important considerations in such collaborative cases. In this dissertation study, we investigate data hiding techniques and their applications that take into consideration the collaborative nature of office documents.

Another important distinction between office documents and other cover media is the relative ease in copying, editing, and moving around parts of a document. This is especially evident in, for example, slide presentations, where the ordering of slides can be easily changed by drag-and-drop operations using a mouse. Also, it is common to compose a new set of slides by copying slides from several previously-authored slide presentations. The ease of reusing and manipulating portions of an office document poses challenges in copyright protection, data association, and media authentication applications, which are investigated in this study.

In summary, the goal of this research is to study the properties of office documents and propose new data hiding techniques that are suitable for office documents. Studies of data hiding via office documents are still few so far but are of great theoretical as well as practical importance because office documents are very popular and are created, transmitted, and consumed worldwide every day.

1.3 Contributions of This Study

In this study, we investigate and discuss the properties of office documents pertaining to data hiding applications and identify regions of office documents that may be used for data hiding applications. New techniques and approaches for hiding data via office documents are then proposed, with applications that range from covert communication, authentication, and data association to copyright protection, and for office document types that range from word-processing documents, spreadsheets, and slide presentations to drawings (a summary of the proposed methods can be found in Table 9.1 in the last chapter). In more detail, the contributions of this study are as follows.

1. The properties of office documents pertaining to data hiding applications are investigated and six areas for research on data hiding via office documents, including data hiding via texts; data hiding via text formatting and layout; data hiding via multimedia contents; data hiding via multimedia formatting and layout; data hiding via auxiliary data; and data hiding via physical file formats, are identified and discussed.

2. A new approach to covert communication is proposed by using change-tracking information, where the data embedding is disguised such that the stego-document appears to be the product of a collaborative writing effort. Text segments in a document are degenerated to mimic the work of an author with inferior writing skills, with the secret message embedded in the choices of degenerations. The degenerations are then revised with the changes being tracked, making it appear as if a cautious author is correcting the mistakes. The change-tracking information contained in the stego-document allows the original cover, the degenerated document, and hence the secret message, to be recovered. It is proposed to use the Huffman coding technique for determining the choices of degeneration to make the method more innocuous, which is important for the application of covert communication. Also, one of the strengths of the proposed approach is that the extra change-tracking information added during message embedding is vital in a normal collaboration scenario, which discourages a skeptic from casually removing it. Experimental results in Microsoft Word are presented to demonstrate the feasibility of the proposed method.

3. The problem of quotation authentication is investigated in this study to tackle the problem that contents in a document are often cited and included in another document and there is a need to authenticate the fidelity and source of the cited content. A new approach is proposed in this study that combines data hiding techniques with two different hash value processing techniques – the Multi-Use Signatures Technique (MUST) and the Tree-Root Uni-Signature Technique (TRUST) – that can efficiently verify the fidelity of cited contents in a document. Experimental results in Microsoft Word are presented to demonstrate the feasibility of the proposed method.

4. The problem of two-dimensional quotation authentication is described in this study. Such quotations can come from tables in a Word document or spreadsheets from Excel documents. Two two-dimensional hash value processing techniques, 2D-MUST and 2D-TRUST, are proposed that allow efficient generation and verification of the signatures required for authentication of two-dimensional quotations. Furthermore, it is shown that the 2D-MUST can be used for effective authentication and modification detection of spreadsheet content, and experimental results in Microsoft Excel are presented to demonstrate the feasibility of the proposed technique.

5. A new method is proposed for embedding invisible watermarks into office documents to address the characteristic that contents within office documents are often moved, copied, and collected together. In more detail, a watermark image is embedded imperceptibly into the slides of a slide presentation by partitioning the watermark into blocks and embedding them into the space characters existing in the slides in a repeating pseudo-random sequence. The embedding is achieved by changing the colors of the space characters into new ones which are results of encoding the contents and indices of the blocks. The embedded watermark is resilient against many common modifications on slides, including copying and pasting of slides; insertion, deletion and reordering of slides; slide design changes; and file format conversions. A security key is used during embedding and extraction of a watermark, such that if a presentation contains slides taken from presentations watermarked with different security keys, each watermark can be extracted reliably in turn with the respective key using a weighted voting technique also proposed in this study.

6. Data hiding in the drawings and images that can be embedded in office documents is also investigated in this study and two methods are proposed for drawings and images, respectively. The first proposed method embeds information into the structure of object groupings in a drawing. The objects in a drawing are grouped skillfully according to the data being embedded and the geometrical relationship between the objects. The groupings of objects in the stego-drawing are visually imperceptible, and the resulting stego-drawing is robust against translation, scaling, and rotation attacks. The proposed method can be used for data hiding applications such as drawing authentication and covert communication. Experiments conducted on Microsoft Visio drawings confirm the feasibility of the proposed method.

7. A new approach to lossless reversible visible watermarking in images is also proposed, in which deterministic one-to-one compound mappings of the pixel values in an image to those of a watermarked image are performed in such a way that the mappings tend to yield pixel values close to those of the desired visible watermark, making the resulting visible watermark more distinctive. The compound mappings are proved to be reversible, which allows lossless recovery of the original image from the watermarked image. Different types of visible watermarks can be embedded as applications of the proposed generic approach, and two applications have been described where opaque monochrome watermarks as well as translucent color watermarks are embedded. Security protection measures by parameter and mapping randomizations have also been proposed to deter attackers from illicit image recoveries. Experimental results demonstrating the effectiveness of the proposed approach, as well as the invariance of the method when the images are embedded into Microsoft Word or PowerPoint documents, are included.
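The use of Huffman coding for choosing degenerations in the second contribution above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the algorithm given in Chapter 3: the degeneration entries and their occurrence probabilities are hypothetical placeholders, and the names `huffman_codes` and `choose_degeneration` are chosen only for this example. Each entry of a degeneration set receives a prefix-free codeword, and the entry whose codeword matches the next message bits is the one written into the document.

```python
import heapq
from itertools import count


def huffman_codes(probabilities):
    """Build a prefix-free code {entry: bitstring} from {entry: probability}."""
    tiebreak = count()  # keeps heap tuples comparable when probabilities tie
    heap = [(p, next(tiebreak), {entry: ""}) for entry, p in probabilities.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p0, _, codes0 = heapq.heappop(heap)
        p1, _, codes1 = heapq.heappop(heap)
        merged = {e: "0" + c for e, c in codes0.items()}
        merged.update({e: "1" + c for e, c in codes1.items()})
        heapq.heappush(heap, (p0 + p1, next(tiebreak), merged))
    return heap[0][2]


def choose_degeneration(codes, bits):
    """Pick the entry whose codeword is a prefix of bits; return (entry, rest)."""
    for entry, code in codes.items():
        if bits.startswith(code):
            return entry, bits[len(code):]
    raise ValueError("message bits exhausted; pad the message first")


# Hypothetical degeneration set for one word, with made-up probabilities.
probabilities = {"trip": 0.5, "journey": 0.25, "travell": 0.125, "travle": 0.125}
codes = huffman_codes(probabilities)

# Embedding: consume message bits, one degeneration choice per text segment.
message = "110100"
choices, rest = [], message
while rest:
    entry, rest = choose_degeneration(codes, rest)
    choices.append(entry)

# Extraction: the observed choices map back to their codewords.
recovered = "".join(codes[entry] for entry in choices)
```

For message bits that look uniformly random, an entry with a codeword of length k is selected with probability 2^(-k), so frequent degenerations are chosen more often than rare ones. This is one way to read the claim that Huffman coding makes the embedding more innocuous.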

1.4 Dissertation Organization

In the remainder of this dissertation, the six regions for data hiding in office documents are explored in Chapter 2, along with surveys of related studies. In Chapter 3, the proposed method for data hiding via change-tracking information and Huffman coding is described. In Chapter 4, the new proposed approach to text quotation authentication is described, while the two-dimensional case is presented in Chapter 5. In Chapter 6, the proposed method for embedding invisible watermarks in slides of a presentation is described. In Chapter 7, the proposed method for hiding data in the structure of drawing object groupings is described. In Chapter 8, the proposed new approach for embedding removable visible watermarks into images is described. Finally, in the last chapter, conclusions of this study and some suggestions for future research are included.

Chapter 2 Six Areas for Researches of Data Hiding via Office Documents and Surveys of Related Studies

Office documents are very versatile and contain many types of contents, including rich text, images, drawings, videos, or even other office documents. We describe below six research directions using office documents for data hiding applications, and describe related works that can be used for hiding in the different regions.

2.1 Data Hiding via Texts

Most office documents contain texts, and so data hiding techniques such as linguistic steganography [11] that apply to the text itself can be used for data hiding via office documents. One approach to data hiding via texts is to generate the text content directly based on the data to be embedded, which is sometimes called text mimicking. By storing text generated in this way in an office document, the document can be used for covert communication, because the intended receiver can easily extract the text contained within the office document and decode it to recover the secret message.

A number of methods have been proposed in the past for text mimicking, such as using probabilistic context-free grammars [12], [13] for generating grammatically correct (though sometimes illogical) texts; or using several predefined sentence structures with swappable verbs, adverbs, adjectives, and other parts of speech [14]-[15] for embedding information. Figure 2.1 shows an example of a message generated by spammimic [16], a web-based steganography tool that uses context-free grammars to generate spam-like texts, with the secret message “Hello NCTU!” embedded.

Another approach to data hiding via texts is to apply semantically equivalent transformations to the text based on the embedded message. Examples include replacing words with their synonyms [15], [17]; performing syntactic transformations [18] like passivization (rendering a sentence into the passive form) and clefting (changing a simple sentence into a complex sentence with a main clause and a dependent clause)³ [19] on a sentence’s structure with little effect on its meaning; or performing one-way or multi-way⁴ machine translations on a text [20]-[21].
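The synonym-replacement idea can be illustrated with a minimal Python sketch. It is not the method of any of the cited works: the synonym table, the function names `embed` and `extract`, and the convention that the first word of each pair encodes bit 0 are all assumptions made for illustration, and punctuation, capitalization, and word sense are ignored.

```python
# Hypothetical synonym table; each pair encodes one bit (index 0 or 1).
SYNONYM_PAIRS = [
    ("big", "large"),
    ("quick", "fast"),
    ("begin", "start"),
    ("buy", "purchase"),
]
# Map every listed word to its pair and to the bit value it represents.
LOOKUP = {word: (pair, index)
          for pair in SYNONYM_PAIRS
          for index, word in enumerate(pair)}


def embed(cover, bits):
    """Replace each substitutable word by the pair member selected by the next bit."""
    pending = list(bits)
    out = []
    for word in cover.split():
        if word in LOOKUP and pending:
            pair, _ = LOOKUP[word]
            word = pair[pending.pop(0)]
        out.append(word)
    if pending:
        raise ValueError("cover text has too few substitutable words")
    return " ".join(out)


def extract(stego, count):
    """Read back the first `count` bits from the substitutable words."""
    bits = [LOOKUP[word][1] for word in stego.split() if word in LOOKUP]
    return bits[:count]


stego = embed("we begin with a big plan and a quick test", [1, 0, 1])
print(stego)              # we start with a big plan and a fast test
print(extract(stego, 3))  # [1, 0, 1]
```

The receiver needs the same synonym table and the payload length; a practical scheme would also have to check that each substitution preserves the meaning in context.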

Dear Friend , Especially for you - this breath-taking news . If you are not interested in our publications and wish to be removed from our lists, simply do NOT respond and ignore this mail . This mail is being sent in compliance with Senate bill 1916 ; Title 7 ; Section 302 . THIS IS NOT MULTI-LEVEL MARKETING ! Why work for somebody else when you can become rich as few as 33 days ! Have you ever noticed society seems to be moving faster and faster & nobody is getting any younger ! Well, now is your chance to capitalize on this !

WE will help YOU increase customer response by 160% plus process your orders within seconds ! You can begin at absolutely no cost to you ! But don't believe us ! Ms Jones who resides in New Hampshire tried us and says "My only problem now is where to park all my cars" ! We are licensed to operate in all states ! We beseech you - act now . Sign up a friend and your friend will be rich too ! Thank-you for your serious consideration of our offer !

Figure 2.1. A spam-like message generated with spammimic [16] with the secret message “Hello NCTU!” embedded.

A third approach to data hiding via texts is to use invisible characters or make small-scale modifications to the text so that the change is not noticeable. Examples of this approach include adding or removing spaces before or after punctuation marks and symbols; using two spaces instead of one and vice versa; introducing occasional typos or misspellings [22]; inserting non-visible special characters such as unused ASCII codes [23]-[24], directional formatting codes, or Unicode joiner characters [25]; or replacing characters by identical-looking or similar-looking alternatives, such as replacing a space by its non-breaking version [26] or using alternative character sequences that produce identical rendering [27].
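As a rough illustration of the invisible-character idea, the following Python sketch appends one zero-width Unicode character to each word, one payload bit per word. The choice of U+200C for bit 0 and U+200D for bit 1, and the names `embed`, `extract`, and `strip_hidden`, are assumptions for this example only; whether the inserted characters are truly invisible depends on the rendering software.

```python
ZERO = "\u200c"  # ZERO WIDTH NON-JOINER, assumed here to encode bit 0
ONE = "\u200d"   # ZERO WIDTH JOINER, assumed here to encode bit 1


def embed(cover, payload):
    """Append one zero-width character per payload bit to successive words."""
    bits = "".join(f"{byte:08b}" for byte in payload)
    words = cover.split(" ")
    if len(bits) > len(words):
        raise ValueError("cover text is too short for the payload")
    marked = []
    for i, word in enumerate(words):
        if i < len(bits):
            word += ONE if bits[i] == "1" else ZERO
        marked.append(word)
    return " ".join(marked)


def extract(stego):
    """Collect the zero-width characters and regroup them into bytes."""
    bits = "".join("1" if ch == ONE else "0"
                   for ch in stego if ch in (ZERO, ONE))
    usable = len(bits) - len(bits) % 8
    return bytes(int(bits[i:i + 8], 2) for i in range(0, usable, 8))


def strip_hidden(stego):
    """Return the visible text, i.e. the stego text without the hidden characters."""
    return stego.replace(ZERO, "").replace(ONE, "")


cover = "the quick brown fox jumps over the lazy dog and then runs far away"
stego = embed(cover, b"A")
print(extract(stego))                # b'A'
print(strip_hidden(stego) == cover)  # True
```

The capacity of this sketch is one bit per word, and the hidden characters are removed by any tool that normalizes or filters such code points.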

As mentioned previously, it is possible to apply the aforementioned techniques for data hiding via office documents. The feasibility of this approach is demonstrated in Chapter 3, where the text in a Microsoft Word document is modified in a certain way using some of the techniques described above for the steganography application. In addition, since techniques of data hiding via texts sometimes produce illogical texts that are susceptible to human inspection, it is proposed to leverage the collaborative nature of office document editing to make covert communication more effective in face of steganalysis [28]-[30] and active warden attacks [31], which are discussed in more detail in Chapter 3.

³ An example of passivization is to change the sentence “Renee gave a speech” into “A speech was given by Renee,” and an example of clefting is to change the sentence “We are looking for Biwi” into “It is Biwi whom we are looking for.”

⁴ For example, translating a text from English to Chinese and then back to English, or from English to Japanese, to Korean, and then to French.

2.2 Data Hiding via Text Formatting and Layout

Office documents such as Microsoft Word documents allow very flexible formatting and layout of texts, including the precise controllability of text font sizes, colors, cases, styles, and effects; selection of various list and numbering options; flexible adjustability of inter-word, inter-line, and inter-paragraph spacings as well as line, tab, and paragraph indentations; setting of page and section margins; and so on.

It is possible to embed information into an office document by making small adjustments to the above-mentioned attributes in ways similar to those proposed for other media types. For example, Maxemchuk et al. [32]-[33] proposed to shift word and line spacings slightly (such as by 1/150 or 1/300 inch) in a document image to embed information; Zhong et al. [34] modified the spacing between characters within a line in a PDF to embed data; Villán et al. [35] proposed to use color quantization to store data in electronic or printed documents; and Walton [36] described a technique of replacing the least-significant bits (LSBs) of the pixels of a cover image to embed information. Specifically, instead of shifting word or line spacings in a text image, we can slightly modify the word or line spacing attributes in an office document to embed information. Likewise, instead of changing the LSBs of pixel values, we can change the LSBs of the text color values in an office document.
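A minimal Python sketch of LSB replacement on text colors is given here for illustration. It assumes that the colors of successive text runs are available as a list of (R, G, B) tuples with 8-bit channels and that each channel carries one bit; how the colors are read from and written back to a document is outside the sketch, and the function names are hypothetical.

```python
def embed_bits_in_colors(colors, bits):
    """Overwrite the least-significant bit of each color channel with a payload bit."""
    flat = [channel for rgb in colors for channel in rgb]
    if len(bits) > len(flat):
        raise ValueError("not enough color channels for the payload")
    for i, bit in enumerate(bits):
        flat[i] = (flat[i] & ~1) | bit   # clear the LSB, then set it to the bit
    return [tuple(flat[i:i + 3]) for i in range(0, len(flat), 3)]


def extract_bits_from_colors(colors, count):
    """Read the least-significant bits of the first `count` color channels."""
    flat = [channel for rgb in colors for channel in rgb]
    return [channel & 1 for channel in flat[:count]]


# Two text runs: pure black and pure white.
cover_colors = [(0, 0, 0), (255, 255, 255)]
payload = [1, 0, 1, 0, 1, 1]
stego_colors = embed_bits_in_colors(cover_colors, payload)
print(stego_colors)                               # [(1, 0, 1), (254, 255, 255)]
print(extract_bits_from_colors(stego_colors, 6))  # [1, 0, 1, 0, 1, 1]
```

Each channel changes by at most one level out of 256, which is why the modification is hard to perceive as long as the original color scheme is kept.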

Figure 2.2 shows an example of applying the technique of LSB replacement on text colors in an office document, where the word “Partial” in the first bullet-point in a slide is changed from completely black to a very dark gray. Such a modification is imperceptible, as seen in the left slide in the figure. The right slide in Figure 2.2 shows the result of applying automatic style formatting to the left slide, where the white background is changed into a dark blue one, and the black text color is changed into white. In this case, the data previously embedded using LSB replacement is still intact since the color remains unchanged as dark gray. However, the color modification is no longer imperceptible.

The challenges of using LSB replacement for data hiding via office documents in the presence of automatic style formatting as well as other attacks such as copying-and-pasting of contents are discussed in more detail in Chapter 6, where a novel technique is proposed for effective data hiding in slide presentations.

[Figure 2.2 image: a slide titled “Distributed Arithmetic for a 2-Tap Filter” with the bullet points “Partial products of equal weight are added together before being summed to next higher partial product weight” and “Create look-up table of summed partial products,” shown in two panels (a) and (b).]

Figure 2.2. Illustration of slide designs. (a) A slide from a tutorial from Xilinx, Inc. with black texts on white background; (b) the slide in (a) with a slide design template of bluish background applied.

2.3 Data Hiding via Multimedia Contents

Office documents can contain an assortment of multimedia contents such as drawings, images, videos, and audios. Office software suites that are in common use today typically cannot manipulate audio or video contents, so these media are often stored as standalone files and an office document simply stores a reference to the external file. On the other hand, drawings and images are usually embedded directly into an office document for ease of manipulation, transmission, and storage. It is thus possible to achieve data hiding via office documents by applying existing data hiding techniques to drawings and images and then embedding them into an office document.

For example, since the embedding of text images such as hand-written signatures or specially-styled headlines into an office document is a plausible scenario, we can apply techniques proposed for covert communication via text images [37]-[39] on these embedded images for the purpose of conveying a secret message via office documents. Such an approach is desirable as the sending of a hand-written signature by itself is relatively improbable compared to the case of embedding it in an office document. Also, steganalysis of a text image inside an office document is computationally more expensive than processing a stand-alone text image.

Compared to the cases of embedding text images, it is more common to embed various color images into an office document such as a slide presentation to illustrate or emphasize the key points mentioned in the document. One can thus use techniques proposed for embedding secret information into color images [40], [41] for the steganography application using a similar method as that mentioned previously. One may also use techniques proposed for embedding watermarks into images [42]-[48] for the copyright protection application by embedding watermarked images into an office document.

Digital watermarking methods for images are usually categorized into two types: invisible and visible⁵. The first type aims to embed copyright information imperceptibly into host media such that in cases of copyright infringements, the hidden information can be retrieved to identify the ownership of the protected host. It is important for the watermarked image to be resistant to common image operations to ensure that the hidden information is still retrievable after such alterations. Methods of the second type, on the other hand, yield visible watermarks which are generally clearly visible after common image operations are applied. In addition, visible watermarks convey ownership information directly on the media and can deter attempts of copyright violations.

Embedding of watermarks, either visible or invisible, degrades the quality of the host media in general. A group of techniques, named reversible watermarking [49]-[59], allow legitimate users to remove the embedded watermark and restore the original content as needed. However, not all reversible watermarking techniques guarantee lossless image recovery, which means that the recovered image is identical to the original, pixel by pixel. Lossless recovery is important in many applications where serious concerns about image quality arise. Some examples include forensics, medical image analysis, historical art imaging, or military applications.

Lossless visible watermarking is mentioned relatively rarely in the literature compared with its invisible counterpart. Several lossless invisible watermarking techniques have been proposed in the past. The most common approach is to compress a portion of the original host and then embed the compressed data together with the intended payload into the host [52]-[54]. Another approach is to superimpose the spread-spectrum signal of the payload on the host so that the signal is detectable and removable [42]. A third approach is to manipulate a group of pixels as a unit to embed a bit of information [55]-[57]. Although one may use lossless invisible techniques to embed removable visible watermarks [51], [58], the low embedding capacities of these techniques hinder the possibility of implanting large-sized visible watermarks into host media.

⁵ There is also the “cocktail” watermarking scheme [48] that embeds both types of watermarks simultaneously into an image, which makes it harder for an attacker to remove both types of watermarks.

As to lossless visible watermarking, the most common approach is to embed a monochrome watermark using deterministic and reversible mappings of pixel values or DCT coefficients in the watermark region [50], [59]. Another approach is to rotate consecutive watermark pixels to embed a visible watermark [59]. One advantage of these approaches is that watermarks of arbitrary sizes can be embedded into any host image. However, only binary visible watermarks can be embedded using these approaches, which is too restrictive since most company logos are colorful.

In Chapter 8, we describe a new method for lossless visible watermarking which allows the embedding of different types of visible watermarks into cover images, including the embedding of non-uniformly translucent full-color ones such as that illustrated in Figure 2.3 below. Such watermarks provide significantly better advertising effects than traditional monochrome ones when the images are embedded within office documents.

2.4 Data Hiding via Multimedia Formatting and Layout

In addition to hiding data inside the multimedia content themselves, it is also possible to leverage the formatting or the layout of the multimedia content embedded in an office document for data hiding applications. For example, images are often created in external programs and then embedded in office documents. For convenience, office application suites often allow these images to be adjusted, including their brightness and contrast values, size of appearances, amounts of cropping for the four edges, and positioning properties. Many of these formatting or layout properties may be used for various data hiding applications.

On the other hand, drawings are often created inside an office document using the office application software. Such drawings are usually vector drawings that contain objects of different shapes and sizes with uniform or gradient fills. Data hiding in vector drawings is less studied than data hiding in images, because such media contain relatively little information that can be manipulated.

Data hiding in a vector drawing is most commonly achieved by altering the geometry or positioning of the shapes in the drawing to embed data, the manipulation of which can be done in the spatial domain or in one of the transform domains such as DFT, DWT, and DCT [60]-[66]. Kwon et al. [61] embedded invisible watermark signals into lines, arcs, and circles in a CAD drawing by modifying their lengths, angles, and radii, respectively. Detection of the watermark, however, requires the use of the original drawing. Solachidis and Pitas [62] achieved blind watermark detection by modifying the coordinates of the vertices in a polygonal line using Fourier descriptors. The embedded watermark is resilient to scaling, rotation, and translation attacks, but vulnerable to distortion attacks. The method was later enhanced by Doncel et al. [63]. Im et al. [64] proposed the use of wavelet descriptors for embedding watermarks that are robust against global and local geometrical distortions.

It is noted that techniques that manipulate the internal coordinates of a shape itself cannot be applied to drawings such as flowcharts, network topologies, floor plans, and circuit diagrams, because objects in these diagrams come from stencils. Figure 2.4 shows an example of a floor plan drawing created in Microsoft Visio, where the shapes representing desks, chairs, servers, walls, doors, etc. all come from standard stencils, and cannot be individually altered. In Chapter 7, we describe how to manipulate the way that drawing objects are embedded in a Microsoft Visio drawing for data hiding applications.

Another technique for data hiding in multimedia formatting is that proposed by Yang and Chen [67], where the animation effects of objects in a Microsoft PowerPoint presentation are modified according to an animation codebook to embed a secret message for the steganography application. The work was later extended by Jing et al. [68] by further leveraging the animation timing effect variations for message embedding. One advantage of these techniques and that proposed in Chapter 7 is that the main content in the document is not distorted during message embedding. Another advantage is that these techniques can in general be used in conjunction with each other to extend the data hiding capacity as well as increase the complexity of steganalysis.

Figure 2.4. A floor plan diagram of an office composed of different objects from stencils.

2.5 Data Hiding via Auxiliary Data

Another approach to data hiding via an office document is simply to store information inside document metadata [69] such as the author, organization, description, and keyword fields that generally allow arbitrary information to be entered and stored. Liu et al. [70] proposed to store a secret message inside the notes pages of a Microsoft PowerPoint document. The embedding is made innocuous by generating the notes based on the sentences contained in the slides.

An interesting type of auxiliary data that can be embedded into an office document is program code, or macros [71]. Normal uses of macros can make document processing easier and more efficient [72], but macros can also be used for new approaches to active data hiding [73]. However, since malicious codes such as viruses and worms can easily be embedded into macros, their uses are being limited by anti-virus software applications as well as the office applications themselves.

The technique of embedding information inside document auxiliary data is suitable for data hiding applications such as data association or media authentication (the technique is used in Chapter 4 and Chapter 5 for exactly these purposes), but is in general undesirable for applications such as copyright protection. This is because document metadata can usually be modified or removed easily without affecting the main content of the document; indeed, Microsoft has provided detailed how-to documents as well as tools [74], [75] for removing information embedded in the metadata of an office document.

2.6 Data Hiding via Physical File Formats

Data hiding via the physical file format of office documents has gained research traction recently, thanks to Microsoft’s adoption of standardized file formats and opening-up of previously proprietary binary formats. One approach to data hiding via physical document files is to utilize unused spaces such as slack spaces at the end of data streams in a file [76] or redundant data that are created during consecutive file updates to a document [77], [78].

Another approach to data hiding via physical file formats is to exploit the forward-compatible nature of the document format, that is, the fact that application software will typically silently ignore unknown data blocks encountered while reading a file. Park et al. [79] described how unknown parts and unknown relationships in an Office Open XML document (which is a zipped file containing XML documents and other supporting files) can be used for steganography applications.

Finally, since the standard-based office document formats such as Office Open XML and OpenDocument are (compressed) XML files, one may use data hiding techniques proposed for XML files on such documents. For example, the five techniques proposed by Inoue et al. [80] for embedding data into XML documents may be applied to office documents for data hiding applications: 1) alternate representation of empty elements; 2) use of white spaces in tags; 3) utilizing the order of appearance of elements; 4) utilizing the order of appearance of attributes; and 5) alternate representation of elements that can contain other elements.
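The first of the five techniques listed above, alternate representation of empty elements, can be sketched in Python as follows. This is an illustrative sketch rather than the implementation of Inoue et al.: the convention that `<tag/>` encodes bit 0 and `<tag></tag>` encodes bit 1 is an assumption, the regular expression handles only empty elements without attributes or namespace prefixes, and the function names are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Matches <tag/> (group 1) or <tag></tag> (group 2); attributes and
# namespace prefixes are deliberately not handled in this sketch.
EMPTY = re.compile(r"<(\w+)\s*/>|<(\w+)></\2>")


def embed(xml_text, bits):
    """Rewrite successive empty elements so that their form encodes the bits."""
    pending = iter(bits)

    def rewrite(match):
        tag = match.group(1) or match.group(2)
        bit = next(pending, None)
        if bit is None:          # payload exhausted: leave the element unchanged
            return match.group(0)
        return f"<{tag}/>" if bit == 0 else f"<{tag}></{tag}>"

    return EMPTY.sub(rewrite, xml_text)


def extract(xml_text, count):
    """Read the first `count` bits from the forms of the empty elements."""
    bits = [0 if m.group(1) else 1 for m in EMPTY.finditer(xml_text)]
    return bits[:count]


def tag_sequence(xml_text):
    """List the element tags as seen by an XML parser."""
    return [element.tag for element in ET.fromstring(xml_text).iter()]


cover = "<doc><br/><hr/><p>text</p><br/></doc>"
stego = embed(cover, [1, 0, 1])
print(stego)                                       # <doc><br></br><hr/><p>text</p><br></br></doc>
print(extract(stego, 3))                           # [1, 0, 1]
print(tag_sequence(stego) == tag_sequence(cover))  # True
```

Both forms parse to the same element tree, so the change is invisible to a reader of the document, but it is also lost as soon as an application re-serializes the XML in its own preferred form.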

2.7 Summary

In this chapter we presented six areas for data hiding via office documents and pointed out related works that can be used for data hiding in each area. The first five areas (i.e., data hiding in text, data hiding in text formatting and layout, data hiding in multimedia content, data hiding in multimedia formatting and layout, and data hiding in auxiliary data) can be regarded as data hiding in the logical regions of the office document. These techniques are in general more resilient to common operations performed on office documents compared to techniques that exploit the physical file format directly. One reason is that an office software application has no obligation to preserve the content and structure of non-user-visible data. This is especially true in file format conversions, such as converting a file between the Office Open XML and the OpenDocument formats, where unknown and hence unconvertible contents are simply discarded.

In the subsequent chapters, we focus on data hiding in the logical areas of the office document (i.e., data hiding via texts, data hiding via text formatting and layout, data hiding via multimedia contents, data hiding via multimedia formatting and layout, and data hiding via auxiliary data) as opposed to hiding in the physical file formats. Experimental results are included to demonstrate that the proposed data hiding techniques can indeed embed data that survive attacks such as file format conversions and common editing operations. A summary of the areas utilized for data hiding via office documents by each of the proposed methods can be found in Table 9.1 in the last chapter.

Chapter 3 A New Steganographic Method for Data Hiding in Microsoft Word Documents by a Change-Tracking Technique

3.1 Introduction

Office documents are sometimes written by multiple authors who may be physically distant from each other. To facilitate communications between the authors during the collaborative document authoring process, a word processor such as Microsoft Word can be used to record the exact modifications performed by an author and embed the revisions as change-tracking information into the document. From such change-tracking information, one can discern the exact changes made by a prior author, and can recover a prior version of the document if necessary.

Figure 3.1 shows an example of the collaborative document authoring process in Microsoft Word, where an author is modifying a document and the word processor is tracking the author’s modifications. The modifications by the author are clearly marked, with the deleted words struck through and newly inserted text underlined⁷. Formatting changes are displayed as comment bubbles at the right side margin of the page. Each collaborating author can accept or reject each of the modifications made by another author. It is a common practice for a collaborating author to review and then accept or reject each of the modifications in a document first before performing his/her own corrections.

We have chosen Microsoft Word documents, which provide change-tracking facilities, as cover media to materialize the proposed method. Communications via Word documents are commonplace for personal, business, or academic purposes nowadays, so transmissions of such documents will not be under close scrutiny. We note that any other document format that offers change-tracking facilities, such as OpenDocument, can also be

⁷ There are commands and options to change the ways the modifications are displayed, for example showing deletions and insertions as comment bubbles at the side or listing the modifications line-by-line in a separate panel.
