Chapter 1 Introduction
1.5 Thesis Organization
In the remainder of this thesis, related work on data hiding, secret sharing, and the PDF standard is reviewed in Chapter 2. In Chapter 3, the proposed method for secret transmission is described. In Chapter 4, the proposed secret authentication method for fidelity verification in PDF documents is described. In Chapter 5, the proposed secret sharing method for PDF documents based on the proposed data hiding techniques is described. Finally, conclusions and some suggestions for future work are given in Chapter 6.
Chapter 2
Review of PDF Standard and Survey of Related Works
2.1 Introduction
Since we use PDF files as cover media to implement the data hiding techniques proposed in this study, we first need to understand the structure of the PDF. A review of the PDF standard is given in Section 2.2.
In Section 2.3, existing techniques for data hiding in PDF files are reviewed, and existing techniques for secret sharing are reviewed in Section 2.4.
2.2 Review of PDF Standard
2.2.1 Overview
The Adobe portable document format (PDF) is a file format of the Adobe® Acrobat® family of products [1]. The contents in PDF files are described by a context-free grammar which is modified from PostScript®. PDF is a file format for representing documents in a manner independent of the application software, hardware, and operating systems used to create them and of the output device on which they are to be displayed or printed.
The basic elements in the PDF are objects. A PDF document consists of a collection of objects that together describe the appearance of one or more pages of the PDF. A document’s pages can contain any combination of text, graphics, and images.
A page’s appearance is described by a PDF content stream, which contains a sequence of graphics objects to be painted on the page.
2.2.2 Basic Types of Objects
A PDF document is a data structure composed from a small set of basic types of data objects. Objects may be labeled so that they can be referred to by other objects. A labeled object is called an indirect object. PDF supports eight basic types of objects: Boolean values, integer and real numbers, strings, names, arrays, dictionaries, streams, and the null object. Each object type and the indirect object are briefly described below.
1. Boolean Objects
The PDF provides boolean objects identified by the keywords true and false.
2. Numeric Objects
The PDF provides two types of numeric objects: integer and real. The range and precision of numbers are limited by the internal representations used in the computer on which the PDF consumer application is running. For example, 17, +99, −3, 18.9, −.005, 0.0, and 5. are all numeric objects.
3. String Objects
A string object consists of a series of bytes — unsigned integer values in the range 0 to 255. The length of a string may be subject to implementation limits. String objects can be written in two ways:
(1) As a sequence of literal characters enclosed in parentheses ( ). For example: (This is a string).
(2) As hexadecimal data enclosed in angle brackets < >. For example:
< 762073686D602E >.
4. Name Objects
A name object is a symbol uniquely defined by a sequence of characters. A slash character (/) introduces a name. The slash is not part of the name but is a prefix indicating that the following sequence of characters constitutes a name. For example: /NameOne.
5. Array Objects
An array object is a one-dimensional collection of objects arranged sequentially. An array’s elements may be any combination of numbers, strings, dictionaries, or any other objects, including other arrays. An array is written as a sequence of objects enclosed in square brackets ([ and ]). For example: [true 17.417 (Hello World) /SomeName].
6. Dictionary Objects
A dictionary object is an associative table containing pairs of objects, known as the dictionary’s entries. The first element of each entry is the key and the second element is the value. The key must be a name and the value can be any kind of object, including another dictionary. A dictionary is written as a sequence of key-value pairs enclosed in double angle brackets (<< … >>).
For example: << /Type /Example /Version 3.02
/String1 (Hello World)
/Array1 [1 0 0 1]
>>
7. Stream Objects
A stream object, like a string object, is a sequence of bytes. However, unlike a string, whose length is subject to an implementation limit, a stream can be of unlimited length. A stream consists of a dictionary followed by zero or more bytes bracketed between the keywords stream and endstream. For example:
<<…>>
8. Null Object
There is only one object of type null, denoted by the keyword null.
2.2.3 Indirect Objects
Any object in a PDF file may be labeled as an indirect object. This gives the object a unique object identifier by which other objects can refer to it. The object identifier consists of two parts as follows.
1. A positive integer object number. Indirect objects are often numbered sequentially within a PDF file, but this is not required; object numbers may be assigned in any arbitrary order.
2. A non-negative integer generation number. In a newly created file, all indirect objects have generation numbers of 0. Nonzero generation numbers may be introduced when the file is later updated.
The combination of an object number and a generation number uniquely identifies an indirect object. The definition of an indirect object in a PDF file consists of its object number and generation number, followed by the value of the object bracketed between the keywords obj and endobj. The object can be referred to from elsewhere in the file by an indirect reference consisting of the object number, the generation number, and the keyword R. An example of indirect objects is as follows:
5 0 obj
The example defines an indirect object with an object number of 5 and a generation number of 0, whose value is a stream object with its dictionary. According to the dictionary of the stream object, the length of the stream is given by a reference to another indirect object, which has an object number of 3 and a generation number of 0.
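As an illustration of the syntax described in this section, the following is a minimal Python sketch that recognizes an indirect object definition and the indirect references inside it. The sample bytes and the names OBJ_HEADER, INDIRECT_REF, and parse_indirect_object are assumptions made for illustration only; they are not part of the PDF standard or of the proposed method, and a real parser would also have to skip comments, strings, and stream data.

```python
import re

# "<object number> <generation number> obj" opens an indirect object definition.
OBJ_HEADER = re.compile(rb"(\d+)\s+(\d+)\s+obj\b")
# "<object number> <generation number> R" is an indirect reference.
INDIRECT_REF = re.compile(rb"(\d+)\s+(\d+)\s+R\b")


def parse_indirect_object(data: bytes):
    """Return ((object number, generation number), [references]) for one definition."""
    header = OBJ_HEADER.match(data)
    if header is None:
        raise ValueError("not an indirect object definition")
    end = data.find(b"endobj", header.end())
    if end < 0:
        raise ValueError("missing endobj keyword")
    body = data[header.end():end]
    ident = (int(header.group(1)), int(header.group(2)))
    refs = [(int(m.group(1)), int(m.group(2))) for m in INDIRECT_REF.finditer(body)]
    return ident, refs


# Hypothetical sample written for this sketch only.
sample = b"5 0 obj\n<< /Length 3 0 R >>\nstream\n...\nendstream\nendobj\n"
ident, refs = parse_indirect_object(sample)
```

For this sample, ident is the pair (5, 0) and refs contains the single reference (3, 0).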
2.2.4 File Structure
In this section, we will describe how objects are organized in a PDF file. A canonical PDF file initially consists of four elements (see Figure 2.1) briefly described as follows.
1. A one-line header identifying the version of the PDF specification to which the file conforms.
For example, for a file conforming to PDF 1.4, the header should be
“%PDF-1.4.”
2. A body consisting of a sequence of indirect objects representing the contents of a document.
3. A cross-reference table containing information about the indirect objects in the file.
The cross-reference table is the only part of a PDF file with a fixed format.
The table comprises one or more cross-reference sections. Each section begins with a line containing the keyword xref. Following this line are one or more cross-reference subsections. Each subsection begins with a line containing two numbers separated by a space: the object number of the first object in this subsection and the number of entries in the subsection.
Following this line are the cross-reference entries themselves, one per line.
Each entry contains a 10-digit byte offset, a 5-digit generation number, and the keyword n (for an in-use entry) or f (for a free entry). When the entry is in use, the 10-digit byte offset gives the number of bytes from the beginning of the file to the beginning of the object. If the entry is free, it gives the object number of the next free object. The following example has 7
4. A trailer giving the location of the cross-reference table and of certain special objects within the body of the file.
Figure 2.1 Initial structure of a PDF file.
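The fixed format of the cross-reference entries described above can be sketched in Python as follows. This is a minimal illustration only: the names XrefEntry, parse_xref_entry, and parse_subsection_header are hypothetical, and the entries used in the usage note are made-up values rather than data from an actual file.

```python
from typing import NamedTuple, Tuple


class XrefEntry(NamedTuple):
    value: int       # byte offset (in-use entry) or next free object number (free entry)
    generation: int  # 5-digit generation number
    in_use: bool     # True for keyword n, False for keyword f


def parse_subsection_header(line: str) -> Tuple[int, int]:
    """Parse '<first object number> <number of entries>'."""
    first, count = line.split()
    return int(first), int(count)


def parse_xref_entry(line: str) -> XrefEntry:
    """Parse one entry: 10-digit number, 5-digit generation number, n or f."""
    fields = line.split()
    if len(fields) != 3:
        raise ValueError("an entry must have exactly three fields")
    value, generation, keyword = fields
    if len(value) != 10 or not value.isdigit():
        raise ValueError("first field must be a 10-digit number")
    if len(generation) != 5 or not generation.isdigit():
        raise ValueError("second field must be a 5-digit number")
    if keyword not in ("n", "f"):
        raise ValueError("keyword must be n or f")
    return XrefEntry(int(value), int(generation), keyword == "n")
```

For instance, parse_xref_entry("0000000017 00000 n") yields an in-use entry whose object starts 17 bytes from the beginning of the file.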
2.3 Review of Techniques for Data Hiding in PDF files
Techniques for data hiding have been studied for a long time and applied to various media, including images of various formats, videos, and documents [2-9]. The PDF has become one of the most popular formats for exchanging information, and several techniques for data hiding in PDF files have been proposed in recent years [10-13].
Zhong and Chen [10] proposed an information steganography algorithm for PDF documents which hides data between the indirect objects of a PDF document. The algorithm can embed data of unlimited length into PDF documents, and the embedded data remain transparent when the documents are displayed in PDF readers. Zhong, Cheng and Chen [11] proposed a steganographic technique for hiding data in a kind of English PDF texts. They modified the integer numerals which are used to position characters in the PDF text. Because the perceptual difference is very small, people cannot be aware of the hidden data in the PDF document. Liu et al. [12] proposed an algorithm based on equivalent transformations in PDF files. They observed that the page display of a PDF file is independent of the order of a dictionary’s entries, so that data hiding can be achieved by a specific arrangement of the entries instead of by adding any other data to the cover PDF. Wang and Tsai [13] proposed a data hiding method based on slight modifications of the values of various PDF object parameters, yielding a difference in appearance which is very difficult for human vision to notice.
2.4 Review of Techniques for Secret Sharing
Secret sharing is a method for distributing a secret into several shares which are then distributed to some participants. Each of them keeps one of the shares and every share is meaningless alone. Only when a pre-defined threshold number of these shares are collected together can the secret be recovered.
Blakley [14] was the first to publish an approach to solving the secret sharing problem. His approach is probabilistic and based on linear projective geometry. Each share vi specifies a hyperplane, and the secret s is the unique point of intersection of the n hyperplanes. Shamir [15] proposed a simple and efficient secret sharing scheme called a (k, n)-threshold scheme, where k is the minimum number of shares that must be collected and n is the number of participants.
Lin and Tsai [16] proposed an efficient (n, n)-threshold method using exclusive-OR operations. It simply applies the exclusive-OR operation to a secret image as well as n − 1 other images to generate the nth image. The n − 1 images and the nth image are all regarded as shares and are distributed to n participants, respectively. The secret image can be recovered only by exclusive-ORing the n images which are kept by the n participants. Huang and Tsai [17] [18] proposed secret sharing methods for pure texts, HTML documents, and e-mail documents. For pure texts, they transformed a secret text into several shares, which are meaningful articles and can be authenticated. For HTML documents, the method was designed to extract the important parts of the components in a secret HTML document and share them, and then to transform the share data into an HTML document with the same appearance as the secret HTML document, so that each share is still an HTML-type share. For e-mail documents, a secret e-mail is encoded and distributed into several authenticable e-mail shares by hierarchical sharing with data magnitude control, steganography methods, and authentication techniques. In addition, a secret H.264/AVC video sharing scheme was proposed by Huang and Tsai [19] to extract prediction modes from given cover videos and the secret video, then share the intra-prediction modes of the secret video based on the exclusive-OR operation, and finally hide the resulting share data in the prediction modes of the cover videos.
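The exclusive-OR idea behind the (n, n)-threshold method reviewed above can be illustrated by the following minimal Python sketch. It is not the implementation of [16]: it works on byte strings rather than images, it uses random byte strings in place of the n − 1 other images, and the function names make_shares and recover are hypothetical.

```python
import secrets
from functools import reduce
from typing import List


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Exclusive-OR two byte strings of equal length."""
    return bytes(x ^ y for x, y in zip(a, b))


def make_shares(secret: bytes, n: int) -> List[bytes]:
    """Create n shares; the nth share is the XOR of the secret and the other n - 1."""
    if n < 2:
        raise ValueError("at least two shares are required")
    # Assumption: random data stands in for the n - 1 other images.
    others = [secrets.token_bytes(len(secret)) for _ in range(n - 1)]
    last = reduce(xor_bytes, others, secret)
    return others + [last]


def recover(shares: List[bytes]) -> bytes:
    """Exclusive-OR all n shares to recover the secret."""
    return reduce(xor_bytes, shares)
```

Exclusive-ORing all n shares cancels the n − 1 random strings and leaves the secret, whereas any smaller subset of shares leaves at least one random string uncancelled.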
Chapter 3
Secret Transmission via PDF Files by Space Coding and Insertion of Invisible Texts
3.1 Introduction
With the development of network techniques, people can exchange information by writing e-mails or sending files on the Internet. Secret messages can even be transmitted by data hiding techniques, using some types of files as cover media. There are many data hiding methods for secret transmission via images, videos, and other types of files. Because the PDF has become a very popular file format, it is also a suitable type of file to use as cover media in secret transmission.
The proposed method for secret transmission via PDF files is described in this chapter. In Section 3.2, the proposed two data hiding techniques are described. In Section 3.3, the proposed data hiding processes are described, and the recovery processes are stated in Section 3.4. In Section 3.5, several experimental results are shown and a summary and some discussions of the proposed method are made in Section 3.6.
3.2 Data Hiding Techniques in PDF Files
In this section, the two proposed data hiding techniques are described. The first is based on a space coding scheme and the other is based on a scheme of inserting invisible texts into a cover PDF.
3.2.1 Data Hiding in PDF Files by Space Coding
As mentioned in Chapter 1, white-space characters are used to separate syntactic constructs from one another in the PDF file. According to the PDF standard, many distinct characters are treated as white-space characters. Table 3.1 shows this property of the PDF.
All white-space characters are equivalent, except in comments, strings, and streams. Therefore, white-space characters that are not in comments, strings, or streams are all interchangeable as white spaces in PDF documents. We can use this property to embed secret messages into cover PDFs.
More specifically, according to Table 3.1, a white-space character may take one of six codes, namely the hexadecimal numbers 00, 09, 0A, 0C, 0D, and 20. After some experiments, we found that in some text editors 0C is shown as a line and 0A causes a line feed when the source code of a PDF document is displayed. If we used these two codes to embed secret messages, people might notice the existence of the hidden data more easily. We therefore use only 00, 09, 0D, and 20 to embed the secret message. Thus, as long as these four kinds of white-space characters are not in comments, strings, or streams, they are all usable for embedding the secret message.
Accordingly, a white-space character has four different codes to use now, so we can embed 2 bits of message data using a single white-space character. More specifically, we use the hexadecimal codes 00, 09, 0D and 20 to represent the 2-bit message data 00, 01, 10, and 11, respectively.
Table 3.2 Proposed data encoding scheme.
White-space code (hexadecimal)    2-bit message data
00                                00
09                                01
0D                                10
20                                11
… white-space characters to embed the message, and their codes are 00, 20, 09, 20, and 0D, respectively.
Besides, in order to tell how many white-space characters have been modified for data hiding, we use two bytes to embed the length of the secret message before embedding the secret data. These two bytes will be used in the message data recovery process.
Since we do not insert any additional data into the cover PDF in the message embedding process described above, the size of the PDF file does not change, so it is difficult for a reader of the displayed stego-PDF file to become aware of the hidden data. However, the amount of data that can be embedded, called the data embedding capacity in the sequel, is limited by the number of usable white-space characters in the cover PDF.
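The space coding scheme of this section can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the actual implementation: the two length bytes are assumed to hold the message length in bytes in big-endian order, each byte is assumed to be embedded with its most significant bit pair first, and locating the usable white-space characters in a cover PDF (those outside comments, strings, and streams) is not shown. All function names are hypothetical.

```python
# The 2-bit message data 00, 01, 10, and 11 map to these white-space codes.
CODES = (0x00, 0x09, 0x0D, 0x20)
VALUES = {code: value for value, code in enumerate(CODES)}


def bytes_to_codes(data: bytes) -> bytes:
    """Encode each byte as four white-space codes (assumed order: high bits first)."""
    out = bytearray()
    for byte in data:
        for shift in (6, 4, 2, 0):
            out.append(CODES[(byte >> shift) & 0b11])
    return bytes(out)


def codes_to_bytes(codes: bytes) -> bytes:
    """Decode groups of four white-space codes back into bytes."""
    if len(codes) % 4:
        raise ValueError("the number of codes must be a multiple of four")
    out = bytearray()
    for i in range(0, len(codes), 4):
        byte = 0
        for code in codes[i:i + 4]:
            byte = (byte << 2) | VALUES[code]
        out.append(byte)
    return bytes(out)


def encode_message(message: bytes) -> bytes:
    """Prefix the message with a two-byte length (assumed big-endian byte count)."""
    if len(message) > 0xFFFF:
        raise ValueError("message too long for a two-byte length field")
    return bytes_to_codes(len(message).to_bytes(2, "big") + message)


def decode_message(codes: bytes) -> bytes:
    """Read the two-byte length from the first eight codes, then the message."""
    length = int.from_bytes(codes_to_bytes(codes[:8]), "big")
    return codes_to_bytes(codes[8:8 + 4 * length])
```

Under these assumptions, a message of m bytes occupies 4(m + 2) usable white-space characters of the cover PDF, which makes the capacity limit mentioned above explicit.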
3.2.2 Data Hiding in PDF Files by Insertion of Invisible Texts
In this section, the other proposed data hiding technique based on a scheme of inserting invisible texts into a cover PDF is described. The basic idea is that we can insert into the PDF file some text matrices whose coordinates are outside the visible area of the PDF so that the corresponding texts will not show on the displayed PDF document. The details are described as follows.
By definition, a text matrix is used to set the state of the corresponding text and locate it in a PDF file. The structure of a text matrix is shown below:
a b c d e f Tm
where a through f are all numbers and Tm indicates the end of the text matrix. The first four numbers “a b c d ” are used for text scaling, rotation, and skew in the following way.
1. Scaling is obtained by “Sx 0 0 Sy”. This scales the coordinates so that 1 unit in the horizontal and vertical dimensions of the new coordinate system is the same size as Sx and Sy units, respectively, in the previous coordinate system.
2. Rotations are produced by “cosθ sinθ −sinθ cosθ”, which has the effect of rotating the coordinate system axes counterclockwise by an angle θ.
3. Skew is specified by “1 tanα tanβ 1”, which skews the x-axis by an angle α and the y-axis by an angle β.
The initial values of a, b, c and d are 1, 0, 0 and 1, respectively. The other two numbers e and f are the distances to translate the origin of the coordinate system in the horizontal and vertical dimensions, respectively. The proposed method is designed to transform secret data into decimal numbers. The transformation is specified by Table 3.3.
After the transformation, the secret data become a long string of decimal digits, which we regard as a big integer number. Then we process this integer number, denoted as N, in different ways according to the following three conditions.
1. Set e to N if N does not cause an overflow and set f to 0. (Note that, as mentioned in Section 2.2.2, the range and precision of numbers are limited by the internal representations used in the computer on which the PDF consumer application is running. So when we set e to N, an overflow may occur.)
2. If N causes an overflow, separate N into two numbers N1 and N2. Let N be n1n2n3…nknk+1…nl, where ni is a decimal digit for i from 1 to l, l is the number of decimal digits in N, and k is the largest number for which n1n2n3…nk does not cause an overflow. Then, set N1 to n1n2n3…nk and N2 to nk+1…nl, and use N1 as e and N2 as f.
3. If either N1 or N2 or both of them cause overflows, insert more text matrices to embed the secret data.
Table 3.3 The transformation between binary bits and decimal numbers.
Bit stream    Decimal number    Bit stream    Decimal number
000           1                 100           5
001           2                 101           6
010           3                 110           7
011           4                 111           8
… boundaries of the physical medium on which the page is intended to be displayed or printed. In short, the MediaBox decides the visible area of the page. A common visible area of a PDF page is 595×842. In order to guarantee that the position of texts is outside the visible area to create invisibility to the observer, we concatenate “999” before the decimal numbers. An example is given as follows.
Suppose the secret message is “010001110.” Then, according to Table 3.3, we can transform it into “327”. And then we concatenate “999” before “327”, yielding “999327”. Finally, we insert a text matrix and put “999327” in it. The final text matrix is shown below:
1 0 0 1 999327 0 Tm.
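The example above can be reproduced by the following minimal Python sketch. It assumes that each 3-bit group is mapped to its binary value plus one, which agrees with the entries 000 → 1 and 100 → 5 of Table 3.3 and with the example above; it handles only the first condition (no overflow), and the function names and the max_digits limit are hypothetical stand-ins for the numeric limit of a PDF consumer application.

```python
def bits_to_digits(bits: str) -> str:
    """Map each 3-bit group to a decimal digit 1-8 (assumed: binary value + 1)."""
    if len(bits) % 3 or set(bits) - {"0", "1"}:
        raise ValueError("expected a bit string whose length is a multiple of 3")
    return "".join(str(int(bits[i:i + 3], 2) + 1) for i in range(0, len(bits), 3))


def digits_to_bits(digits: str) -> str:
    """Inverse mapping, used when the secret message is recovered."""
    if set(digits) - set("12345678"):
        raise ValueError("expected only the digits 1 to 8")
    return "".join(format(int(d) - 1, "03b") for d in digits)


def make_text_matrix(bits: str, max_digits: int = 9) -> str:
    """Build 'a b c d e f Tm' with e = '999' followed by the digits and f = 0."""
    number = "999" + bits_to_digits(bits)
    if len(number) > max_digits:  # hypothetical overflow limit
        raise OverflowError("N must be split over e and f or over more matrices")
    return f"1 0 0 1 {number} 0 Tm"


def recover_bits(text_matrix: str) -> str:
    """Read e from the text matrix, drop the '999' prefix, and decode the digits."""
    e = text_matrix.split()[4]
    if not e.startswith("999"):
        raise ValueError("text matrix does not carry hidden data")
    return digits_to_bits(e[3:])
```

Because the digits 1 to 8 never include 0, the decimal string has no leading zeros and the bit stream can be recovered from it without ambiguity.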
In addition, in order to reduce the file size, the PDF supports two compression filters, LZW and Flate compressions, for the content streams describing texts and graphics in the PDF document. So when we want to insert the text matrices, we need to decompress the page’s content stream first, and after inserting the text matrices in which we embed the secret message, we should compress the modified page’s content stream again by the default compression filter.
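For a content stream compressed by the Flate filter, the decompress, insert, and recompress steps described above can be sketched with the zlib module of Python. This is a minimal illustration only: the sample content stream is made up, the choice to place the text matrix just before the last ET operator is an assumption made for this sketch, and the LZW filter is not handled. The names rewrite_flate_stream and insert_before_last_et are hypothetical.

```python
import zlib
from typing import Callable


def rewrite_flate_stream(compressed: bytes, edit: Callable[[bytes], bytes]) -> bytes:
    """Decompress Flate-encoded stream data, apply an edit, and recompress it."""
    content = zlib.decompress(compressed)
    return zlib.compress(edit(content))


def insert_before_last_et(content: bytes, operators: bytes) -> bytes:
    """Insert operators before the last ET operator (assumed insertion point)."""
    pos = content.rfind(b"ET")
    if pos < 0:
        raise ValueError("no text object found in the content stream")
    return content[:pos] + operators + b"\n" + content[pos:]


# Hypothetical content stream written for this sketch only.
original = b"BT\n/F1 12 Tf\n72 720 Td\n(Hello) Tj\nET\n"
stego = rewrite_flate_stream(
    zlib.compress(original),
    lambda content: insert_before_last_et(content, b"1 0 0 1 999327 0 Tm"),
)
```

After such a rewrite, the length recorded in the dictionary of the stream object must also be changed to the new number of compressed bytes.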
Because we insert more data into the cover PDF, the offset of each indirect object and the offset of the cross-reference table may change. A wrong offset of an indirect object in the cross-reference table, or a wrong offset of the cross-reference table in the trailer, can cause a wrong display of the PDF document. So we need to update the cross-reference table and the trailer of the cover PDF to get the desired stego-PDF. More specifically, suppose we embed the secret message in an indirect object B to get B′. Since we insert more data into B, the size of B′ is bigger than that of B. So the offsets of the indirect objects located after B need to be increased by a value D, which is the difference in size between B and B′. And if the cross-reference table is also located after B, the offset recorded in the trailer needs to be updated in the same way.
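The offset update described above can be sketched in Python as follows. This is a minimal illustration: the function names, the dictionary from object numbers to byte offsets, and the numbers in the usage note are hypothetical, and rewriting the offsets as 10-digit entries in the file is not shown.

```python
from typing import Dict


def shift_object_offsets(offsets: Dict[int, int], offset_of_b: int, d: int) -> Dict[int, int]:
    """Increase by d the offset of every indirect object located after B."""
    return {
        number: offset + d if offset > offset_of_b else offset
        for number, offset in offsets.items()
    }


def shift_xref_offset(xref_offset: int, offset_of_b: int, d: int) -> int:
    """Increase the offset stored in the trailer if the table is located after B."""
    return xref_offset + d if xref_offset > offset_of_b else xref_offset
```

For example, with objects at byte offsets 15, 80, and 200, modifying the object at offset 80 so that it grows by D = 12 bytes leaves the first two offsets unchanged and moves the third to 212.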