Chapter 4 Quotation Authentication: A New Approach and Efficient Solutions by Data Hiding
4.6 Data Hiding Techniques for Quotation Authentication
4.6.2 Integrated Authenticable-Quotation
An authenticable-quotation created by the prototype implementation carries three different pieces of information. The first is the quotation itself and the second is the human-readable source author identity, both of which should be conveyed to the document reader directly. The third is an appropriate quotation signature that can be processed by the add-in to verify the fidelity and source of the quotation, and should ideally be invisible to the reader.
We propose to store the source author information in the visible comments section of a Word document, as shown in Figure 4.2. Such comments are usually displayed on the right-hand side in Microsoft Word and are clearly recognizable. A comment in Microsoft Word can be attached to any range of text in the document, and we use this property to demarcate the span of a quotation. Also, each comment in a Word document has an author field to help distinguish different commentators during collaborative editing. This author field is utilized in this study to mark comments that are used for quotation authentication by using a special value of “AUTH”. Finally, if a document author quotes an incomplete sentence, we include the unquoted parts of the sentence in the comment as well for the reader’s reference, and also to allow reconstruction of complete sentences for the purpose of quotation authentication.
We propose to embed the authentication information invisibly into an authenticable-quotation by first transforming the binary data into normal text by Base64 encoding to avoid misinterpretation. The transformed text is then inserted just after the first character of the quotation and made invisible by setting its “Font Effects” to “Hidden” [99]. It has been verified by experiments that this technique can be used to embed arbitrarily long information invisibly into a Microsoft Word document.
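The Base64 transformation described above can be sketched in Python as follows. This is only an illustration of the encoding step: the function names are hypothetical, and the hidden-font formatting itself is applied by the Word add-in and is not modelled here. The quotation is simply split into the first character, the hidden text run, and the remainder.

```python
import base64

def encode_signature(signature: bytes) -> str:
    """Turn binary authentication data into plain ASCII text."""
    return base64.b64encode(signature).decode("ascii")

def decode_signature(hidden_text: str) -> bytes:
    """Recover the binary authentication data from the hidden text run."""
    return base64.b64decode(hidden_text, validate=True)

def embed(quotation: str, signature: bytes) -> tuple[str, str, str]:
    """Return (first character, hidden run, rest of quotation).

    The hidden run is placed just after the first character; marking it
    as hidden is left to the word processor (assumption, not shown)."""
    return quotation[0], encode_signature(signature), quotation[1:]
```

Because Base64 output uses only letters, digits, "+", "/" and "=", the hidden run cannot be confused with control characters or markup in the document.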
The integrated authenticable-quotation that contains the quotation itself, the visible comment, and the invisible authentication information as described is generated automatically by the add-in by a click of the button. It has been verified in experiments that the authenticable-quotation is always copied in its entirety in copy-and-paste operations, making it easy for a document author to include such authenticable-quotations when composing documents.
On receiving a document containing authenticable-quotations, a document reader can verify the quotations by clicking on the “QA Verify” button installed by the authentication add-in, which scans the document for comments that contain the special author field “AUTH” and performs the following steps for each authenticable-quotation:
1. identify the span of a quotation by the range covered by the comment;
2. extract the hidden authentication information in the quotation and transform the text version back to the binary original by Base64 decoding;
3. if there exist partial sentences in the comment, then extract them and reconstruct the full quotation sentences;
4. verify the fidelity of the quotation using the proposed quotation verification technique; and
5. format the comment of the quotation to show the result of quotation verification, as illustrated in Figure 4.2.
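The verification steps above can be sketched as a simple loop. The sketch below is an assumption-laden illustration, not the add-in itself: the `Comment` class is a hypothetical stand-in for a Word comment and its attached range, and the `verify` callback stands for whichever quotation verification technique is in use.

```python
import base64
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Comment:
    """Hypothetical model of a Word comment attached to a quotation."""
    author: str            # "AUTH" marks an authenticable-quotation
    quoted_text: str       # visible text in the range covered by the comment
    hidden_text: str       # Base64 text hidden after the first character
    prefix: str = ""       # unquoted start of the first sentence, if any
    suffix: str = ""       # unquoted end of the last sentence, if any
    result: str = ""       # filled in by the verification pass

def verify_document(comments: List[Comment],
                    verify: Callable[[str, bytes], bool]) -> int:
    """Run steps 1 to 5 for every comment whose author field is "AUTH"."""
    checked = 0
    for c in comments:
        if c.author != "AUTH":
            continue                                    # ordinary comment
        signature = base64.b64decode(c.hidden_text)     # step 2
        full_text = c.prefix + c.quoted_text + c.suffix # step 3
        ok = verify(full_text, signature)               # step 4
        c.result = "verified" if ok else "FAILED"       # step 5
        checked += 1
    return checked
```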
4.7 Summary
The problem of quotation authentication is described in this chapter, and a new approach to solving the problem has been proposed that allows document readers to efficiently verify quotations cited from known sources but embedded in messages by untrusted document authors. The proposed approach only requires the three parties involved in the problem to perform simple steps, without requiring a trusted third party to endorse quotations or requiring
a document reader to access the original source document. Specifically, a source author is allowed to generate an appropriate source signature, such that any document author can generate a suitable quotation signature for arbitrary quotations from the source. The quotation signature is bundled together with the quotation using the proposed data hiding techniques to form an integrated authenticable-quotation that can easily be copied and pasted to any document. Finally, a document reader can identify any authenticable-quotations present in a document, and efficiently verify the source and the fidelity of the quotation.
We have started by describing the basic enumerate-all-quotations and quote-the-whole techniques, followed by the multi-use signatures technique and the tree root uni-signature technique that allow more efficient generation of source and quotation signatures. The two techniques have their respective merits depending on whether the message is widely distributed or not. The MUST is more efficient if there are a large number of document readers for a document, while the TRUST is better otherwise. The total overhead sizes and the signature sizes of the proposed techniques are summarized in Table 4.2 below.
Also, specific data hiding techniques suitable for embedding source and quotation signatures in Microsoft Word documents have been proposed to demonstrate the feasibility of the proposed techniques. Furthermore, add-ins that can be installed in Microsoft Word were described, which allow the three parties of the quotation authentication problem to perform their tasks easily.
Table 4.2. Summary of total overhead sizes and signature sizes of the proposed techniques.
Technique                   |GS|      |Gq|          Total overhead size
Enumerate-all-quotations    O(L^2)    O(1)          O(PL^2 + Q)
Quote-the-whole             O(1)      O(L)          O(PL + QL)
MUST                        O(L)      O(1)          O(PL + Q)
TRUST                       O(1)      O(log_2 L)    O(P log_2 L + Q log_2 L)
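To make the trade-off between MUST and TRUST concrete, the total overhead expressions of Table 4.2 can be evaluated for sample values of L, P and Q. The sketch below drops all constant factors hidden by the O-notation, so the numbers only indicate relative growth and are not measured sizes.

```python
import math

def overhead(technique: str, L: int, P: int, Q: int) -> float:
    """Total overhead of Table 4.2 with all hidden constants set to 1."""
    log_l = math.log2(L)
    formulas = {
        "enumerate-all-quotations": P * L**2 + Q,
        "quote-the-whole": P * L + Q * L,
        "MUST": P * L + Q,
        "TRUST": P * log_l + Q * log_l,
    }
    return formulas[technique]

# One document author, a source of 1024 units, many versus few readers.
many_readers = {t: overhead(t, 1024, 1, 1000) for t in ("MUST", "TRUST")}
few_readers = {t: overhead(t, 1024, 1, 10) for t in ("MUST", "TRUST")}
```

With these sample values MUST has the smaller total for 1000 readers and TRUST the smaller total for 10 readers, in line with the remark that MUST suits widely distributed documents.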
Chapter 5
Quotation Authentication and Content Authentication for Spreadsheet Documents
5.1 Introduction
The discussions in Chapter 4 assumed that documents and quotations contain text that flows consecutively. However, the contents of some documents are not sequential text. For example, a Microsoft Excel spreadsheet document contains sheets of two-dimensional data, or cells. A typical quotation from a spreadsheet document is not a run of sequential cells but a two-dimensional cutout of the cells. There are many large spreadsheets that contain information suitable for quoting, for example company financial statements, results of national voting, sales figures, and so on. It is common for a portion of such a large spreadsheet to be quoted and included as a table in a Microsoft Word document.
As an example, a company’s income statement may list the details of the revenue and expense figures in rows, with the values for different years or quarters across columns, as shown in Figure 5.1. If a business analyst (analogous to the document author in our problem of quotation authentication) quotes the figures of the revenues for a few selected years, the selection would be a two-dimensional subset of the spreadsheet, as illustrated in the figure.
Figure 5.1. Two-dimensional quotation in a spreadsheet document. (Source: Google Investor Relations)
Although Microsoft Excel documents can contain multiple worksheets where each is a two-dimensional array of cells, for simplicity we assume in this study that a source spreadsheet document S consists simply of X columns and Y rows of cells¹¹. A cell at column x and row y is denoted as s_{x,y} where 1 ≤ x ≤ X and 1 ≤ y ≤ Y, and each cell can be empty or can contain a value such as a text string or a numerical value. In this study we assume each cell value has a string representation and that s_{x,y} means the string representation of the cell whenever the context is clear. For example, if the top-left cell contains the value 53, this is denoted as s_{1,1} = “53”.
A two-dimensional quotation in this study, denoted as q(a, b)(c, d), is a rectangular cut-out of the cells of size A × B that has a top-left cell s_{a,b} and a bottom-right one s_{c,d} where 1 ≤ a ≤ c ≤ X, 1 ≤ b ≤ d ≤ Y, A = (c – a + 1), and B = (d – b + 1).
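The definition of a two-dimensional quotation can be illustrated with a short Python sketch. The list-of-rows representation and the function name `quote` are assumptions made for the example; indices are 1-based as in the text, with x counting columns and y counting rows.

```python
def quote(cells, a, b, c, d):
    """Return the cut-out q(a, b)(c, d) of a sheet stored as a list of rows.

    cells[y - 1][x - 1] holds the string representation of cell s_{x,y}."""
    X, Y = len(cells[0]), len(cells)
    if not (1 <= a <= c <= X and 1 <= b <= d <= Y):
        raise ValueError("quotation must satisfy 1 <= a <= c <= X and 1 <= b <= d <= Y")
    return [row[a - 1:c] for row in cells[b - 1:d]]

# A sheet with X = 4 columns and Y = 3 rows.
sheet = [["53", "54", "55", "56"],
         ["63", "64", "65", "66"],
         ["73", "74", "75", "76"]]
q = quote(sheet, 2, 1, 3, 2)   # top-left s_{2,1}, bottom-right s_{3,2}
A, B = len(q[0]), len(q)       # A = c - a + 1 columns, B = d - b + 1 rows
```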
The basic signature generation techniques described in Section 4.3 can be applied to the two-dimensional case here, but are inefficient. Specifically, for a spreadsheet containing X columns by Y rows of cells, the enumerate-all-quotations technique requires the source author to generate a signature for every possible rectangle that can be quoted. Since a rectangular quote can have a top-left corner starting at any position and a bottom-right corner ending at any position at or beyond it, the number of signatures generated is of the order O((X × Y)^2). On the other hand, all X × Y cells need to be included in a quotation signature for the quote-the-whole technique. The total overhead sizes for the two techniques are thus O(P(XY)^2 + Q) and O(PXY + QXY), respectively.
We describe below two better techniques, 2D-MUST and 2D-TRUST, that improve the total overhead size to O(PXY + Qmin(A, B)) and O(Plog_2 XY + Qlog_2 XY), respectively.
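The O((X × Y)^2) count of signatures for the enumerate-all-quotations technique can be checked directly. The sketch below, with hypothetical function names, counts every rectangle by brute force and compares the result with the closed form X(X + 1)/2 · Y(Y + 1)/2, which follows from choosing the column range and the row range independently and grows as (XY)^2/4.

```python
def count_rectangles(X: int, Y: int) -> int:
    """Count all quotations q(a, b)(c, d) with 1 <= a <= c <= X, 1 <= b <= d <= Y."""
    return sum(1
               for a in range(1, X + 1) for c in range(a, X + 1)
               for b in range(1, Y + 1) for d in range(b, Y + 1))

def closed_form(X: int, Y: int) -> int:
    """Column ranges times row ranges: X(X+1)/2 * Y(Y+1)/2."""
    return (X * (X + 1) // 2) * (Y * (Y + 1) // 2)
```

For a 3 × 2 sheet both functions give 18, and for a modest 100 × 100 sheet the closed form already exceeds 25 million signatures.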
Furthermore, we show that the proposed 2D-MUST can be applied to authenticate the contents of a spreadsheet document effectively. Specifically, source signatures can be generated and embedded into a Microsoft Excel document, such that modifications to the cell contents, transpositions of rows or columns, as well as additions and removals of rows or columns can be detected and changes highlighted.
5.2 Two-Dimensional Multi-Use Signatures Technique (2D-MUST)
The proposed improved technique, called the two-dimensional multi-use signatures technique (2D-MUST), generates a set of cascaded hash values and multi-use signatures of size O(XY) such that these can be used to cover all two-dimensional quotations q(a, b)(c, d) with a top-left cell s_{a,b} and a bottom-right one s_{c,d} where 1 ≤ a ≤ c ≤ X and 1 ≤ b ≤ d ≤ Y. The quotation signature, however, is no longer O(1) but linear in the smaller of the number of rows and the number of columns that the quotation spans, as described in detail in the following.
11 Microsoft Excel documents can contain contents other than cells, but we limit our discussion to cells and their authentication in this study.
5.2.1 Generation of 2D-MUST Source Signatures
The first part of generating source signatures is similar to that of the one-dimensional case, where each row is considered to be an independent one-dimensional document consisting of sequential cells. Specifically, we perform the first two steps of Algorithm 4.1 for each row where the input is taken to be the row’s content Sy = s1, y || s2, y || … || sX, y for row number y (1 ≤ y ≤ Y) to yield the cascaded hash values h1, y, h2, y, …, hX, y. However, we do not simply sign these cascaded hash values as we did in Step 3 of Algorithm 4.1. This is because for a quotation with a top-left cell sa, b and a bottom-right one sc, d:
1. each row in the quotation needs a separate digital signature, meaning that a total of (d – b + 1) digital signatures need to be included in Gq;
2. each digital signature can only verify the integrity of that row’s content, so the digital signatures cannot be used to detect transpositions of complete rows.
In the second part of the proposed technique, a second series of cascaded hash values, denoted as h'_{x,y} where 1 ≤ x ≤ X and 1 ≤ y ≤ Y, is generated. To avoid ambiguity, we call the first series of cascaded hash values the 1D cascaded hash values, and the second series the 2D cascaded hash values. The reason for this naming will be made apparent in the following.
The 2D cascaded hash values are generated in a downward direction in a column, in contrast to the 1D cascaded hash values that were generated in a rightward horizontal direction. Also, in contrast to the 1D cascaded hash values that use the cell contents to generate subsequent cascaded hash values, the 2D cascaded hash values are calculated using the 1D cascaded hash values, as illustrated in Figure 5.2 below. That is, whereas h_{x,y} is calculated as H(h_{x–1,y} || H(s_{x–1,y})), h'_{x,y} is calculated as H(h'_{x,y–1} || h_{x,y–1}), in agreement with Steps 1b and 2b of Algorithm 5.1. The details of the proposed technique are described below as an algorithm.
Algorithm 5.1: generation of a 2D-MUST source signature.
Input: a source document S consisting of X columns by Y rows of cells s_{x,y} where 1 ≤ x ≤ X and 1 ≤ y ≤ Y.
Output: X × Y 1D cascaded hash values h_{x,y}, X × Y 2D cascaded hash values h'_{x,y}, and X × Y multi-use signatures g_{x,y} where 1 ≤ x ≤ X and 1 ≤ y ≤ Y, which are included as part of a source signature GS.
Steps:
1. For each y from 1 to Y, compute the 1D cascaded hash values for row y as follows.
a. Set h_{1,y} = H(S_y) where H(·) is some hash function and S_y is the content of the whole row, that is, S_y = s_{1,y} || s_{2,y} || … || s_{X,y}.
b. For each x from 2 to X, compute h_{x,y} as h_{x,y} = H(h_{x–1,y} || H(s_{x–1,y})).
2. For each x from 1 to X, compute the 2D cascaded hash values for column x as follows.
a. Set h'_{x,1} = H(S'_x) where S'_x is the content of the whole column, that is, S'_x = s_{x,1} || s_{x,2} || … || s_{x,Y}.
b. For each y from 2 to Y, compute h'_{x,y} as h'_{x,y} = H(h'_{x,y–1} || h_{x,y–1}).
3. For each y from 1 to Y and for each x from 1 to X, compute the multi-use signature g_{x,y} as g_{x,y} = Sign(H(h'_{x,y} || H(h_{x,y} || H(s_{x,y})))), where Sign(·) is a signing function of some digital signature algorithm.
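The three steps of Algorithm 5.1 can be transcribed into Python as a minimal sketch. Several choices here are assumptions made only for illustration: SHA-256 plays the role of H(·), an HMAC keyed with a secret stands in for Sign(·) (it is not a real digital signature), cell strings are encoded as UTF-8, and "||" is taken as plain byte concatenation.

```python
import hashlib
import hmac

def H(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def sign(digest: bytes, key: bytes) -> bytes:
    # Stand-in for Sign(.): an HMAC, NOT an actual digital signature.
    return hmac.new(key, digest, hashlib.sha256).digest()

def generate_source_signature(cells, key: bytes):
    """cells[y - 1][x - 1] is the string s_{x,y}.

    Returns dicts h, h2, g keyed by (x, y) for the 1D cascaded hash
    values, the 2D cascaded hash values and the multi-use signatures."""
    Y, X = len(cells), len(cells[0])
    s = lambda x, y: cells[y - 1][x - 1].encode("utf-8")
    h, h2, g = {}, {}, {}
    for y in range(1, Y + 1):                      # Step 1: along each row
        h[(1, y)] = H(b"".join(s(x, y) for x in range(1, X + 1)))   # 1a
        for x in range(2, X + 1):
            h[(x, y)] = H(h[(x - 1, y)] + H(s(x - 1, y)))           # 1b
    for x in range(1, X + 1):                      # Step 2: down each column
        h2[(x, 1)] = H(b"".join(s(x, y) for y in range(1, Y + 1)))  # 2a
        for y in range(2, Y + 1):
            h2[(x, y)] = H(h2[(x, y - 1)] + h[(x, y - 1)])          # 2b
    for y in range(1, Y + 1):                      # Step 3: one signature per cell
        for x in range(1, X + 1):
            g[(x, y)] = sign(H(h2[(x, y)] + H(h[(x, y)] + H(s(x, y)))), key)
    return h, h2, g
```

The three dictionaries each hold X × Y entries, which matches the O(XY) size of the source signature.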
5.2.2 Generation and Verification of 2D-MUST Quotation Signatures
When a quotation q(a, b)(c, d) is quoted from S, an appropriate quotation signature Gq is generated using S and GS that includes the following.
1. A digital certificate of the source author AS, so that the document reader can verify the association of the identity of AS and its public key.
2. The starting 1D cascaded hash values for each row in the quotation, that is, h_{a,y} for b ≤ y ≤ d.
3. The starting 2D cascaded hash value for the last column in the quotation, that is, h'_{c,b}.
4. The multi-use signature g_{c,d}.
Figure 5.2. Illustration of cascaded hash value calculation for a cell sx, y in 2D-MUST.
Similar to the case of one-dimensional MUST, the starting 1D cascaded hash values included in item 2 above are used by a document reader RD to regenerate the 1D cascaded hash values h_{c,y} using q(a, b)(c, d) for b ≤ y ≤ d by a process similar to Step 1b of Algorithm 5.1. Then, the starting 2D cascaded hash value h'_{c,b} and the regenerated 1D cascaded hash values h_{c,y} where b ≤ y ≤ d are used to calculate the 2D cascaded hash value h'_{c,d} by a process similar to Step 2b of Algorithm 5.1. If any of the cells in the quotation has been modified, or if rows in the quotation have been transposed, then the value h'_{c,d} so generated by RD will not be the same as that generated by AD, and the verification performed by RD, namely Verify(H(h'_{c,d} || H(h_{c,d} || H(s_{c,d}))), g_{c,d}), will fail, where Verify(·) is the digital signature verification function corresponding to the Sign(·) function used by the source author.
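The generation and verification of a 2D-MUST quotation signature described above can be sketched as follows. As before, this is an illustration under stated assumptions rather than the prototype: SHA-256 stands for H(·), a keyed HMAC stands in for the Sign(·)/Verify(·) pair (so the verifier here needs the same key, unlike a real digital signature), the digital certificate of item 1 is omitted, and the function names are hypothetical.

```python
import hashlib
import hmac

def H(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def sign(digest: bytes, key: bytes) -> bytes:
    return hmac.new(key, digest, hashlib.sha256).digest()  # stand-in for Sign(.)

def source_signature(cells, key):
    """Algorithm 5.1; cells[y - 1][x - 1] is the string s_{x,y}."""
    Y, X = len(cells), len(cells[0])
    s = lambda x, y: cells[y - 1][x - 1].encode()
    h, h2, g = {}, {}, {}
    for y in range(1, Y + 1):
        h[(1, y)] = H(b"".join(s(x, y) for x in range(1, X + 1)))
        for x in range(2, X + 1):
            h[(x, y)] = H(h[(x - 1, y)] + H(s(x - 1, y)))
    for x in range(1, X + 1):
        h2[(x, 1)] = H(b"".join(s(x, y) for y in range(1, Y + 1)))
        for y in range(2, Y + 1):
            h2[(x, y)] = H(h2[(x, y - 1)] + h[(x, y - 1)])
    for (x, y) in h:
        g[(x, y)] = sign(H(h2[(x, y)] + H(h[(x, y)] + H(s(x, y)))), key)
    return h, h2, g

def quotation_signature(sig, a, b, c, d):
    """Items 2 to 4 of Gq for the quotation q(a, b)(c, d)."""
    h, h2, g = sig
    return {"h_start": [h[(a, y)] for y in range(b, d + 1)],  # h_{a,y}
            "h2_start": h2[(c, b)],                           # h'_{c,b}
            "g": g[(c, d)]}                                   # g_{c,d}

def verify_quotation(q, gq, key) -> bool:
    """q is the quoted rectangle as a list of rows of cell strings."""
    if len(q) != len(gq["h_start"]):
        return False
    h_end = []
    for row, h_cur in zip(q, gq["h_start"]):
        for cell in row[:-1]:                  # s_{a,y} .. s_{c-1,y}, as in Step 1b
            h_cur = H(h_cur + H(cell.encode()))
        h_end.append(h_cur)                    # regenerated h_{c,y}
    h2_cur = gq["h2_start"]
    for h_cy in h_end[:-1]:                    # h_{c,b} .. h_{c,d-1}, as in Step 2b
        h2_cur = H(h2_cur + h_cy)              # ends at h'_{c,d}
    digest = H(h2_cur + H(h_end[-1] + H(q[-1][-1].encode())))
    return hmac.compare_digest(sign(digest, key), gq["g"])
```

The quotation signature carries one starting hash value per quoted row plus two fixed items, which is the O(B) size derived in the next subsection.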
5.2.3 Total Overhead Size of 2D-MUST
A 2D-MUST source signature GS for a source document S with X columns by Y rows of cells always contains exactly X × Y 1D cascaded hash values, X × Y 2D cascaded hash values, and X × Y multi-use signatures, and so the size of GS is of the order O(XY). A 2D-MUST quotation signature Gq for a quotation of size A × B contains B 1D cascaded hash values, one 2D cascaded hash value, and one multi-use signature, and thus the size of Gq is of the order O(B). The proposed technique can also be flipped around, with 1D cascaded hash values generated for columns and 2D cascaded hash values across rows, so that the quotation signature Gq for a quotation of size A × B contains A 1D cascaded hash values and hence is of the order O(A) instead of O(B). If both types of cascaded hash values and multi-use signatures are generated by the source author, the size of GS will double but still be of the order O(XY), while the size of Gq will be O(min(A, B)) since a document author can choose whichever of the two sets of cascaded hash values and multi-use signatures yields the smaller quotation signature.
Consequently, the total overhead size of the 2D-MUST with P document authors and Q document readers, assuming that the average numbers of columns and rows quoted are A and B respectively, is O(PXY + Pmin(A, B) + Qmin(A, B)), or equivalently, O(PXY + Qmin(A, B)), contrasted with the total overhead size of O(P(XY)^2 + Q) for the enumerate-all-quotations technique.
5.3 Two-Dimensional Tree Root Uni-Signature Technique (2D-TRUST)
The next proposed improved technique, called the two-dimensional tree root uni-signature technique (2D-TRUST), uses a tree-like construction of hash values similar to that of the previously described one-dimensional TRUST such that the size of quotation signatures can be reduced to the order O(log_2 XY), instead of the O(XY) of the basic quote-the-whole technique. Only the root of the two-dimensional tree of hash values needs to be signed by the source author, thus keeping the size of the source signature at the order O(1).
5.3.1 Generation of 2D-TRUST Source Signatures
For a source spreadsheet document S consisting of X × Y cells, the source author generates a two-dimensional tree of hash values for the cells in S in a way similar to that of the one-dimensional TRUST, as described in detail in the following algorithm. The hash value h^r_{1,1} of the tree root is then signed with some digital signature algorithm to get a two-dimensional tree root uni-signature g^r_{1,1} = Sign(h^r_{1,1}).
Algorithm 5.2: generation of a 2D-TRUST tree of hash values.
Input: a source spreadsheet S consisting of X × Y cells s_{x,y} where 1 ≤ x ≤ X and 1 ≤ y ≤ Y.
Output: a two-dimensional tree of hash values h^i_{x,y} where i is the depth of the tree node; x and y are the indices of the nodes at depth i; and h^r_{1,1} is the hash value of the tree root.
Steps:
1. Calculate a hash value for each cell s_{x,y} to get the lowest-level hash values, that is, set h^1_{x,y} = H(s_{x,y}) for 1 ≤ x ≤ X and 1 ≤ y ≤ Y.
2. Initialize the value of i to be 1.
3. For all values of x and y where 1 ≤ x ≤ X/2 and 1 ≤ y ≤ Y/2, concatenate the hash values in fours to compute the next-level hash values as h^{i+1}_{x,y} = H(h^i_{2x–1,2y–1} || h^i_{2x–1,2y} || h^i_{2x,2y–1} || h^i_{2x,2y}).
4. Perform one of the following operations, depending on the values of X and Y:
a. if both X and Y are even, then set X to be X/2 and set Y to be Y/2;
b. if X is odd and Y is even, then set X to be (X + 1)/2 and Y to be Y/2; and set h^{i+1}_{X,Y} = H(h^i_{2X–1,2Y–1} || h^i_{2X–1,2Y});
c. if X is even and Y is odd, then set X to be X/2 and Y to be (Y + 1)/2; and set h^{i+1}_{X,Y} = H(h^i_{2X–1,2Y–1} || h^i_{2X,2Y–1});
d. if both X and Y are odd, then set X to be (X + 1)/2 and Y to be (Y + 1)/2; and set h^{i+1}_{X,Y} = h^i_{2X–1,2Y–1}.
5. Increment the value of i by 1.
6. Denote the numbers of columns and rows of hash values for level i as X_i = X and Y_i = Y, respectively.
7. Go to Step 3 if the tree root hash value has not been calculated, or equivalently, if X or Y is larger than 1.
8. Set the total number of levels r to be i.
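The tree construction of Algorithm 5.2 can be sketched in Python as below. The sketch is one reading of the algorithm, not a transcription: SHA-256 is assumed for H(·), and the odd-size cases of Step 4 are applied to every node on the right-hand and bottom edges (hashing whichever children exist, and passing a lone child up unchanged), which reduces to cases b, c and d at the corner node.

```python
import hashlib

def H(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def parent(level, x, y):
    """Next-level hash for node (x, y) from its existing children.

    Children are taken in the order (2x-1, 2y-1), (2x-1, 2y),
    (2x, 2y-1), (2x, 2y); a lone child is passed up unchanged."""
    keys = ((2*x - 1, 2*y - 1), (2*x - 1, 2*y), (2*x, 2*y - 1), (2*x, 2*y))
    kids = [level[k] for k in keys if k in level]
    return kids[0] if len(kids) == 1 else H(b"".join(kids))

def build_tree(cells):
    """cells[y - 1][x - 1] is the string s_{x,y}.

    Returns a list of (X_i, Y_i, {(x, y): hash}) for levels i = 1..r."""
    Y, X = len(cells), len(cells[0])
    level = {(x, y): H(cells[y - 1][x - 1].encode())
             for x in range(1, X + 1) for y in range(1, Y + 1)}
    levels = [(X, Y, level)]
    while X > 1 or Y > 1:                       # Step 7
        X, Y = (X + 1) // 2, (Y + 1) // 2       # Step 4: halve, rounding up
        level = {(x, y): parent(level, x, y)
                 for x in range(1, X + 1) for y in range(1, Y + 1)}
        levels.append((X, Y, level))
    return levels
```

For a 5 × 5 sheet this yields levels of size 5 × 5, 3 × 3, 2 × 2 and 1 × 1, so r = 4, and only the single root hash needs to be signed.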
5.3.2 Generation of 2D-TRUST Quotation Signatures
When a document author AD quotes a two-dimensional rectangle of cells in a source spreadsheet, the quotation signature Gq is generated by including the signature gr1, 1 and a set of complementary hash values Hq containing some of the hash values generated in Algorithm 5.2. Specifically, for a rectangular quotation q(a, b)(c, d) with a top-left cell sa, b and a bottom-right one sc, d where 1 ≤ a ≤ c ≤ X and 1 ≤ b ≤ d ≤ Y, the complementary hash set is selected using the algorithm below. The purpose of Hq is the same as that in the one dimensional TRUST, that is, the tree root hash value hr1, 1 can be reconstructed from q and Hq.
Algorithm 5.3: generation of a 2D-TRUST complementary hash set.
Input: a source spreadsheet S consisting of X × Y cells s_{x,y} where 1 ≤ x ≤ X and 1 ≤ y ≤ Y, and a two-dimensional rectangle of quoted cells q(a, b)(c, d) with a top-left cell s_{a,b} and a bottom-right one s_{c,d} where 1 ≤ a ≤ c ≤ X and 1 ≤ b ≤ d ≤ Y.
Output: a complementary hash set Hq.
Steps:
1. Calculate from S the tree of hash values h^i_{x,y} using Algorithm 5.2, with X_i × Y_i hash values generated for level i.
2. Set Hq to be the empty set initially, and set the value of i to be 1.
3. Add to Hq the following hash values so that the next-level hash values can be reconstructed from q:
a. if a is even and b is odd, then add the values h^i_{a–1,b} and h^i_{a–1,b+1} to Hq;
b. if a is odd and b is even, then add the values h^i_{a,b–1} and h^i_{a+1,b–1} to Hq;
c. if both a and b are even, then add the values h^i_{a–1,b–1}, h^i_{a,b–1}, and h^i_{a–1,b} to Hq;
d. if c is even, d is odd, and d < Y_i, then add the values h^i_{c–1,d+1} and h^i_{c,d+1} to Hq;
e. if c is odd, d is even, and c < X_i, then add the values h^i_{c+1,d–1} and h^i_{c+1,d} to Hq;
f. if both c and d are odd, then:
i. add the value h^i_{c+1,d} to Hq if c < X_i;
ii. add the value h^i_{c,d+1} to Hq if d < Y_i;
iii. add the value h^i_{c+1,d+1} to Hq if c < X_i and d < Y_i.
4. Set a to be a/2, b to be b/2, c to be c/2, and d to be d/2, each rounded up to the nearest integer.
5. Increment the value of i by 1, and go to Step 3 if i is smaller than r.
As a simple example, when quoting a single cell s_{3,2} from a 5×5 spreadsheet, we add the values h^1_{3,1}, h^1_{4,1}, and h^1_{4,2} to Hq when i is 1 (note that the hash value h^1_{4,1} is added twice, once in Step 3b and once in Step 3e). The values of a and c are then set to be 2 while the values of b and d are set to be 1 in Step 4. In the next iteration, the values h^2_{1,1}, h^2_{1,2}, and h^2_{2,2} are added to Hq when i is 2, and the values of a, b, c, and d are all set to be 1. In the last iteration, the hash values h^3_{2,1}, h^3_{1,2}, and h^3_{2,2} are added to Hq when i is 3. The hash values added to Hq for this example are illustrated in Table 5.1 below.
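The selection of the complementary hash set, and the reconstruction of the tree root from q and Hq, can be sketched in Python as follows. This is a generalised reading of Step 3 of Algorithm 5.3 rather than a line-by-line transcription: at each level the quoted rectangle is widened to whole 2 × 2 blocks and every hash value in the widened area that is not derivable from the quotation is added once, keyed by (i, x, y). SHA-256 for H(·), the helper names, and the keyed storage of Hq (instead of retrieval in insertion order) are assumptions of the sketch. For the single-cell example above it selects the same nine values.

```python
import hashlib

def H(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def parent(level, x, y):
    """Hash of the existing children of (x, y); a lone child passes up unchanged."""
    keys = ((2*x - 1, 2*y - 1), (2*x - 1, 2*y), (2*x, 2*y - 1), (2*x, 2*y))
    kids = [level[k] for k in keys if k in level]
    return kids[0] if len(kids) == 1 else H(b"".join(kids))

def build_tree(cells):
    """Tree of Algorithm 5.2 as a list of (X_i, Y_i, {(x, y): hash})."""
    Y, X = len(cells), len(cells[0])
    level = {(x, y): H(cells[y - 1][x - 1].encode())
             for x in range(1, X + 1) for y in range(1, Y + 1)}
    levels = [(X, Y, level)]
    while X > 1 or Y > 1:
        X, Y = (X + 1) // 2, (Y + 1) // 2
        level = {(x, y): parent(level, x, y)
                 for x in range(1, X + 1) for y in range(1, Y + 1)}
        levels.append((X, Y, level))
    return levels

def block(a, b, c, d, X, Y):
    """Rectangle (a, b)-(c, d) widened to whole 2x2 blocks, clipped to X x Y."""
    return (2 * ((a + 1) // 2) - 1, 2 * ((b + 1) // 2) - 1,
            min(2 * ((c + 1) // 2), X), min(2 * ((d + 1) // 2), Y))

def up(v):
    return (v + 1) // 2          # Step 4: halve, rounding up

def complementary_set(levels, a, b, c, d):
    Hq = {}
    for i, (X, Y, level) in enumerate(levels[:-1], start=1):
        a0, b0, c0, d0 = block(a, b, c, d, X, Y)
        for x in range(a0, c0 + 1):
            for y in range(b0, d0 + 1):
                if not (a <= x <= c and b <= y <= d):
                    Hq[(i, x, y)] = level[(x, y)]
        a, b, c, d = up(a), up(b), up(c), up(d)
    return Hq

def reconstruct_root(q, a, b, c, d, Hq, dims):
    """q maps (x, y) to the quoted cell string; dims lists (X_i, Y_i) for i = 1..r."""
    known = {k: H(v.encode()) for k, v in q.items()}
    for i, (X, Y) in enumerate(dims[:-1], start=1):
        a0, b0, c0, d0 = block(a, b, c, d, X, Y)
        for x in range(a0, c0 + 1):
            for y in range(b0, d0 + 1):
                if (x, y) not in known:
                    known[(x, y)] = Hq[(i, x, y)]
        a, b, c, d = up(a), up(b), up(c), up(d)
        known = {(x, y): parent(known, x, y)
                 for x in range(a, c + 1) for y in range(b, d + 1)}
    return known[(1, 1)]
```

A reader who recomputes the root this way and finds that it matches the signed root has confirmed the quoted cells; any change to a quoted cell changes the recomputed root.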
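Table 5.1. Hash values selected in the 2D-TRUST complementary hash set when quoting a cell s_{3,2} from a 5×5 spreadsheet. Rows are indexed by x and columns by y; a hash value repeated across several positions covers all of those cells.
x \ y   1           2           3           4           5
1       h^2_{1,1}   h^2_{1,1}   h^2_{1,2}   h^2_{1,2}   h^3_{1,2}
2       h^2_{1,1}   h^2_{1,1}   h^2_{1,2}   h^2_{1,2}   h^3_{1,2}
3       h^1_{3,1}   s_{3,2}     h^2_{2,2}   h^2_{2,2}   h^3_{1,2}
4       h^1_{4,1}   h^1_{4,2}   h^2_{2,2}   h^2_{2,2}   h^3_{1,2}
5       h^3_{2,1}   h^3_{2,1}   h^3_{2,1}   h^3_{2,1}   h^3_{2,2}
5.3.3 Verification of 2D-TRUST Quotation Signatures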
It is assumed that a document reader RD can identify a 2D-TRUST authenticable-quotation in a message and can extract or reconstruct the two-dimensional quotation q(a, b)(c, d), the signature g^r_{1,1}, the complementary hash set Hq, and the values X_i and Y_i for 1 ≤ i ≤ r. Also, we assume that RD can retrieve hash values from Hq in the same order as hash values were added to Hq by AD. To verify q, RD performs the steps described by the following algorithm to reconstruct the tree root hash value h^r_{1,1} using q and Hq, and then verifies the quotation using g^r_{1,1}.
Algorithm 5.4: verification of a 2D-TRUST quotation signature.
Input: a two-dimensional quotation q(a, b)(c, d) with a top-left cell of s_{a,b} and a bottom-right one of s_{c,d} where a ≤ c and b ≤ d, a signature g^r_{1,1}, and a complementary hash set Hq.
Output: result of quotation verification.
Steps:
1. Set the value of i to be 1, and calculate the lowest-level hash values for the cells in the quotation, that is, set h^1_{x,y} = H(s_{x,y}) for a ≤ x ≤ c and b ≤ y ≤ d.
2. Retrieve the following values from Hq to calculate the next-level hash values:
a. if a is even and b is odd, then retrieve the values h^i_{a–1,b} and h^i_{a–1,b+1} from Hq;
b. if a is odd and b is even, then retrieve the values h^i_{a,b–1} and h^i_{a+1,b–1} from Hq;
c. if both a and b are even, then retrieve the values h^i_{a–1,b–1}, h^i_{a,b–1}, and h^i_{a–1,b};
d. if c is even, d is odd, and d < Y_i, then retrieve the values h^i_{c–1,d+1} and h^i_{c,d+1};
e. if c is odd, d is even, and c < X_i, then retrieve the values h^i_{c+1,d–1} and h^i_{c+1,d};
f. if both c and d are odd, then:
i. retrieve the value h^i_{c+1,d} from Hq if c < X_i;
ii. retrieve the value h^i_{c,d+1} from Hq if d < Y_i;