A Decentralized Cloud Storage System with Data Confidentiality and Fault Tolerance

Data Confidentiality and Robustness in Decentralized Cloud Storage Systems

A Dissertation by Hsiao-Ying Lin

Submitted to the Department of Computer Science, National Chiao Tung University, in partial fulfillment of the requirements for the Degree of Doctor of Philosophy in Computer Science

Dissertation Director: Wen-Guey Tzeng

May 2010

© 2010 - Hsiao-Ying Lin

Abstract

A cloud storage system, consisting of a collection of storage servers, provides long-term storage services over the Internet. A user can store data in the system and access them from anywhere at any time via the Internet. However, storing data in a third party's cloud system raises serious concerns about data confidentiality. We consider a cloud storage system model that has no central authority and develop a tight integration of public key encryption schemes and random erasure codes. Using this integration, we present a secure cloud storage system that guarantees data confidentiality and robustness and supports secure data forwarding. Hence, in our storage system, a user can not only store data securely but also forward data to another user in a confidential way.

Keywords: Randomized erasure codes, homomorphic encryption schemes, networked storage systems, cloud storage systems, network coding.

Acknowledgements

I still find it hard to believe that this is the end of my graduate school days. There are several people without whom this work would not have been possible. First and foremost is my advisor, Prof. Wen-Guey Tzeng. With his endless encouragement and patience, he has been a terrific advisor. Many years of collaboration with him have come to fruition in this dissertation. My appreciation also goes to all the other members of my committees for their valuable feedback and time. Prof. John Kubiatowicz and Prof. Doug Tygar gave me a large amount of their time and served as my advisors during my third year, when I was at UC Berkeley.

Despite being neither my advisor nor a member of my thesis committees, Dr. Cheng-Kang Chu, one of my senior officemates, has taught me many things, from the tricks of security proofs to the administrative services of our laboratory. Most importantly, he always had time for me whenever I asked, even after he had graduated and become a researcher in Singapore. These acknowledgements would be incomplete without thanking all of my officemates for the wonderful time we had together. In particular, Shiuan-Tzuo and Yi-Ruei always helped me out with the huge amount of administrative work.

I thank the department administrative staff not only for taking care of the university formalities for me but also for encouraging and supporting me whenever I needed it. I am going to miss the cookies and cakes they gave me when I occasionally walked into the department office.

I cannot thank my parents and my husband enough in words for their unconditional love and support at all times. I would also like to thank my sisters for their constant support during my years at NCTU.

List of Figures

1.1 The centralized architecture and the decentralized architecture.
1.2 IDC analysis.
2.1 Replication on storage devices.
2.2 RAID-5.
2.3 RAID-6.
2.4 An encoding example of a Reed Solomon code.
2.5 An example of a storage system that uses a linear code.
2.6 The EVENODD encoding.
2.7 A system using the EVENODD encoding.
2.8 The STAR encoding and a system storing the encoded data.
2.9 A networked storage system that uses a LDPC code.
2.10 The encoding structure of Tornado codes.
2.11 An abstract overview of a fountain code.
2.12 A networked storage system using a random linear code.
2.13 A networked storage system using a decentralized erasure code.
2.14 A networked storage system that uses a hybrid strategy.
2.15 File 1 before and after being encrypted in a Plutus system.
3.1 A storage system using the random erasure code.
3.2 Our first system model of our secure cloud storage system.
3.3 The advanced system model of our secure cloud storage system.
3.4 A storage system using the secure decentralized erasure code.
3.5 The security game for the chosen plaintext attack.
4.1 The primitive encryption scheme.
4.2 The flowchart of the threshold encryption scheme.
4.3 The storage process of our first secure cloud storage system.
4.4 A storage system using the secure decentralized erasure code.
4.5 The event of a successful retrieval is showed as the shadow area.
4.6 The random bipartite graph H.
4.7 The reduction for the security of our first secure storage system.
5.1 The ReKeyGen algorithm of the proxy re-encryption scheme.
5.2 The ReEnc algorithm of the proxy re-encryption scheme.
5.3 The Dec algorithm of the proxy re-encryption scheme.

List of Tables

4.1 Computation cost of each step in our first secure storage system.
5.1 Computation cost of each algorithm in our advanced secure

Chapter 1 Introduction

A cloud that provides on-demand services on the Internet consists of a collection of servers connected to each other via networks. The resources contributed by the servers, such as storage and computing power, are implicitly merged into one huge resource and shared among users. A cloud storage system is a collection of storage servers that provides a location-independent storage service as long as a user has network access. A user can store data in and retrieve data from a cloud storage system from anywhere at any time. Companies can also outsource their data storage and management to a cloud storage system. Cloud storage systems not only provide the storage service as a purely virtual file system but also contribute to many service-based applications. For instance, web-mail systems [1] allow people to read and write emails through a web browser without storing those emails on local machines [2].

The advantage of using cloud storage systems instead of local storage devices is substantial. A user can enjoy the cloud storage service ubiquitously without the heavy burden of hardware or software maintenance. In addition to low maintenance cost and ubiquitous use, cloud storage systems guarantee data availability for a long period of time. Not only users but also service providers benefit from cloud storage systems. Service providers can deliver quality of service to users non-interactively and in real time.

Figure 1.1: The centralized architecture and the decentralized architecture. The centralized architecture has a central authority that manages how data are stored and retrieved. In the decentralized architecture, each storage server independently manages the data storage.

Cloud storage systems are not new technologies. A networked storage system is a simple cloud storage system, and this technology can be traced back to network-attached storage (NAS) [3] and the Network File System (NFS) [4]. Extra storage devices are deployed in the network, and a user can access the devices via a network connection. Later, the extra storage devices were replaced by storage servers because hardware became much more economical. Moreover, the decentralized architecture of storage servers was proposed for better scalability because any storage server can join or leave without the control of a central authority. Figure 1.1 shows the centralized architecture and the decentralized architecture of the storage servers.

Driven by rapid and ubiquitous network access technology, cloud storage systems have become a reality. For example, Amazon Simple Storage Service (S3) provides a storage service in which users store and retrieve data via a web interface, and IBM also offers a storage cloud for enterprises. However, storing data in the cloud raises many security issues. According to a research report from the International Data Corporation (IDC) Enterprise Panel [5], security is the top challenge for cloud services. The second and third issues are performance and availability. Figure 1.2 shows the IDC analysis of the issues of the cloud. Without a security guarantee, few people will use the system, and those who do face potential risks of privacy leaks and breaches of confidentiality.

In the history of networked storage systems, which are the ancestors of cloud storage systems, some systems have used encryption schemes to provide data confidentiality. Some storage systems [6, 7, 8, 9] use symmetric encryption schemes to encrypt data on disk. In those storage systems, the storage servers own the keys. Thus, the data are not confidential against the storage servers. Some storage systems, such as Farsite [10], use asymmetric encryption schemes to encrypt the stored data. However, those systems use the replication mechanism for data robustness, and the replication mechanism causes a high storage overhead.

Motivated by the strong need for security mechanisms for cloud storage systems, this dissertation focuses on the data confidentiality and robustness of cloud storage systems with the decentralized architecture. My dissertation statement can be summarized as follows.

[...] architecture provides strong data confidentiality against collusion of all storage servers and is robust with low storage overhead. Furthermore, the system is adaptable to allow data forwarding inside the cloud in a confidential way.

Figure 1.2: IDC analysis. IDC conducted a survey of 244 IT executives/CIOs and their colleagues about their companies' views on IT cloud services. The main results are summarized. The top challenge is security. IDC concluded that cloud services still need to be more secure.

To validate my dissertation statement, this dissertation presents a cloud storage system that is decentralized and provides data confidentiality even if all storage servers are compromised. Moreover, it also provides an improved cloud storage system, which additionally supports the data forwarding function in a confidential way. The relationship between the communication-efficiency parameter and the probability of a successful data retrieval event is investigated. A general setting of the parameters is provided according to the scale of the storage system.

Through the process of validating my dissertation statement, this dissertation describes two cloud storage systems and provides a theoretical analysis of both the performance and the probability of a successful data retrieval event. The data confidentiality and robustness issues are addressed by using public key encryption schemes and random erasure codes. The core technique we developed is a tight integration of public key encryption schemes and random erasure codes. The following novel contributions are made in this dissertation:

• A novel cloud storage system model that provides both storage service and key-management service.

• Cloud storage systems that allow each storage server to encode data independently while the data are in encrypted form.

• A cloud storage system that allows data to be forwarded in encrypted form after the data are encoded and distributed.

• A general setting for the communication-efficiency parameters, suggested according to the scale of the storage system.

Roadmap. Chapter 2 presents some background on networked storage systems and a discussion of the security needs of networked storage systems to illustrate our motivation for secure cloud storage systems. Chapter 3 presents the algebraic setting and notation, our first cloud storage system model, the advanced cloud storage system model, and the threat model for data confidentiality. Chapter 4 presents the construction of the first secure cloud storage system. It also contains the full analysis of correctness, performance, security, and the probability of a successful data retrieval event. Chapter 5 presents the advanced cloud storage system, which supports the data forwarding function. Chapter 6 explores a variety of mechanisms for data integrity checking in cloud storage systems and discusses how our systems can potentially be integrated with some of them. Chapter 7 concludes and presents some future directions for further research.

Chapter 2 Review of Networked Storage Systems

We introduce the problem domain of this dissertation in this chapter. There has been much research on networked storage systems in the last decade. Networked storage systems are classified as centralized or decentralized depending on the existence of a central authority. Because the purpose of networked storage systems is to store data reliably over long periods of time, this chapter further classifies them by the technique they employ for data robustness against machine failures. This chapter describes some well-known networked storage systems in the literature, although this brief survey is by no means complete or exhaustive. In particular, we do not cover papers related to the detection and correction of altered data in networked storage systems. After introducing the basic research area, we discuss the data confidentiality problem in the context of networked storage systems.

Robust storage technologies. Keeping data available despite machine failures requires the introduction of redundancy. Replication is the simplest approach for achieving resilience to arbitrary corruption of storage. Unfortunately, this method has a high overhead because a full copy of the file is stored at each storage server. Erasure codes are applied in many networked storage systems to tolerate failures of storage servers with lower storage overhead. Networked storage systems apply one of these technologies or a mixture of them to provide robust data storage in either a centralized or a decentralized architecture.

2.1 Centralized Data Management

A centralized networked storage system has a central authority which may be a single server or a small cluster of central servers. The central authority has a global view of the whole storage system, such as the network topology and the global routing table, and controls how data are stored among many storage servers.

2.1.1 Mirrors and Replicas

In the early years, network-attached storage (NAS) [3] deployed extra storage devices in the network, and a user could access the devices via a network connection. The Network File System (NFS) [4] was then proposed for file system applications. A user can back up his data on the NAS devices. This mechanism is analogous to the one in a traditional file system with multiple directly attached storage disks, such as a redundant array of independent disks (RAID) at level 1 (RAID-1). The difference is that replicas are transmitted via a network connection instead of a physical connection. In contrast to NAS, which is a file-wise storage system, a storage area network (SAN) is a block-wise storage system. A file system can be built upon a SAN. Mirroring is the simplest robustness technique offered by a SAN. Figure 2.1 illustrates RAID-1 and a small SAN.

Figure 2.1: Replication on storage devices. The left figure is RAID-1 and the right figure is a small SAN with 4 storage devices. In RAID-1, each drive stores one replica of the data. Similarly, each storage device in the SAN stores a copy of the data. However, the storage devices in the SAN are consolidated via a high-speed network.

Many file systems built upon networked storage systems use replication and store replicas on multiple storage servers. For example, AFS [11], Coda [12], SFS [13], and Plutus [14] are all distributed file systems that store replicas and have varied replication mechanisms. AFS keeps all replicas read-only except for one writable replica in order to achieve consistency among all replicas, while Coda allows all storage servers to receive updates and uses an expensive repair mechanism to handle conflicts.

When a networked storage system grows large, the replica placement and consistency problems arise. The replica placement problem is to decide which set of storage servers should store the replicas so that the lifetime of the data is extended. The consistency problem is to keep all replicas consistent with the original. Fortunately, the central authority can address these two problems well by using the global system information.

2.1.2 Erasure Codes

RAID-5 uses an erasure code to survive one drive failure, and RAID-6 can survive the failure of two drives. Figure 2.2 and Figure 2.3 show how data are stored in RAID-5 and RAID-6. In RAID-5 systems, a parity operation is performed and one additional parity result is stored to tolerate one erasure error. In RAID-6 systems, a parity computation and a Reed-Solomon encoding are performed over the stored data. As a result, a RAID-6 system can tolerate up to two erasure errors.
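The single-parity idea behind RAID-5 can be illustrated with a short Python sketch. It assumes the parity is a bitwise XOR over equal-length blocks, as stated in the caption of Figure 2.2; the function names and the three sample blocks are hypothetical, and striping and parity rotation across drives are deliberately omitted.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the bitwise XOR of equal-length byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def raid5_parity(data_blocks):
    """Compute the single parity block stored alongside the data blocks."""
    return xor_blocks(data_blocks)


def recover_missing(surviving_blocks):
    """Rebuild the one missing block from all surviving blocks (data and parity)."""
    return xor_blocks(surviving_blocks)


# Three data drives plus one parity drive (illustrative contents).
data = [b"ABCD", b"EFGH", b"IJKL"]
parity = raid5_parity(data)

# Simulate the loss of the second drive and rebuild it from the rest.
recovered = recover_missing([data[0], data[2], parity])
```

Because XOR is its own inverse, the same routine serves for both encoding and recovery, which is why one parity block tolerates exactly one erasure.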

Erasure codes are codes that encode an input of k symbols as a codeword of n symbols such that, as long as any k out of the n codeword symbols are available, the original k symbols can be decoded. Such a code can tolerate (n − k) erasure errors.

Erasure codes are applied in many networked storage systems for data robustness with lower storage overhead. They are mainly used in the following framework, in which a machine failure is modeled as an erasure error. Assume that there are n storage servers in a networked storage system. A user treats a file as the input of the encoding algorithm and encodes the file as a codeword. Each storage server then stores one symbol of the codeword. As a result, as long as k out of the n storage servers are available, the file can be decoded and retrieved by the user. The scalable distributed storage [15] uses the Lincoln Erasure Codes, a class of erasure codes, to provide data availability.

Figure 2.2: RAID-5. The left figure is RAID-5 with 3 drives and the right figure is RAID-5 with 4 drives. The parity function is a bitwise XOR over the input. In a RAID-5 system, data are available as long as at most one drive crashes.

Figure 2.3: RAID-6. The figure shows a RAID-6 with 4 drives. In a RAID-6 system, data are available as long as at most two drives crash.

The central authority of a networked storage system can choose a special coding method to obtain better error tolerance or better coding performance. We categorize the storage systems that use erasure codes by the operations the erasure codes employ.

Algebraic Operation based Erasure Codes

One of the most well-known families of erasure codes is the Reed-Solomon codes, and the storage system in [16] uses Reed-Solomon codes to tolerate both erasures and faulty symbols. A simple Reed-Solomon code is described as follows and illustrated in Figure 2.4. A message m is represented as k elements of a finite field, i.e. $m = (m_1, m_2, \ldots, m_k)$. Consider a polynomial function f of degree k − 1, where $f(x) = m_1 + m_2 x + m_3 x^2 + \cdots + m_k x^{k-1}$. A codeword c of length n is $(f(1), f(2), \ldots, f(n))$, where the polynomial function is computed over a certain finite field. As long as k elements of the codeword are available, the polynomial function f, and hence the message m, can be recovered.

Figure 2.4: An encoding example of a Reed Solomon code. The message defines a polynomial function and a codeword is defined by the values of the polynomial on n points.

Figure 2.5: An example of a storage system that uses a linear code. The central authority encodes the messages into a codeword $(C_1, C_2, C_3)$ and sends a distinct symbol to each storage server for storage.
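The Reed-Solomon construction described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in [16]: the field is taken to be the prime field GF(257), decoding uses plain Lagrange interpolation and handles erasures only, and the names `P`, `rs_encode`, `rs_decode`, and `poly_mul_linear` are hypothetical.

```python
P = 257  # assumed small prime; all arithmetic is over GF(P), so n must be below P


def rs_encode(message, n):
    """Evaluate f(x) = m1 + m2*x + ... + mk*x^(k-1) at x = 1..n over GF(P)."""
    def f(x):
        acc = 0
        for coeff in reversed(message):  # Horner's rule
            acc = (acc * x + coeff) % P
        return acc
    return [(x, f(x)) for x in range(1, n + 1)]


def poly_mul_linear(poly, root):
    """Multiply a polynomial (low-order coefficient first) by (x - root)."""
    out = [0] * (len(poly) + 1)
    for i, c in enumerate(poly):
        out[i] = (out[i] - root * c) % P
        out[i + 1] = (out[i + 1] + c) % P
    return out


def rs_decode(points, k):
    """Recover the k message symbols from any k pairs (x, f(x))."""
    points = points[:k]
    coeffs = [0] * k
    for j, (xj, yj) in enumerate(points):
        basis = [1]   # numerator of the j-th Lagrange basis polynomial
        denom = 1     # its denominator, prod over i != j of (xj - xi)
        for i, (xi, _) in enumerate(points):
            if i != j:
                basis = poly_mul_linear(basis, xi)
                denom = denom * (xj - xi) % P
        scale = yj * pow(denom, P - 2, P) % P  # Fermat inverse of denom
        for d in range(k):
            coeffs[d] = (coeffs[d] + scale * basis[d]) % P
    return coeffs


# k = 3 message symbols encoded into n = 5 codeword symbols.
message = [72, 101, 108]
codeword = rs_encode(message, 5)

# Two symbols are erased; any three of the remaining ones suffice.
survivors = [codeword[0], codeword[3], codeword[4]]
decoded = rs_decode(survivors, 3)
```

Each codeword symbol is kept together with its evaluation point x, since the decoder needs to know which positions survived.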

In a storage system that uses a linear code to provide data robustness, the central authority encodes messages and recovers them. Figure 2.5 shows an example with 2 messages and 3 storage servers, where the coding is performed over a finite field. After the central authority receives the two messages, it generates a generator matrix and encodes the messages via the generator matrix. The codeword contains 3 symbols and each storage server stores one of them. After the central authority retrieves 2 out of the 3 codeword symbols, it performs the decoding process and sends the messages back to the user.

Figure 2.6: The EVENODD encoding. The data are represented as a (p − 1) × p table. An entry is a bit of the data. After the encoding, the codeword is represented as a (p − 1) × (p + 2) table. Each storage server stores one column of the resulting table.
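The 2-message, 3-server example can be made concrete with a small Python sketch. The generator matrix `G`, the prime field GF(257), and the function names are assumptions chosen for illustration; the source does not specify them. Any two columns of this particular `G` are linearly independent, so any two servers suffice for decoding.

```python
P = 257  # assumed prime field size

# Hypothetical 2 x 3 generator matrix; column j belongs to storage server j.
G = [
    [1, 0, 1],
    [0, 1, 2],
]


def encode(messages, generator):
    """Codeword symbol j is sum_i messages[i] * generator[i][j] over GF(P)."""
    width = len(generator[0])
    return [
        sum(m * row[j] for m, row in zip(messages, generator)) % P
        for j in range(width)
    ]


def decode_from_two(symbols, generator):
    """Recover both messages from exactly two symbols given as {column: value}."""
    (j1, y1), (j2, y2) = sorted(symbols.items())
    a, b = generator[0][j1], generator[1][j1]  # y1 = a*m1 + b*m2
    e, f = generator[0][j2], generator[1][j2]  # y2 = e*m1 + f*m2
    det = (a * f - b * e) % P
    if det == 0:
        raise ValueError("the two chosen columns are linearly dependent")
    det_inv = pow(det, P - 2, P)
    m1 = (f * y1 - b * y2) * det_inv % P
    m2 = (a * y2 - e * y1) * det_inv % P
    return [m1, m2]


messages = [42, 200]
codeword = encode(messages, G)

# Server 0 is unavailable; decode from servers 1 and 2.
decoded = decode_from_two({1: codeword[1], 2: codeword[2]}, G)
```

Decoding here is just the inversion of the 2 × 2 sub-matrix formed by the columns of the available servers.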

XOR Operation based Erasure Codes

Some erasure codes are proposed for their excellent performance: they use only exclusive-or operations. As a result, storage systems that use them also perform well in the data storage and retrieval processes. The EVENODD code [17, 18] and the STAR code [19] are proposed for tolerating 2 and 3 erasure errors, respectively. The encoding of the EVENODD codes and the STAR codes is efficient, but the codes can tolerate only a constant number of erasure errors. Figure 2.6 illustrates the EVENODD encoding and Figure 2.7 shows how a system stores the encoded data. There are (p + 2) storage servers $SS_1, SS_2, \ldots, SS_{p+2}$ in the system and each of them stores one column of data, which is (p − 1) bits. The STAR codes use an additional parity column for tolerating one more erasure error. Figure 2.8 shows the STAR encoding and how a system stores the encoded data.

Figure 2.7: A system using the EVENODD encoding. Each storage server stores a column of the encoded table. This storage system tolerates the failure of two servers.

Many low-density parity-check (LDPC) codes also have more efficient encoding and decoding algorithms than other erasure codes that use linear algebraic operations. A networked storage system that uses a systematic LDPC code is described as follows. Figure 2.9 shows an overview of the storage system. Assume that there are n messages. The storage servers are divided into two groups R and L. The R group consists of n storage servers, each of which keeps one message. The L group contains the rest of the storage servers, each of which stores a parity over a subset of the n messages.

Figure 2.8: The STAR encoding and a system storing the encoded data. STAR codes tolerate the failure of 3 servers. There are 3 columns of parity bits.

Figure 2.9: A networked storage system that uses a LDPC code. The system contains 7 storage servers, 4 of which store the original data. The other 3 storage servers store the parity results of the data.

Tornado codes [20, 21] are also a class of LDPC erasure codes that have fast encoding and decoding algorithms. Tornado codes, illustrated in Figure 2.10, use irregular bipartite graphs as the encoding structures. The archival storage [22] uses Tornado codes as the fault-tolerance technique.

Different LDPC codes define different policies on the L group, i.e. how many and which messages are used to produce each parity. The experimental results in [23] show how differently LDPC codes perform in terms of robustness. When the number of message symbols is below 100, the systematic LDPC codes perform better (fewer codeword symbols are required for a successful decoding). However, when the number of message symbols is equal to or greater than 100, irregular repeat-accumulate LDPC codes perform better.

Figure 2.10: The encoding structure of Tornado codes. Tornado codes use cascaded irregular bipartite graphs. The k input message bits are listed as the k left vertices of the first bipartite graph. The right vertices of the first bipartite graph are ak parity bits. A similar encoding process is performed on the remaining L − 1 cascaded bipartite graphs. The decoding algorithm is simple. For each right node for which all but one of its neighbors are known, the missing neighbor can be recovered. The decoding algorithm terminates successfully when all k input message bits are recovered.
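The decoding rule stated in the caption of Figure 2.10 (recover the missing neighbor of any parity node whose other neighbors are all known) can be sketched in Python as follows. This is a simplified illustration, not a full Tornado decoder: it assumes a single bipartite graph rather than a cascade, assumes that every parity bit is available, and uses a hypothetical function name and a hand-picked graph.

```python
def peel_decode(known, checks):
    """Recover erased message bits by repeatedly applying the peeling rule.

    known:  dict mapping message index -> bit, for the surviving message bits.
    checks: list of (neighbor_indices, parity_bit), where parity_bit is the
            XOR of the message bits at neighbor_indices.
    Returns a dict containing every bit that peeling could determine.
    """
    bits = dict(known)
    progress = True
    while progress:
        progress = False
        for neighbors, parity in checks:
            missing = [i for i in neighbors if i not in bits]
            if len(missing) == 1:
                # XOR the parity with all known neighbors to get the missing one.
                value = parity
                for i in neighbors:
                    if i in bits:
                        value ^= bits[i]
                bits[missing[0]] = value
                progress = True
    return bits


# Original message bits and three parity nodes over hand-picked subsets.
message = [1, 0, 1, 1]
graph = [[0, 1], [1, 2, 3], [0, 3]]
checks = []
for neighbors in graph:
    parity = 0
    for i in neighbors:
        parity ^= message[i]
    checks.append((neighbors, parity))

# Bits 1 and 3 are erased; bits 0 and 2 survive.
recovered = peel_decode({0: message[0], 2: message[2]}, checks)
```

Peeling can stall when every remaining parity node has two or more unknown neighbors, which is why the degree distribution of the graph matters for these codes.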

2.2 Decentralized Data Management

Because no central authority arranges how data are distributed among and retrieved from the whole storage system, each storage server independently manages its stored data without global information about the networked storage system. Although data management without a central authority is much more complicated, the decentralized architecture has many advantages in practice. A decentralized architecture for storage systems offers good scalability since any storage server can join or leave without the control of a central authority. The architecture can support a global-scale system and the resulting massive storage space. Peer-to-peer networks are major concrete examples of the decentralized architecture.

In this section, we describe several decentralized networked storage systems that use different robust storage technologies.

2.2.1 Mirrors and Replicas

When there is no central authority, a datum can be flooded into the storage system and each (available) storage server stores a copy of it. PAST [24] leaves the choice of a replication factor, i.e. the number of replicas, to the data owner. Farsite [10] is a reliable file system that uses random replication to support long-term data persistence. Glacier [25] is a distributed storage system that uses massive replication to provide data robustness across large-scale correlated failures. Many strategies aim to arrange where a replica is stored such that this replica can be found when needed. Carbonite [26] is a replication algorithm for keeping data durable at a low cost. It contributes to distributed storage prototypes such as Antiquity [27], OverCite [28], and UsenetDHT [29].

Consider a peer-to-peer network as a special case of a networked storage system. Replica placement and quality of availability form a large research area in peer-to-peer networks. For example, a repair mechanism handles how replicas should be redistributed when a peer leaves the network. In addition to the static control over the number of available replicas, proactive replication [30, 31] provides a better guarantee of data durability. The replication algorithm in [30] produces replicas periodically. The replication algorithm in [31] generates replicas by predicting machine availability.

2.2.2 Erasure Codes

As in the centralized architecture, applying an erasure code lowers the storage overhead while the resulting system retains data robustness. Technologies such as fountain codes, decentralized erasure codes, and random linear codes are proposed for decentralized storage systems [32, 33, 34, 35, 36, 37, 38].

XOR Operation based Erasure Codes

Fountain codes enable a robust storage technique. Luby transform (LT) codes and Raptor codes [39, 40] are two classes of fountain codes. In a fountain code, the message symbols and codeword symbols can be modeled as vertices in a random bipartite graph. The degree of each vertex is drawn at random from some probability distribution. Different distributions define different codes. Figure 2.11 gives an overview of fountain codes. In LT codes, each codeword symbol is the exclusive-or of d message symbols, where the value d is drawn at random according to the ideal Soliton distribution or the robust Soliton distribution. Raptor codes are a class of fountain codes with linear encoding and decoding complexity. They use a modification of the ideal Soliton distribution for the degree d. Distributed storage systems using LT codes [35] and Raptor codes [37, 38] have been proposed for wireless sensor networks.

Figure 2.11: An abstract overview of a fountain code. The data A, B, C and D are randomly distributed according to some probabilistic distribution. Different codes define different distributions.
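The LT encoding step can be sketched as follows. The sketch assumes the ideal Soliton distribution in its standard form, with probability 1/k for degree 1 and 1/(d(d − 1)) for each degree d from 2 to k; the robust variant and the decoder are omitted, and the function names and the sample message are hypothetical.

```python
import random


def ideal_soliton(k):
    """Return the ideal Soliton probabilities for degrees 1..k."""
    return [1.0 / k] + [1.0 / (d * (d - 1)) for d in range(2, k + 1)]


def lt_encode_symbol(message, rng):
    """Produce one LT codeword symbol as (neighbor indices, XOR value)."""
    k = len(message)
    # Draw the degree d, then pick d distinct message symbols uniformly.
    degree = rng.choices(range(1, k + 1), weights=ideal_soliton(k))[0]
    neighbors = rng.sample(range(k), degree)
    value = 0
    for i in neighbors:
        value ^= message[i]
    return sorted(neighbors), value


# Four message symbols, standing in for the data A, B, C and D of Figure 2.11.
message = [0x41, 0x42, 0x43, 0x44]
rng = random.Random(0)  # fixed seed so that repeated runs agree
symbols = [lt_encode_symbol(message, rng) for _ in range(6)]
```

Because each symbol is generated independently of the others, an encoder can keep producing symbols for as long as needed, which is the property that gives fountain codes their name.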

Algebraic Operation based Erasure Codes

A random linear code is a linear code with a random generator matrix. Each storage server can randomly determine a column of the generator matrix. The result in [34] shows that a square sub-matrix of the generator matrix is invertible with overwhelming probability. That is, a user can successfully retrieve his data with overwhelming probability. Figure 2.12 shows an example of a small storage system that uses a random linear code. There are 6 storage servers $SS_1, SS_2, \ldots, SS_6$ in the system. After receiving some message symbols, a storage server randomly picks coefficients and linearly combines the received message symbols. The storage server $SS_1$ stores the codeword symbol $C_1$ and the coefficients $x_{1a}$, $x_{1b}$, $x_{1c}$ and $x_{1d}$, where $x_{1d} = 0$.

Figure 2.12: A networked storage system using a random linear code. Each storage server selects a random coefficient for each received message. The linear combination is performed by the storage server. Each column of the generator matrix and each codeword symbol are determined and computed by a storage server. No central authority is required in the encoding process.

The number of codeword symbols to which a datum contributes can influence the probability of a successful retrieval. Therefore, a decentralized erasure code [34] is proposed for certain settings. The system is summarized in Figure 2.13. In this approach, to store k data, a user makes v copies of each datum and distributes them to randomly selected storage servers in the storage system. After receiving those data, each storage server encodes them and stores the result. To retrieve the k data, the user randomly queries k storage servers for their encoding results and tries to decode the k data. It has been shown that for n storage servers and k messages, if $n = ak$ and $v = bk \ln k$ where $b > 5a$, the probability of a successful retrieval is at least $1 - k/p - o(1)$, where p is the size of the group used [34].
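The storage and retrieval procedure above can be simulated with a short Python sketch. It is an illustration under several assumptions rather than the construction of [34]: the field is the prime field GF(257), each datum is sent to v distinct servers, only the coefficient vectors are tracked (retrieval succeeds exactly when the k queried vectors are linearly independent), and the values of k, n, and v are picked by hand rather than from the bound quoted above. All names are hypothetical.

```python
import random

P = 257  # assumed prime field size


def rank_mod_p(rows, p=P):
    """Rank of a matrix over GF(p), computed by Gaussian elimination."""
    rows = [r[:] for r in rows]
    rank = 0
    ncols = len(rows[0]) if rows else 0
    for col in range(ncols):
        pivot = next((r for r in range(rank, len(rows)) if rows[r][col] % p), None)
        if pivot is None:
            continue
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        inv = pow(rows[rank][col], p - 2, p)
        rows[rank] = [x * inv % p for x in rows[rank]]
        for r in range(len(rows)):
            if r != rank and rows[r][col]:
                factor = rows[r][col]
                rows[r] = [(x - factor * y) % p for x, y in zip(rows[r], rows[rank])]
        rank += 1
    return rank


def store(k, n, v, rng):
    """Send each of k data to v random servers; return the n coefficient vectors."""
    vectors = [[0] * k for _ in range(n)]
    for item in range(k):
        for server in rng.sample(range(n), v):
            # The server picks its own random coefficient for this datum.
            vectors[server][item] = rng.randrange(P)
    return vectors


def retrieval_succeeds(k, n, v, rng):
    """Query k random servers and test whether their vectors allow decoding."""
    vectors = store(k, n, v, rng)
    queried = rng.sample(vectors, k)
    return rank_mod_p(queried) == k


# Empirical success rate for one hand-picked setting (k = 8, n = 16, v = 12).
rng = random.Random(2010)
trials = 200
rate = sum(retrieval_succeeds(8, 16, 12, rng) for _ in range(trials)) / trials
```

Lowering v makes the coefficient vectors sparser and the queried matrix more likely to be singular, which is the trade-off between communication cost and retrieval probability that the parameter setting controls.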

2.2.3 Hybrid Strategy

The hybrid strategy mixes replication with an erasure code. It trades off the advantages and disadvantages of the two approaches. Many well-known and recent decentralized networked storage systems, such as OceanStore [41] and Total Recall [42], take the hybrid strategy. The storage servers are divided into primary storage servers that keep the replicas and secondary storage servers that store the encoding results.

Figure 2.13: A networked storage system using a decentralized erasure code. The settings of the parameters n, k, and v and the random process of data distribution determine the probability of a successful data retrieval.

Figure 2.14: A networked storage system that uses a hybrid strategy. The storage system simultaneously stores the data and the resulting symbols of the erasure coding. The number of replicas of the data and the number of symbols of the erasure code influence the data availability of the storage system.

The ratio between the two kinds of storage servers may influence the data availability. Some research [43, 44, 45] has been conducted to find an approximately optimal ratio that provides practical long-term data availability. Figure 2.14 gives a layout for erasure coding with a primary copy. The experimental results in [45] are produced by running experiments on real traces of large-scale distributed storage systems (e.g. Farsite [10] and Overnet [46]).

2.3 Challenge of Data Confidentiality

Once the storage servers are located on the network, control of the storage servers is no longer in the data owner's hands. Therefore, data confidentiality is receiving increasing attention.

2.3.1 Cleartext Storage

Early networked storage systems were proposed for robust storage with simple access control mechanisms. The data are stored in cleartext. When the storage system uses an erasure code for data robustness and each storage server stores a codeword symbol, the system has a certain degree of data confidentiality because fewer than k storage servers cannot recover the original message, where the message has k symbols.

2.3.2 Symmetric Encryption

As the hardware and software of storage servers improved, the storage servers became able to handle more computation. Many newer networked storage systems provide stronger data confidentiality by storing data in encrypted form. Once a storage system receives data from the owner, the data are encrypted with a symmetric encryption scheme such as DES or AES before being stored on the physical drives. Blaze's CFS [6], TCFS [8], StegFS [7], and NCryptfs [9] are file systems that encrypt data before writing them to storage drives. These file systems only protect the stored data at rest and assume that the storage servers are fully trusted.


Figure 2.15: File 1 before and after being encrypted in a Plutus system. The file is divided into blocks and each file block is encrypted by using a distinct symmetric key. All symmetric keys are encrypted by using another symmetric key MK. The encrypted symmetric keys are stored with the encrypted file blocks. A user may use a different MK for each file.

Plutus [14] and Tahoe [36], use encryption schemes to protect data confidentiality against both internal and external attackers. In OceanStore, all information that enters the system must be encrypted, while the owner manages the access control. For example, when the owner wants to share a datum with others, he needs to distribute the symmetric key to the authorized readers. Similarly, in both Plutus and Tahoe, a user needs to encrypt files with distinct symmetric keys and manage all of the keys by himself. A file before and after being encrypted in Plutus is illustrated in Figure 2.15. A recent encryption service [47] is provided for cloud storage users of the Amazon Simple Storage Service. The encryption service encrypts the user's data with AES and stores the ciphertext in the cloud storage system for the user. Again, this application assumes that the servers that encrypt the data are fully trusted.


when the storage servers are honest and secure or when users take responsibility for managing a huge number of symmetric keys. In most cases, the storage servers are assumed to follow the user-defined access policy on the stored data and to keep the stored data in encrypted form at all times. Trusting all storage servers is sometimes unrealistic, especially when the storage system is decentralized and distributed over a large geographic area. Any one of the storage servers could be vulnerable to internal or external attacks. On the other hand, the burden of key management on users should be reduced or moved to the servers. Hence, stronger data confidentiality with low overhead on users is required: the data should remain secret even if all storage servers are compromised, and users should store as few keys as possible and spend as little effort as possible on key management.

2.3.3 Public Key Encryption

Applying a public key encryption scheme in a centralized networked storage system gives a straightforward solution to the data confidentiality issue. A user encrypts the data and then stores the ciphertext in the system. The central authority simply treats the ciphertext as raw data, just as in the non-encrypted case. Similarly, strong data confidentiality is also achievable in a decentralized system that uses replication. For instance, Farsite [10] uses hybrid encryption to protect data confidentiality and provide an access control mechanism. In Farsite, a datum is first encrypted with a symmetric encryption scheme, and the symmetric key is then encrypted with the owner's public key. The user only needs to store his secret key for all of his data. When he wants to share some data with another user, he encrypts the corresponding symmetric keys with the authorized user's public key and stores the ciphertext in the storage system. The overhead of key management is mainly moved to the servers, because all keys except the user's secret key are stored in the storage servers (in encrypted form).

2.3.4 Motivation

To the best of my knowledge, little research addresses data confidentiality against the collusion of all storage servers in a decentralized networked storage system that uses erasure codes. This is the gap that my results fill. We provide a secure cloud storage system that offers strong data confidentiality in a decentralized environment and good data availability by using erasure codes. Our key technique is the combination of a public key encryption scheme and a variant of random linear codes. As a result, the data are stored in an encrypted and encoded form in each storage server, and no storage server has the decryption key. Access rights are controlled entirely by the data owner. Data confidentiality is guaranteed even if all storage servers are corrupted at the same time.


Chapter 3 Erasure Codes and System Models

In this chapter, we introduce our basic algebraic notation and the erasure codes we use. The special erasure code is one of our key techniques for achieving both robustness and parallelism in our cloud storage system. We assume that there is no central authority in the collection of storage servers, and we introduce our first system model and an advanced one. We also describe the threat model used to measure the security of the cloud storage systems.

3.1 Bilinear Map and Assumptions

Bilinear map. Let $\mathbb{G}_1$ and $\mathbb{G}_2$ be cyclic multiplicative groups of prime order $p$, and let $g \in \mathbb{G}_1$ be a generator. A polynomial-time computable map $\tilde{e} : \mathbb{G}_1 \times \mathbb{G}_1 \to \mathbb{G}_2$ is a bilinear map if it has bilinearity and non-degeneracy: for any $x, y \in \mathbb{Z}_p$, $\tilde{e}(g^x, g^y) = \tilde{e}(g, g)^{xy}$, and $\tilde{e}(g, g)$ is not the identity element of $\mathbb{G}_2$. In fact, $\tilde{e}(g, g)$ is a generator of $\mathbb{G}_2$. Let $\mathrm{Gen}(1^\lambda)$ be an algorithm generating $(p, \mathbb{G}_1, \mathbb{G}_2, \tilde{e}, g)$, where $\lambda$ is the length of $p$. Let $x \in_R X$ denote that $x$ is randomly chosen from the set $X$.

Bilinear Diffie-Hellman assumption. With the above parameters, given $g, g^x, g^y, g^z$, where $x$, $y$, and $z$ are randomly chosen from $\mathbb{Z}_p$, the bilinear Diffie-Hellman problem is to find $\tilde{e}(g, g)^{xyz}$. The assumption is that it is hard to solve the problem with a significant probability in polynomial time. Formally, for any probabilistic polynomial time algorithm $\mathcal{A}$, the following probability is negligible (in $\lambda$):

$$\Pr\left[\mathcal{A}(g, g^x, g^y, g^z) = \tilde{e}(g, g)^{xyz} : x, y, z \in_R \mathbb{Z}_p\right]$$

Decisional Bilinear Diffie-Hellman assumption. This assumption is that, given $g, g^x, g^y, g^z$, it is hard to distinguish $\tilde{e}(g, g)^{xyz}$ from a random element of $\mathbb{G}_2$. Formally, for any probabilistic polynomial time algorithm $\mathcal{A}$, the following is negligible (in $\lambda$):

$$\left| \Pr\left[\mathcal{A}(g, g^x, g^y, g^z, Q_b) = b : x, y, z, r \in_R \mathbb{Z}_p;\ Q_0 = \tilde{e}(g, g)^{xyz};\ Q_1 = \tilde{e}(g, g)^{r};\ b \in_R \{0, 1\}\right] - \frac{1}{2} \right|$$

3.2 Erasure Codes over Exponents

The erasure codes we use can be seen as a variant of traditional random linear codes. In this section, we briefly review random linear codes and present the erasure codes over exponents.

3.2.1 Random Linear Codes

Let the message be $\vec{I} = (m_1, m_2, \ldots, m_k)$, the generator matrix be $G = [g_{i,j}]_{1 \le i \le k,\, 1 \le j \le n}$, and the codeword be $\vec{O} = (w_1, w_2, \ldots, w_n)$. The elements of $\vec{I}$ and $\vec{O}$ and the entries of $G$ are all over a finite field $F$ of size $p$. The generator matrix of a random linear code has random entries from the finite field. As a result, each element of $\vec{O}$ is a linear combination of the elements of $\vec{I}$, where the coefficients are randomly chosen from $F$.

A decentralized erasure code [34] is a random linear code with a sparse generator matrix. The generator matrix $G$ of a decentralized erasure code is constructed by an encoder as follows. First, for each row, the encoder randomly marks an entry as 1 and repeats this process $a n \ln k / k$ times with replacement (an entry can be marked multiple times), where $a$ is a constant. Second, the encoder randomly sets a value from $F$ for each marked entry. The encoding process is expressed as $\vec{I} \cdot G = \vec{O}$. As for the decoding, a decoder receives $k$ columns $j_1, j_2, \ldots, j_k$ of $G$ and the corresponding codeword elements $w_{j_1}, w_{j_2}, \ldots, w_{j_k}$. The decoding process is computed as follows:

$$[m_1, m_2, \ldots, m_k] = [w_{j_1}, w_{j_2}, \ldots, w_{j_k}] \begin{bmatrix} g_{1,j_1} & g_{1,j_2} & \cdots & g_{1,j_k} \\ g_{2,j_1} & g_{2,j_2} & \cdots & g_{2,j_k} \\ \vdots & \vdots & \ddots & \vdots \\ g_{k,j_1} & g_{k,j_2} & \cdots & g_{k,j_k} \end{bmatrix}^{-1}$$


Figure 3.1: A storage system using the random erasure code. Messages $M_1$ and $M_2$ are randomly distributed to storage servers $SS_1$, $SS_2$, and $SS_3$. The storage server $SS_1$ randomly selects a coefficient $g_{1,1}$ for the received message. Similarly, the storage servers $SS_2$ and $SS_3$ individually select coefficients.

The decoding succeeds if the $k \times k$ submatrix of $G$ formed by the $k$ chosen columns is invertible. Thus, the probability of a successful decoding is the probability that the chosen submatrix is invertible. It has been shown in [34] that this probability is at least $1 - k/p - o(1)$, where the randomness comes from the random choices of marked entries, the random values for marked entries, and the random choice of the $k$ columns.
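The construction can be illustrated with a minimal Python sketch, which builds a sparse generator matrix by the marking procedure, encodes a message over $F = \mathbb{Z}_p$, and tests whether a chosen $k \times k$ submatrix is invertible. The function names, the toy prime, and the value of the constant `a` are assumptions made for illustration only; they are not taken from [34]. The sketch requires Python 3.8 or later for modular inverses via `pow`.

```python
import math
import random


def sparse_generator_matrix(k, n, p, a, rng):
    """Build a k x n generator matrix with about a*n*ln(k)/k marks per row."""
    marks = math.ceil(a * n * math.log(k) / k)
    G = [[0] * n for _ in range(k)]
    for row in G:
        # Marking is done with replacement, so repeated marks collapse.
        marked = {rng.randrange(n) for _ in range(marks)}
        for j in marked:
            row[j] = rng.randrange(p)  # random value from F for a marked entry
    return G


def encode(message, G, p):
    """Compute the codeword O = I * G over Z_p."""
    k, n = len(G), len(G[0])
    return [sum(message[i] * G[i][j] for i in range(k)) % p for j in range(n)]


def is_invertible(K, p):
    """Gaussian elimination modulo the prime p; True if K has full rank."""
    M = [[v % p for v in row] for row in K]
    k = len(M)
    for col in range(k):
        pivot = next((r for r in range(col, k) if M[r][col]), None)
        if pivot is None:
            return False
        M[col], M[pivot] = M[pivot], M[col]
        inv = pow(M[col][col], -1, p)
        for r in range(col + 1, k):
            f = M[r][col] * inv % p
            M[r] = [(x - f * y) % p for x, y in zip(M[r], M[col])]
    return True
```

For example, with the hypothetical matrix `[[3, 5, 0], [0, 7, 2]]` over $\mathbb{Z}_{1019}$, the message `[1, 2]` is encoded as `[3, 19, 4]`, and any two of the three columns form an invertible submatrix, so any two codeword elements suffice for decoding.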

3.2.2 Random Erasure Codes over Exponents

We fix a cyclic multiplicative group $\mathbb{G}$ of prime order $p$. The message domain is $\mathbb{G}$. The generator matrix $G$ is generated in the same way as for the decentralized erasure code above, except that the entries of $G$ are over $\mathbb{Z}_p$. The encoding process generates $w_1, w_2, \ldots, w_n \in \mathbb{G}$, where

$$w_i = m_1^{g_{1,i}} m_2^{g_{2,i}} \cdots m_k^{g_{k,i}}.$$

An example is shown in Figure 3.1, in which 2 messages are stored in 3 storage servers. The first step of the decoding process is to compute the inverse of a $k \times k$ submatrix $K$ of $G$. Let $K^{-1} = [d_{i,j}]_{1 \le i,j \le k}$. The second step of the decoding process is to compute

$$m_i = w_{j_1}^{d_{1,i}} w_{j_2}^{d_{2,i}} \cdots w_{j_k}^{d_{k,i}},$$

where $j_1, j_2, \ldots, j_k$ are the indices of the columns of $K$ in $G$. Therefore, a sufficient condition for a successful decoding of this variant of the decentralized erasure code is that the $k \times k$ submatrix $K$ is invertible. As with the decentralized erasure code, the probability of a successful decoding is at least $1 - k/p - o(1)$. Since the decoder only requires $k$ columns of $G$ and their corresponding codeword elements to decode, this code is resilient to $(n - k)$ erasure errors. Moreover, the code is decentralized because each codeword element $w_i$ can be independently generated.

A distributed networked storage system with $n$ servers uses a random erasure code as follows. The owner wants to store $k$ messages $M_i$, $1 \le i \le k$. For each $M_i$, the owner randomly selects $v$ servers with replacement and sends a copy of $M_i$ to each of them. Each server randomly selects a coefficient for each received message and performs a linear combination of all received messages. The coefficients chosen by a server form a column of the matrix, and the result of the linear combination is a codeword element. Because there are $n$ servers, a $k \times n$ generator matrix and a codeword are implicitly formed. Each server can perform the computation independently, which makes the code decentralized.
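The encoding and decoding over exponents can be sketched in Python as follows. This is a toy illustration under stated assumptions: the group $\mathbb{G}$ is taken to be the order-$p$ subgroup of $\mathbb{Z}_q^*$ with the small primes $p = 1019$ and $q = 2p + 1 = 2039$, the generator matrix is fixed by hand, and all function names are hypothetical. Real parameters would be far larger. Python 3.8 or later is assumed for `pow(x, -1, p)`.

```python
P = 1019  # toy prime group order p (assumption, far too small for real use)
Q = 2039  # prime modulus with Q = 2P + 1; G is the order-P subgroup of Z_Q^*


def invert_matrix_mod(K, p):
    """Invert a square matrix modulo the prime p by Gauss-Jordan elimination."""
    k = len(K)
    aug = [[K[i][j] % p for j in range(k)] + [int(i == j) for j in range(k)]
           for i in range(k)]
    for col in range(k):
        pivot = next((r for r in range(col, k) if aug[r][col]), None)
        if pivot is None:
            raise ValueError("submatrix is not invertible; decoding fails")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        inv = pow(aug[col][col], -1, p)
        aug[col] = [v * inv % p for v in aug[col]]
        for r in range(k):
            if r != col and aug[r][col]:
                f = aug[r][col]
                aug[r] = [(x - f * y) % p for x, y in zip(aug[r], aug[col])]
    return [row[k:] for row in aug]


def encode(msgs, G):
    """w_j = prod_i m_i^{g_{i,j}}: server j only needs its own column j."""
    k, n = len(G), len(G[0])
    codeword = []
    for j in range(n):
        w = 1
        for i in range(k):
            w = w * pow(msgs[i], G[i][j], Q) % Q
        codeword.append(w)
    return codeword


def decode(symbols, cols, G):
    """Recover m_i = prod_l w_{j_l}^{d_{l,i}} from the k symbols at columns cols."""
    k = len(G)
    K = [[G[i][j] for j in cols] for i in range(k)]
    D = invert_matrix_mod(K, P)
    msgs = []
    for i in range(k):
        m = 1
        for l in range(k):
            m = m * pow(symbols[l], D[l][i], Q) % Q
        msgs.append(m)
    return msgs
```

With two messages in the subgroup, for instance `pow(4, 5, Q)` and `pow(4, 77, Q)`, and the hypothetical matrix `[[3, 5, 0], [0, 7, 2]]`, any two of the three codeword elements recover both messages, which mirrors the resilience to $(n - k)$ erasures.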

3.3 System Models

We first consider a basic storage system model that supports the fundamental functions, i.e., data storage and retrieval. Later, we extend the system by adding a data forwarding function. The advanced system model supports data forwarding such that both the forwarding by the owner and the retrieval by the granted user are performed in a confidential way.

Figure 3.2: Our first system model of our secure cloud storage system. Messages are encrypted and then randomly distributed among the storage servers. Each storage server combines the received ciphertexts and stores only the result and the chosen coefficients.

3.3.1 The First System Model

Figure 3.2 provides an overview of our first system model. There are $n$ storage servers $SS_1, SS_2, \ldots, SS_n$ and $m$ key servers $KS_1, KS_2, \ldots, KS_m$. The storage servers provide storage services and the key servers provide key management services. The system consists of 3 phases: system setup, data storage, and data retrieval. They are described as follows.

In the system setup phase, the system chooses and computes public parameters. A user A has his own storage space, his public key $PK_A$, and his secret key $SK_A$. The user A publishes his public key and shares his secret key among a set of key servers of his own choice with a threshold value $t$. As a result, each chosen key server $KS_i$, $1 \le i \le m$, holds a key share $SK_{A,i}$ of the user's secret key $SK_A$.


Figure 3.3: The advanced system model of our secure cloud storage system. User A generates a re-encryption key $RK_{A \to B, Fid}$ and distributes it to all storage servers so that the storage servers re-encrypt the ciphertexts into ones under user B's key.

In the data storage phase, user A stores $k$ messages $M_i$, $1 \le i \le k$, into $n$ storage servers $SS_i$, $1 \le i \le n$. These messages can be thought of as the segments of a file. A assigns a message identifier to these $k$ messages. Each message $M_i$ is encrypted under the public key $PK_A$ as $C_i = E(PK_A, M_i)$. Then, each ciphertext is sent to $v$ randomly chosen storage servers. Each storage server $SS_i$ combines the received ciphertexts by using the erasure code to form the stored data $\sigma_i$.

In the data retrieval phase, to retrieve the k messages, A instructs the m key servers such that each key server retrieves stored data from u storage servers and does partial decryption for the retrieved data. Then, A collects the partial decryption results, called decryption shares, from the key servers and combines them to recover the k messages.

3.3.2 Advanced System Model

Our advanced system model is illustrated in Figure 3.3. Again, the system model consists of users, $n$ storage servers $SS_1, SS_2, \ldots, SS_n$, and $m$ key servers $KS_1, KS_2, \ldots, KS_m$. The advanced model supports one more important function: data forwarding. Hence, the storage system consists of 4 phases: system setup, data storage, data forwarding, and data retrieval. The only phases that differ from the first system model are data forwarding and data retrieval, which are described as follows.

In the data forwarding phase, user A can forward the data D to another user B. User A computes a re-encryption key from A to B with respect to the data D and sends it to all storage servers. After receiving the re-encryption key, each storage server re-encrypts the data D of user A. The re-encryption operation transforms the ciphertext of D into a ciphertext for B. As a result, the re-encrypted data can be decrypted by using B's secret key. We refer to the originally encrypted data as level-0 ciphertexts and to the re-encrypted data as level-1 ciphertexts.

In the data retrieval phase, user A retrieves the data from the system. The data either belong to the user A or are forwarded to him. First, the user sends a retrieval request to all key servers. Upon receiving the user’s request of retrieval, a key server queries a set of u storage servers. The queried storage servers will send the requested messages (in encrypted and encoded form) and the coefficients back to the key server. After receiving messages from the queried storage servers, the key server performs the partial decryption by using the key share and forwards the results to user A. As long as at least t key servers reply to user A’s request, user A can retrieve messages with an overwhelming probability.

A small example of a commercial company is illustrated in Figure 3.4. A manager A stores his data in the storage system and classifies the data by the clients with whom the data are associated. One day, the manager wants to assign the case of client 1 to his employee B. The manager securely forwards the data associated with client 1 to B. Afterward, only B can access those forwarded data.

Figure 3.4: A storage system using the secure decentralized erasure code.

3.4 Threat Model

In this system model, we consider an attacker who wants to break the data confidentiality of a target user and who colludes with all storage servers and up to $(t - 1)$ key servers. We assume that the attacker does not tamper with the stored data but tries to learn the data content from the stored data. We model this attack by the standard chosen plaintext attack on the underlying encryption scheme in a threshold version.

3.4.1 Model without Forwarding

For our first system model, we extend the standard chosen plaintext attack (CPA) security game to the threshold public key encryption scheme. The threshold CPA security game involves a challenger $\mathcal{C}$ and an attacker $\mathcal{A}$.

• Setup: $\mathcal{C}$ performs the following steps.

– Run Setup(λ) to get $\mu = (p, \mathbb{G}_1, \mathbb{G}_2, \tilde{e}, g)$.

– Run KeyGen(µ) to get a key pair (PK, SK) and run ShareKeyGen on (SK, t, n) to get $SK_i$, $1 \le i \le n$, where t and n are randomly chosen.

– Send (µ, PK, t, n) to $\mathcal{A}$.

• Key share query: $\mathcal{A}$ queries $(t - 1)$ secret key shares from $\mathcal{C}$ and gets $SK_{q_1}, SK_{q_2}, \ldots, SK_{q_{t-1}}$, where $q_1, q_2, \ldots, q_{t-1} \in [1, n]$.

• Challenge: $\mathcal{A}$ chooses two messages $M_0$ and $M_1$, where $M_0 \ne M_1$, and sends them to $\mathcal{C}$. $\mathcal{C}$ encrypts $M_b$ as $C$, where $b$ is randomly selected from $\{0, 1\}$, and sends $C$ to $\mathcal{A}$.

• Output: $\mathcal{A}$ outputs a bit $b'$ as its guess of $b$.

The advantage of $\mathcal{A}$ is defined as $Adv_{\mathcal{A}} = |\Pr[b' = b] - 1/2|$. A threshold public key encryption scheme is CPA secure if and only if, for any probabilistic polynomial time algorithm $\mathcal{A}$, $Adv_{\mathcal{A}}$ is a negligible function in $\lambda$. A cloud storage system in the basic system model is secure if the threshold public key encryption scheme it uses is secure.

3.4.2 Model with Forwarding

For the advanced system model, the security game differs slightly from the one for the first system model. We consider an attacker who wants to break the data confidentiality of a target user with respect to an identifier and who colludes with all storage servers and up to $(t - 1)$ key servers. We model this attack by the standard chosen plaintext attack on the proxy re-encryption scheme in a threshold version.

Figure 3.5: The security game for the chosen plaintext attack. Not only secret key shares but also re-encryption keys are queried by the attacker in the security game.

This game is the same as the previous one except for the key share query. In the key share query, in addition to $(t - 1)$ secret key shares of the target user T, the attacker can also query all re-encryption keys except those from T to other users. In the challenge phase, the attacker can choose the message identifier for each of the chosen messages.

Figure 3.5 shows the full security game for the chosen plaintext attack in a threshold version. The challenger provides the system parameters. After the attacker chooses a target user T, the challenger gives $(t - 1)$ key shares of the secret key $SK_T$ of the target user T to the attacker. This step models the attacker colluding with $(t - 1)$ key servers. Then the attacker can query all re-encryption keys except those from T to other users. In the challenge phase, the attacker chooses two messages $M_0$ and $M_1$ and their identifiers $Fid_0$ and $Fid_1$. The challenger flips a random coin $b$ and encrypts the message $M_b$ with T's public key $PK_T$ and the identifier $Fid_b$. After getting the ciphertext from the challenger, the attacker outputs a bit $b'$ as its guess of $b$. In this game, the attacker wins if and only if $b' = b$. The advantage of the attacker is defined as $|1/2 - \Pr[b' = b]|$.

A cloud storage system in the advanced system model is secure if no probabilistic polynomial time attacker wins the game with a non-negligible advantage.


Chapter 4 Secure Cloud Storage System

As a starting point, we consider a secure cloud storage system that supports the basic functions of data storage and retrieval. We design a threshold public key encryption scheme to protect the confidentiality of the stored data. However, it is not easy to maintain the decentralized structure of the whole network. One difference between our threshold encryption scheme and others is that the partial decryption is performed independently by each key server. In this chapter, we present the threshold public key encryption scheme and the secure storage system that employs it. We also analyze the performance of the storage system and show that it is secure.

4.1 Threshold Public Key Encryption

A threshold public key encryption scheme consists of 6 algorithms: SetUp, KeyGen, ShareKeyGen, Enc, ShareDec, and Combine. SetUp generates the public parameters of the whole system, and KeyGen generates a key pair, consisting of a public key PK and a secret key SK, for each user. Each user uses ShareKeyGen to split his secret key into m secret key shares such that any t of them can recover the secret key. Enc encrypts a given message under a public key PK and outputs a ciphertext. ShareDec partially decrypts a given ciphertext with a secret key share and outputs a decryption share. Combine takes a set of decryption shares as input and outputs the message if and only if there are at least t decryption shares.

Figure 4.1: The primitive encryption scheme. This is the primitive encryption scheme. We modify it into a threshold version.

Figure 4.2: The flowchart of the threshold encryption scheme. The flow of encryption and partial decryption of the threshold encryption scheme. Message M is encrypted as C. The decryption is performed by m partial decryption processes and a final combine process.

Figure 4.1 gives the non-threshold primitive encryption scheme. We modify this primitive and propose a threshold public key encryption scheme Π using bilinear maps as follows. Figure 4.2 shows a flow diagram of the encryption and partial decryption of the encryption scheme.

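• SetUp($1^\lambda$). To generate $\mu$, run $\mathrm{Gen}(1^\lambda)$ and set $\mu = (p, \mathbb{G}_1, \mathbb{G}_2, \tilde{e}, g)$.

• KeyGen($\mu$). To generate a key pair for a user, select $x \in_R \mathbb{Z}_p$ and set $PK = g^x$, $SK = x$.

• ShareKeyGen($SK, t, m$). The secret key shares $SK_i = f(i)$ are derived from the polynomial

$$f(z) = SK + a_1 z + a_2 z^2 + \cdots + a_{t-1} z^{t-1} \pmod{p},$$

where $a_1, a_2, \ldots, a_{t-1} \in_R \mathbb{Z}_p$.

• Enc($PK, M$). To generate a ciphertext $C$ of the message $M \in \mathbb{G}_2$, compute

$$C = (\alpha, \beta, \gamma) = (g^r, h, M \tilde{e}(g^x, h^r)),$$

where $r \in_R \mathbb{Z}_p$ and $h \in_R \mathbb{G}_1$.

• ShareDec($SK_i, C$). Let $C = (\alpha, \beta, \gamma)$. By using the secret key share $SK_i$, a decryption share $\zeta_i$ of $C$ is generated as follows:

$$\zeta_i = (\alpha_i, \beta_i, \beta_i', \gamma_i) = (\alpha, \beta, \beta^{SK_i}, \gamma)$$

• Combine($\zeta_{i_1}, \zeta_{i_2}, \ldots, \zeta_{i_t}$). It combines the $t$ values $(\beta_{i_1}', \beta_{i_2}', \ldots, \beta_{i_t}')$ to obtain $\beta^{SK} = \beta^{f(0)}$ via Lagrange interpolation over exponents:

$$\beta^{SK} = \prod_{i \in S} (\beta_i')^{\prod_{r \in S, r \ne i} \frac{-r}{i - r}},$$

where $S = \{i_1, i_2, \ldots, i_t\}$ and $\zeta_{i_j} = (\alpha_{i_j}, \beta_{i_j}, \beta_{i_j}', \gamma_{i_j})$. The output message is $M = \gamma / \tilde{e}(\alpha, \beta^{f(0)})$.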
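The six algorithms can be traced end to end with a small Python sketch. It is only an algebraic illustration under loudly stated assumptions: no real pairing library is used. Instead, an element $g^a \in \mathbb{G}_1$ is stored as the integer $a \bmod p$ (so discrete logarithms, and hence the secret key, are visible in the clear), and $\tilde{e}(g^a, g^b)$ is computed as $g_T^{ab}$ in the order-$p$ subgroup of $\mathbb{Z}_q^*$ with toy primes $p = 1019$, $q = 2039$. All names are hypothetical, and the model offers no security whatsoever. Python 3.8 or later is assumed.

```python
import random

P = 1019  # toy prime group order p (assumption)
Q = 2039  # prime modulus, Q = 2P + 1
GT = 4    # generator of the order-P subgroup of Z_Q^*, plays the role of e(g, g)


def g1_exp(a, s):
    """(g^a)^s in the toy model, where g^a is stored as the exponent a."""
    return a * s % P


def pair(a, b):
    """Toy pairing: e(g^a, g^b) = e(g, g)^(ab)."""
    return pow(GT, a * b % P, Q)


def key_gen(rng):
    x = rng.randrange(1, P)
    return g1_exp(1, x), x  # (PK = g^x, SK = x)


def share_key_gen(sk, t, m, rng):
    """Shamir shares SK_i = f(i) of a degree-(t-1) polynomial with f(0) = SK."""
    coeffs = [sk] + [rng.randrange(P) for _ in range(t - 1)]
    return {i: sum(c * pow(i, e, P) for e, c in enumerate(coeffs)) % P
            for i in range(1, m + 1)}


def enc(pk, msg, h, rng):
    r = rng.randrange(1, P)
    return g1_exp(1, r), h, msg * pair(pk, g1_exp(h, r)) % Q  # (alpha, beta, gamma)


def share_dec(sk_i, c):
    alpha, beta, gamma = c
    return alpha, beta, g1_exp(beta, sk_i), gamma


def combine(shares):
    """shares maps key-server index i to its decryption share (needs t of them)."""
    ids = list(shares)
    beta_sk = 0  # identity of G1 in the exponent representation
    for i in ids:
        lam = 1  # Lagrange coefficient prod_{r != i} (-r)/(i - r) mod P
        for r in ids:
            if r != i:
                lam = lam * (-r) * pow((i - r) % P, -1, P) % P
        beta_sk = (beta_sk + g1_exp(shares[i][2], lam)) % P  # product in G1
    alpha, _, _, gamma = shares[ids[0]]
    return gamma * pow(pair(alpha, beta_sk), -1, Q) % Q
```

With $t = 3$ and $m = 5$, any three decryption shares of the same ciphertext yield the original message through `combine`, regardless of which three key servers produced them.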

When a fixed $h$ is used for a set of ciphertexts, the set of ciphertexts is multiplicatively homomorphic. The multiplicative homomorphic property is that, given a ciphertext for $M_1$ and a ciphertext for $M_2$, a ciphertext for $M_1 \times M_2$ can be generated without knowing the secret key $x$, $M_1$, or $M_2$. Let $C_1 = \mathrm{Enc}(PK, M_1)$ and $C_2 = \mathrm{Enc}(PK, M_2)$, where

$$C_1 = (g^{r_1}, h, M_1 \tilde{e}(g^x, h^{r_1})) \quad \text{and} \quad C_2 = (g^{r_2}, h, M_2 \tilde{e}(g^x, h^{r_2})).$$

A new ciphertext $C$, which is an encryption of $M_1 \times M_2$ under the public key PK, is computed as follows:

$$C = (g^{r_1} g^{r_2}, h, M_1 \tilde{e}(g^x, h^{r_1}) M_2 \tilde{e}(g^x, h^{r_2})) = (g^{r_1 + r_2}, h, M_1 M_2 \tilde{e}(g^x, h^{r_1 + r_2}))$$
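The homomorphic property can be checked numerically with a short Python sketch. As an explicit assumption, it uses a toy stand-in for the pairing rather than a real one: an element $g^a$ of $\mathbb{G}_1$ is stored as the exponent $a \bmod p$, so in this model the stored public key coincides with the secret exponent $x$, and $\tilde{e}(g^a, g^b)$ is evaluated as $g_T^{ab}$ modulo a small prime. The parameters and function names are hypothetical and insecure; only the algebra is illustrated.

```python
P = 1019  # toy prime group order p (assumption)
Q = 2039  # prime modulus, Q = 2P + 1
GT = 4    # generator of the order-P subgroup of Z_Q^*, stands for e(g, g)


def pair(a, b):
    """Toy pairing on exponents: e(g^a, g^b) = e(g, g)^(ab)."""
    return pow(GT, a * b % P, Q)


def enc(x, msg, h, r):
    """C = (g^r, h, M * e(g^x, h^r)) with G1 elements stored as exponents."""
    return r % P, h, msg * pair(x, h * r % P) % Q


def multiply_ciphertexts(c1, c2):
    """Component-wise product of two ciphertexts that share the same h."""
    if c1[1] != c2[1]:
        raise ValueError("ciphertexts must use the same h")
    # alpha: g^{r1} * g^{r2} is an addition of exponents in the toy model
    return (c1[0] + c2[0]) % P, c1[1], c1[2] * c2[2] % Q


def dec(x, c):
    """M = gamma / e(alpha, beta^x)."""
    alpha, beta, gamma = c
    return gamma * pow(pair(alpha, beta * x % P), -1, Q) % Q
```

Multiplying `enc(x, m1, h, 7)` and `enc(x, m2, h, 11)` gives exactly `enc(x, m1 * m2 % Q, h, 18)`, that is, an encryption of the product with randomness $r_1 + r_2$.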

4.2 System Construction

We assume that there are $n$ storage servers, which store data, and $m$ key servers, which hold secret key shares and perform partial decryption. We consider that the owner has the public key $PK = g^x$ and shares the secret key $x$ among the $m$ key servers with a threshold $t$, where $m \ge t \ge k$. Let the $k$ messages be $M_1, M_2, \ldots, M_k$. We use $h_{ID} = H(M_1 \| M_2 \| \cdots \| M_k)$ as the identifier for this set of messages, where $H : \{0, 1\}^* \to \mathbb{G}_1$ is a secure hash function.


Figure 4.3: The storage process of our first secure cloud storage system. The messages are encrypted and distributed among the storage servers. Each storage server combines all received ciphertexts and stores the result and the chosen coefficients.

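The storage process and the retrieval process are described in the following.

• Storage process. To store $k$ messages, the storage process is as follows:

1. Message encryption. The owner encrypts all $k$ messages via the threshold public key encryption Π with the same $h_{ID}$, where $h_{ID} = H(M_1 \| M_2 \| \cdots \| M_k)$ is the identifier for the set of messages $M_1, M_2, \ldots, M_k$. Let the ciphertext of $M_i$ be

$$C_i = (\alpha_i, \beta, \gamma_i) = (g^{r_i}, h_{ID}, M_i \tilde{e}(g^x, h_{ID}^{r_i})),$$

where $r_i \in_R \mathbb{Z}_p$, $1 \le i \le k$.

2. Ciphertext distribution. For each $C_i$, the owner randomly chooses $v$ storage servers (with replacement) and sends each of them a copy of $C_i$.

3. Decentralized encoding. For all received ciphertexts with the same message identifier $h_{ID}$, the storage server $SS_j$ groups them as $N_j$. $SS_j$ selects a random coefficient $g_{i,j} \in_R \mathbb{Z}_p$ for each $C_i \in N_j$ and sets $g_{i,j} = 0$ for $C_i \notin N_j$. This step forms a generator matrix $G = [g_{i,j}]_{1 \le i \le k,\, 1 \le j \le n}$ of the decentralized erasure code.

Each storage server $SS_j$ computes the following $(A_j, B_j)$,

$$A_j = \prod_{C_i \in N_j} \alpha_i^{g_{i,j}} \quad \text{and} \quad B_j = \prod_{C_i \in N_j} \gamma_i^{g_{i,j}},$$

and stores $\sigma_j = (A_j, h_{ID}, B_j, (g_{1,j}, g_{2,j}, \ldots, g_{k,j}))$.

In fact, $(A_j, h_{ID}, B_j)$ is a ciphertext for $\prod_{1 \le i \le k} M_i^{g_{i,j}}$ since

$$\begin{aligned} (A_j, h_{ID}, B_j) &= \Big( \prod_{C_i \in N_j} (g^{r_i})^{g_{i,j}},\ h_{ID},\ \prod_{C_i \in N_j} \big(M_i \tilde{e}(g^x, h_{ID}^{r_i})\big)^{g_{i,j}} \Big) \\ &= \Big( g^{\sum_{C_i \in N_j} r_i g_{i,j}},\ h_{ID},\ \Big(\prod_{C_i \in N_j} M_i^{g_{i,j}}\Big) \tilde{e}\big(g^x, h_{ID}^{\sum_{C_i \in N_j} r_i g_{i,j}}\big) \Big) \\ &= \Big( g^{\tilde{r}},\ h_{ID},\ \Big(\prod_{C_i \in N_j} M_i^{g_{i,j}}\Big) \tilde{e}\big(g^x, h_{ID}^{\tilde{r}}\big) \Big), \end{aligned}$$

where $\tilde{r} = \sum_{C_i \in N_j} r_i g_{i,j}$.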

• Retrieval process. To retrieve $k$ messages, the retrieval process is as follows:

1. Retrieval command. The owner sends a command to the $m$ key servers with the message identifier $h_{ID}$.

2. Partial decryption. Each key server $KS_i$ randomly queries $u$ storage servers with the message identifier $h_{ID}$ and obtains at most $u$ stored data $\sigma_j$ from the storage servers. Then the key server $KS_i$ performs ShareDec on each received ciphertext with its secret key share $SK_i$ to obtain a decryption share of the ciphertext. Assume that $KS_i$ receives $\sigma_j$. $KS_i$ decrypts the ciphertext $(A_j, h_{ID}, B_j)$ into a decryption share $\zeta_{i,j} = (A_j, h_{ID}, h_{ID}^{SK_i}, B_j)$ and sends the following to the owner:

$$\tilde{\zeta}_{i,j} = (A_j, h_{ID}, h_{ID}^{SK_i}, B_j, (g_{1,j}, g_{2,j}, \ldots, g_{k,j}))$$

3. Combining and decoding. The owner chooses $\tilde{\zeta}_{i_1,j_1}, \tilde{\zeta}_{i_2,j_2}, \ldots, \tilde{\zeta}_{i_t,j_t}$ from all received data $\tilde{\zeta}_{i,j}$ and computes $h_{ID}^{SK} = h_{ID}^{f(0)} = h_{ID}^{x}$ by Lagrange interpolation over exponents, where $i_1, i_2, \ldots, i_t$ are pairwise distinct and $S = \{i_1, i_2, \ldots, i_t\}$:

$$h_{ID}^{x} = \prod_{i \in S} \big(h_{ID}^{SK_i}\big)^{\prod_{r \in S, r \ne i} \frac{-r}{i - r}}$$

If the number of received $\tilde{\zeta}_{i,j}$ is more than $t$, the owner randomly selects $t$ of them. If the number is less than $t$, the retrieval process fails. After obtaining $h_{ID}^{x}$, the owner reconsiders all received data, chooses $\tilde{\zeta}_{i_1,j_1}, \tilde{\zeta}_{i_2,j_2}, \ldots, \tilde{\zeta}_{i_k,j_k}$ with pairwise distinct $j_1, j_2, \ldots, j_k$, and computes $w_j$ for each $(i, j) \in \{(i_1, j_1), (i_2, j_2), \ldots, (i_k, j_k)\}$:

$$w_j = \frac{B_j}{\tilde{e}(A_j, h_{ID}^{x})} = \prod_{C_l \in N_j} M_l^{g_{l,j}} \tag{4.1}$$

The owner then computes

$$K^{-1} = [d_{i,j}]_{1 \le i,j \le k},$$

where $K = [g_{i,j}]_{1 \le i \le k,\, j \in \{j_1, j_2, \ldots, j_k\}}$. If $K$ is not invertible, the retrieval process fails. Otherwise, the owner successfully obtains $M_i$, $1 \le i \le k$, by the following computation:

$$w_{j_1}^{d_{1,i}} w_{j_2}^{d_{2,i}} \cdots w_{j_k}^{d_{k,i}} = M_1^{\sum_{l=1}^{k} g_{1,j_l} d_{l,i}} M_2^{\sum_{l=1}^{k} g_{2,j_l} d_{l,i}} \cdots M_k^{\sum_{l=1}^{k} g_{k,j_l} d_{l,i}} = M_1^{\tau_1} M_2^{\tau_2} \cdots M_k^{\tau_k} = M_i,$$

where $\tau_r = \sum_{l=1}^{k} g_{r,j_l} d_{l,i} = 1$ if $r = i$ and $\tau_r = 0$ otherwise.
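Equation (4.1) can be checked for a single storage server with the following Python sketch. It rests on the same kind of toy model one would use for a blackboard calculation, and this is an assumption, not the real construction: elements of $\mathbb{G}_1$ are stored as exponents modulo a small prime $p$, and the pairing is evaluated as $g_T^{ab}$ modulo a small prime $q$. The function names and numbers are hypothetical, and the model is insecure by design.

```python
P = 1019  # toy prime group order p (assumption)
Q = 2039  # prime modulus, Q = 2P + 1
GT = 4    # generator of the order-P subgroup of Z_Q^*, stands for e(g, g)


def pair(a, b):
    """Toy pairing on exponents: e(g^a, g^b) = e(g, g)^(ab)."""
    return pow(GT, a * b % P, Q)


def encrypt(x, h_id, msg, r):
    """Return (alpha_i, gamma_i) = (g^{r_i}, M_i * e(g^x, h_ID^{r_i}))."""
    return r % P, msg * pair(x, h_id * r % P) % Q


def encode_at_server(received, coeffs):
    """Compute (A_j, B_j) from the received (alpha_i, gamma_i) and g_{i,j}."""
    A, B = 0, 1  # identities of G1 (exponent form) and G2
    for (alpha, gamma), g in zip(received, coeffs):
        A = (A + alpha * g) % P       # product of alpha_i^{g_{i,j}} in G1
        B = B * pow(gamma, g, Q) % Q  # product of gamma_i^{g_{i,j}} in G2
    return A, B


def codeword_symbol(A, B, h_x):
    """w_j = B_j / e(A_j, h_ID^x), as in Equation (4.1)."""
    return B * pow(pair(A, h_x), -1, Q) % Q
```

For two messages $M_1, M_2$ with coefficients 4 and 6, `codeword_symbol` returns $M_1^{4} M_2^{6}$, so decryption indeed commutes with the encoding performed by the storage server.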

An example is given in Figure 4.4. In the ciphertext distribution step, the ciphertext $C_1$ is distributed to $SS_1$, $SS_2$, and $SS_3$. The ciphertext $C_2$ is distributed to $SS_2$ and $SS_3$ only. After receiving $\tilde{\zeta}_{1,1}$, $\tilde{\zeta}_{1,2}$, $\tilde{\zeta}_{2,2}$, and $\tilde{\zeta}_{2,3}$, the owner computes $h_{ID}^{x}$ from $\tilde{\zeta}_{1,1}$ and $\tilde{\zeta}_{2,2}$. By using $h_{ID}^{x}$, the owner computes the encoded messages, $M_1^{g_{1,2}} M_2^{g_{2,2}}$ and $M_1^{g_{1,3}} M_2^{g_{2,3}}$, and decodes them to get messages $M_1$ and $M_2$.


Figure 4.4: A storage system using the secure decentralized erasure code.

Our construction relies on two key techniques. Firstly, the decryption process can be performed before the decoding process. Secondly, the decryption process can be performed by the key servers independently. The first technique comes from the multiplicative homomorphic property of our encryption scheme. For those $k$ messages, a fixed message identifier $h_{ID}$ is used. As a result, the set of ciphertexts is multiplicatively homomorphic. An encoding result of ciphertexts $C_1, C_2, \ldots, C_k$ is also an encryption of an encoding result of messages $M_1, M_2, \ldots, M_k$. As for the second key technique, the design of the encryption scheme embeds the decryption power in the value $h_{ID}^{x}$, where $h_{ID}$ is the message identifier. With $h_{ID}^{x}$, the owner can decrypt all ciphertexts marked with the message identifier $h_{ID}$. A key server $KS_i$ can compute a share $h_{ID}^{SK_i}$ of $h_{ID}^{x}$. With at least $t$ key servers, $h_{ID}^{x}$ can be computed.

4.2.1 Correctness

Correctness means that the owner A correctly retrieves the messages with an overwhelming probability. The correctness of the encryption and decryption for user A is that any ciphertext $C' = (g^r, h_{ID}, w\,\tilde{e}(g^x, h_{ID}^{r}))$ can be correctly decrypted to $w$ with the help of at least $t$ key servers who have shares of $SK_A$. This correctness can be seen from Equation (4.1). The user combines $t$ decryption shares and then correctly gets the encoded message.

4.3 Analysis

We analyze the performance, the probability of a successful retrieval, and the security of the secure cloud storage system.

4.3.1 Performance Analysis

We analyze the computation cost and the storage cost. Let the bit-length of an element of the group $\mathbb{G}_1$ be $l_1$ and that of an element of $\mathbb{G}_2$ be $l_2$.

Computation cost. We measure the computation cost in the number of pairing operations, modular exponentiations in $\mathbb{G}_1$ and $\mathbb{G}_2$, modular multiplications in $\mathbb{G}_1$ and $\mathbb{G}_2$, and arithmetic operations over $GF(p)$. These operations are denoted as Pairing, Exp1, Exp2, Mult1, Mult2, and Fp, respectively. We consider the cost for $k$ messages together since the storage process and the retrieval process are designed for a set of $k$ messages. The cost is listed in Table 4.1. In fact, Fp has a much lower cost than Mult1 and Mult2. One Exp1 costs about $1.5\lceil \log_2 p \rceil$ Mult1 on average (by using the square-and-multiply algorithm). That is, when $p$ is about 1000 bits long, one Exp1 costs about 1500 Mult1 on average. Similarly, one Exp2 costs about $1.5\lceil \log_2 p \rceil$ Mult2 on average.

Since in practice the coefficients can be chosen from a smaller set, the measured computation cost of Exp1 and Exp2 is an overestimate.
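The factor 1.5 can be seen by counting operations in a minimal Python sketch of left-to-right square-and-multiply. The function name and the counting convention (squarings are counted as multiplications, and the leading bit of the exponent costs nothing) are assumptions made for this illustration.

```python
def square_and_multiply(base, exponent, modulus):
    """Left-to-right binary exponentiation.

    Returns (base**exponent % modulus, number of modular multiplications),
    where squarings are counted as multiplications.
    """
    if exponent < 1:
        raise ValueError("exponent must be a positive integer")
    result = base % modulus
    mults = 0
    for bit in bin(exponent)[3:]:  # skip the '0b' prefix and the leading 1 bit
        result = result * result % modulus  # one squaring per remaining bit
        mults += 1
        if bit == "1":
            result = result * base % modulus  # extra multiplication for a 1 bit
            mults += 1
    return result, mults
```

An exponent of bit-length $\ell$ needs $\ell - 1$ squarings plus one multiplication per remaining 1 bit. For a random exponent about half of those bits are 1, which gives roughly $1.5\,\ell$ multiplications on average; an exponent with few 1 bits, such as a coefficient drawn from a small set, costs correspondingly less.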


| Operations | Computation cost |
| --- | --- |
| Message encryption (for k messages) | k Pairing + 2k Exp1 + k Mult2 |
| Decentralized encoding (for each SS) | k Exp1 + k Exp2 + (k − 1) Mult1 + (k − 1) Mult2 |
| Partial decryption (for t KS) | t Exp1 |
| Combining | k Pairing + k Mult2 + O(t^2) Fp |
| Decoding | k^2 Exp2 + (k − 1)k Mult2 + O(k^3) Fp |

- Pairing: a pairing computation of $\tilde{e}$.
- Exp1 and Exp2: a modular exponentiation computation in $\mathbb{G}_1$ and $\mathbb{G}_2$, respectively.
- Mult1 and Mult2: a modular multiplication computation in $\mathbb{G}_1$ and $\mathbb{G}_2$, respectively.
- Fp: an arithmetic operation in $GF(p)$.

Table 4.1: Computation cost of each step in our first secure storage system.

Some improved algorithms [48, 49] have been proposed for accelerating the pairing computation.

In the storage process, for each message encryption, generating αirequires

one Exp1, and generating γi requires one Exp1, one Pairing, and one Mult2.

Hence, in the message encryption step for k messages, the cost is (k Pairing + 2k Exp1 + k Mult2). In the ciphertext distribution step, no computation

occurs. In the encoding step, each SSiencodes all received messages. Here we

use a worse cast estimation that each SSi receives k messages. To compute

Ai, SSi requires k Exp1 and (k − 1) Mult1 while to compute Bi, the cost is k

Exp2 and (k − 1) Mult2.

For the partial decryption step, each $KS_i$ performs one $\mathrm{Exp}_1$ to get $h_{ID}^{SK_i}$. For this step, we consider the total cost of the $t$ key servers, which is $t\,\mathrm{Exp}_1$. We split the combining and decoding step into two sub-steps: the combining sub-step and the decoding sub-step. The combining sub-step includes the computation of $h_{ID}^{x}$ and the computation of the codeword elements $w_j$ from the decryption shares $\tilde{\zeta}_{i,j}$. The computation of $h_{ID}^{x}$ is a Lagrange interpolation over exponents in $G_1$, which requires $O(t^2)\,F_p$, $t\,\mathrm{Exp}_1$, and $(t-1)\,\mathrm{Mult}_1$. Computing $w_j$ from $A_j$, $B_j$, and $h_{ID}^{x}$ requires one Pairing and one modular division, which takes $2\,\mathrm{Mult}_2$. The decoding sub-step includes the matrix inversion and the computation of the messages $M_i$ from the codeword elements $w_j$. The matrix inversion takes $O(k^3)$ arithmetic operations over $GF(p)$, and the decoding of each message takes $k\,\mathrm{Exp}_2$ and $(k-1)\,\mathrm{Mult}_2$.
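The Lagrange interpolation over exponents used in the combining sub-step can be illustrated in a toy group. The sketch below is not the construction itself: it replaces $G_1$ with the order-11 subgroup of $\mathbb{Z}_{23}^*$ generated by 2, and the secret $x = 7$, the sharing polynomial, and the server indices are all hypothetical values chosen for illustration.

```python
# Toy stand-in for G1 (assumption): the order-11 subgroup of Z_23^* generated by 2.
Q = 23  # modulus of the toy group
P = 11  # prime order of the subgroup; exponents live in GF(P)
H = 2   # plays the role of h_ID


def lagrange_at_zero(indices, p):
    """Lagrange coefficients for evaluating f(0); O(t^2) operations in GF(p)."""
    coeffs = {}
    for i in indices:
        num, den = 1, 1
        for j in indices:
            if j != i:
                num = num * (-j) % p
                den = den * (i - j) % p
        coeffs[i] = num * pow(den, -1, p) % p  # modular inverse, Python 3.8+
    return coeffs


def combine_in_exponent(shares, p, q):
    """shares maps index i to h^{SK_i}; returns h^x.

    Costs t exponentiations and t-1 nontrivial multiplications in the group.
    """
    lam = lagrange_at_zero(list(shares), p)
    result = 1
    for i, share in shares.items():
        result = result * pow(share, lam[i], q) % q
    return result


def f(z):
    """Hypothetical sharing polynomial over GF(11) with f(0) = x = 7 and t = 3."""
    return (7 + 3 * z + 5 * z * z) % P


# Each of the t = 3 key servers performs one exponentiation with its share SK_i = f(i).
partial = {i: pow(H, f(i), Q) for i in (1, 3, 4)}
recovered = combine_in_exponent(partial, P, Q)
```

The combiner never learns $x$ or any $SK_i$; it only handles group elements, which is why the cost is counted in exponentiations and multiplications in $G_1$ plus $O(t^2)$ field operations for the coefficients.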

Storage cost. The storage cost in a key server for a user is $\lceil \log_2 p \rceil$ bits because the key server only needs to store the secret key share. The main storage cost lies in the storage servers.

We measure the storage cost in bits as the average cost in a storage server for a message bit. To store $k$ messages, each storage server $SS_j$ stores $(A_j, h_{ID}, B_j)$ and the coefficient vector $(g_{1,j}, g_{2,j}, \ldots, g_{k,j})$. The total cost in a storage server is $(2l_1 + l_2 + k\lceil \log_2 p \rceil)$ bits, where $A_j, h_{ID} \in G_1$ and $B_j \in G_2$; hence, the average cost for a message bit is $(2l_1 + l_2 + k\lceil \log_2 p \rceil)/(k\,l_2)$ bits, which is dominated by $\lceil \log_2 p \rceil / l_2$ for a sufficiently large $k$. In practice, the $g_{i,j}$'s are chosen from a much smaller set than $\mathbb{Z}_p$. Then we can use fewer bits to
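The average storage cost formula can be evaluated numerically to see how quickly it approaches $\lceil \log_2 p \rceil / l_2$. The sketch below assumes hypothetical sizes ($l_1 = 160$, $l_2 = 1024$, $\lceil \log_2 p \rceil = 160$ bits); these numbers are for illustration only and are not taken from the analysis.

```python
def average_cost_per_message_bit(l1, l2, log_p_bits, k):
    """Average bits stored in one storage server per message bit.

    l1, l2: bit lengths of elements of G1 and G2.
    log_p_bits: ceil(log2 p), the size of one coefficient g_{i,j}.
    """
    total_bits = 2 * l1 + l2 + k * log_p_bits  # (A_j, h_ID, B_j) plus k coefficients
    return total_bits / (k * l2)


# Hypothetical parameter sizes (assumptions, not from the source).
L1_BITS, L2_BITS, LOG_P_BITS = 160, 1024, 160

small_k = average_cost_per_message_bit(L1_BITS, L2_BITS, LOG_P_BITS, k=10)
large_k = average_cost_per_message_bit(L1_BITS, L2_BITS, LOG_P_BITS, k=10**6)
limit = LOG_P_BITS / L2_BITS  # the dominating term for large k
```

Under these assumed sizes the fixed part $2l_1 + l_2$ is amortized over $k$ messages, so the per-bit cost falls toward the coefficient-vector term as $k$ grows.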

4.3.2 Successful Retrieval Probability

When $n$ and $k$ are fixed, $u$ and $v$ affect the probability of a successful retrieval. We investigate how these parameters relate to the success probability. The results are given in Theorem 1 and Theorem 2.

To retrieve all $k$ messages, the key servers have to get $k$ stored data $\sigma_{j_1}, \sigma_{j_2}, \ldots, \sigma_{j_k}$ from $k$ different storage servers $SS_{j_1}, SS_{j_2}, \ldots, SS_{j_k}$ and apply ShareDec to acquire $\tilde{\zeta}_{i_1,j_1}, \tilde{\zeta}_{i_2,j_2}, \ldots, \tilde{\zeta}_{i_k,j_k}$. Furthermore, the $k \times k$ matrix $K$ formed by the coefficient vectors in $\tilde{\zeta}_{i_1,j_1}, \tilde{\zeta}_{i_2,j_2}, \ldots, \tilde{\zeta}_{i_k,j_k}$ needs to be invertible in order to solve for the $k$ messages. The randomness lies in the key servers' selection of the distinct servers $SS_{j_1}, SS_{j_2}, \ldots, SS_{j_k}$ and in the coefficient vectors in $\sigma_{j_1}, \sigma_{j_2}, \ldots, \sigma_{j_k}$. Let $E_1$ be the event that fewer than $k$ distinct storage servers are queried by the key servers. For the generator matrix $G$ implicitly generated by the owner and the storage servers, let $E_2$ be the event that the submatrix $K$ formed by columns $j_1, j_2, \ldots, j_k$ of $G$ is non-invertible. Figure 4.5 shows the probability space of the successful retrieval event. The outer circle represents the sample space. The solid circle represents the event $E_1$ and the inner circle represents the event $E_2$. The event of a successful retrieval is shown as the shaded area. Thus, the probability of a successful retrieval by the owner is

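$$1 - \Pr[E_1] - \Pr[E_2 \mid \overline{E_1}]\,\Pr[\overline{E_1}] \qquad (4.2)$$

where $\overline{E_1}$ denotes the complement of $E_1$; a retrieval succeeds exactly when neither $E_1$ nor $E_2$ occurs.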
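Event $E_2$ concerns whether the $k \times k$ coefficient matrix $K$ is invertible over $GF(p)$. The sketch below checks invertibility by Gaussian elimination modulo a prime. For reference it also evaluates the standard closed form $\prod_{i=1}^{k}(1 - p^{-i})$, which is the probability that a uniformly random matrix over $GF(p)$ is invertible. The system's generator matrix need not be uniformly distributed, so this closed form is only a reference point and not the bound stated in the theorems; the function names are illustrative.

```python
def is_invertible_mod_p(matrix, p):
    """Return True iff the square matrix has full rank over GF(p), p prime."""
    k = len(matrix)
    rows = [[entry % p for entry in row] for row in matrix]
    for col in range(k):
        # Find a row at or below the diagonal with a nonzero entry in this column.
        pivot = next((r for r in range(col, k) if rows[r][col] != 0), None)
        if pivot is None:
            return False  # no pivot: the matrix is singular
        rows[col], rows[pivot] = rows[pivot], rows[col]
        inv = pow(rows[col][col], -1, p)  # modular inverse, Python 3.8+
        for r in range(col + 1, k):
            factor = rows[r][col] * inv % p
            if factor:
                rows[r] = [(a - factor * b) % p for a, b in zip(rows[r], rows[col])]
    return True


def uniform_invertible_probability(k, p):
    """Pr[invertible] for a uniformly random k x k matrix over GF(p)."""
    prob = 1.0
    for i in range(1, k + 1):
        prob *= 1.0 - p ** (-i)
    return prob
```

For example, the matrix with rows (1, 2) and (3, 4) has determinant $-2$, so it is invertible over $GF(7)$ but singular over $GF(2)$. The elimination itself takes $O(k^3)$ field operations.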

We analyze suitable settings of $m$, $v$, and $u$ for $n = ak^{3/2}$ and $n = ak$, respectively, and the results are listed in the following:
