A Link Eavesdropping Prevention Problem in Distributed Network Coded Data Storage Systems

National Chiao Tung University

Institute of Computer Science and Engineering

Master's Thesis

Student: Chen-Hung Liao

Advisor: Prof. Li-Chun Wang

Co-Advisor: Prof. Kuo-Chen Wang

A Link Eavesdropping Prevention Problem in Distributed Network Coded Data Storage Systems

Student: Chen-Hung Liao

Advisor: Li-Chun Wang

Co-Advisor: Kuo-Chen Wang

A Thesis

Submitted to the Institute of Computer Science and Engineering, College of Computer Science, National Chiao Tung University, in Partial Fulfillment of the Requirements for the Degree of Master in Computer Science

July 2013

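A Link Eavesdropping Prevention Problem in Distributed Network Coded Data Storage Systems

Student: Chen-Hung Liao

Advisor: Li-Chun Wang

Co-Advisor: Kuo-Chen Wang

Institute of Computer Science and Engineering, College of Computer Science, National Chiao Tung University

Abstract (translated from the Chinese)

In recent years, cloud computing has developed rapidly and provides more convenient and scalable services, one of which is the cloud distributed storage system. In cloud distributed storage systems, network coding plays a key role because it offers high reliability and low storage cost. However, because network coding requires more remote repair bandwidth, the system faces a serious link eavesdropping problem when a remote backup datacenter takes part in a repair. In this thesis, we propose an optimization-based analytical model for the link eavesdropping problem that yields the theoretical minimum amount of stored data for a given user security requirement. Our results show that there is a theoretical tradeoff between the user security requirement and the storage cost. Under this analytical model, we further investigate the design relationship between the user security requirement and other important storage system parameters.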


A Link Eavesdropping Prevention Problem in

Distributed Network Coded Data Storage Systems

A THESIS Presented to

The Academic Faculty By

Chen-Hung Liao

In Partial Fulfillment

of the Requirements for the Degree of Master in Computer Science

Institute of Computer Science and Engineering College of Computer Science

National Chiao-Tung University

2013


Abstract

In recent years, network coding has played a key role in distributed storage systems because of its high reliability, security, and low storage cost. However, network coding-based distributed storage systems face an eavesdropping problem when repair data are transmitted from remote datacenters. This problem is especially crucial in distributed network coded storage systems because more repair bandwidth and more repair links are required than with conventional replication. In this thesis, we propose an optimization approach to compute the minimum storage according to the required security level. Our numerical results demonstrate that there exists an optimal tradeoff between remote repair bandwidth and storage cost. Moreover, we analyze the relation between the security level requirement and the number of remote and local storage nodes, storage cost, data reliability, and secrecy capacity.


Acknowledgments

I would like to thank my parents and younger brother, who always give me endless support and warm encouragement. I especially thank Professor Li-Chun Wang, who gave me many valuable suggestions for my research during these two years. I could not have finished this work without his guidance and comments. Professor Li-Chun Wang gave me not only suggestions on research but also much advice about life. I really learned a lot.

I am deeply grateful to my laboratory mates, Yin-Ming, Shao-Heng, Cheng-Wen, Yi-Tsen, Yu-Chia, Chia-Yu, and Gen-Hen, and to the junior laboratory mates at the Mobile Communication and Cloud Computing Laboratory at the Graduate Institute of Communications Engineering and the Graduate Institute of Computer Science and Engineering in National Chiao-Tung University. They provided me with much assistance and shared much happiness with me.


List of Tables

2.1 Comparison between replication and erasure coding under the same fault tolerant ability.

5.1 Simulation Parameters for Tradeoff Curve Between Storage per Node and Remote Repair Bandwidth With Local/Remote Link Capacity Ratio m = 2.

5.2 Simulation Parameters for Tradeoff Curve Between Storage per Node and Security Level Requirement With Local/Remote Link Capacity Ratios m = 2.

5.3 Simulation Parameters for Tradeoff Curve Between Storage per Node and Security Level Requirement With Local/Remote Link Capacity Ratios m = 2.5.

5.4 Simulation Parameters for Tradeoff Curve Between Storage per Node and


List of Figures

1.1 The eavesdropping problem for data repairing in inter-datacenter.

2.1 Repair bandwidth using only erasure coding.

2.2 Repair bandwidth using network coding.

3.1 Information flow graph for remote repair.

4.1 Information flow graph for proof of Lemma 1.

4.2 Illustration for lower-bound of minimum cut.

4.3 Gaps between security level of different curves and perfect secrecy condition.

5.1 Optimal tradeoff curve between storage per node and remote repair bandwidth.

5.2 Tradeoff between storage per node and security level requirement with local/remote link capacity ratios m = 2.

5.3 Tradeoff between storage per node and security level requirement with local/remote link capacity ratios m = 2.5.

5.4 Comparison between different capacity ratios of local link and remote link.

5.5 Tradeoff between storage per node and data reliability with local/remote link capacity ratio m = 2.


Glossary of Symbols

• $n$: total number of storage nodes in the storage system.

• $k$: coding parameter of the $(n, k)$ code used in the storage system.

• $\Omega$: original data object size.

• $d$: number of surviving nodes in the storage system.

• $N_L$: number of storage nodes used to store the data in the local datacenter.

• $N_R$: number of storage nodes used to store the data in the remote datacenter.

• $\gamma_R(i)$: remote repair bandwidth for the $i$-th new-comer.

• $\gamma_R$: constant value of remote repair bandwidth.

• $\gamma(i)$: total repair bandwidth for the $i$-th new-comer.

• $\beta_L$: amount of downloaded bits from local datacenters.

• $\beta_R$: amount of downloaded bits from remote datacenters.

• $m$: ratio of capacity of local/remote link.

• $M_L(i)$: number of links from local datacenters for the $i$-th new-comer.

• $M_R(i)$: number of links from remote datacenters for the $i$-th new-comer.

• $M_R$: constant value of the number of links from remote datacenters.

• $\lambda$: security parameter.

• $\sigma$: security level requirement.

• $\alpha$: storage per node.

• $\alpha^*$: minimum storage per node according to remote repair bandwidth.

• $\tilde{\alpha}^*$: minimum storage per node under required security level.

• $\tilde{\gamma}_R$: maximum remote repair bandwidth according to data object size and security level requirement.

• $\gamma_{R,\min}$: minimum remote repair bandwidth.

• $\alpha_{MBR}$: storage per node at minimum remote repair bandwidth point.

• $\alpha_{MSR}$: storage per node at minimum storage point.

• $\gamma_{R,MBR}$: remote repair bandwidth at minimum remote repair bandwidth point.


CHAPTER 1 Introduction

1.1 Motivation

With the flexibility of allocating computing and communications resources, cloud computing is changing the paradigm of the future development of information and communication technologies. Cloud computing provides on-demand measured services through location-independent resource pooling. Cloud storage services, such as Dropbox and Google Drive, are a popular and important cloud application. The benefits of moving data to cloud servers include relieving the burden on storage resources, global data access, and avoiding huge expenditure on infrastructure [1]. OceanStore, a global storage infrastructure, can automatically recover from server and network failures [2]. Users do not have to carry huge amounts of data around; they can simply access the data in the cloud instead.

In a cloud distributed storage system, data are distributed to multiple storage nodes interconnected in a network [3], [4], [5]. Important issues for cloud storage systems are data reliability and security [2], [6]. A common approach to enhancing data reliability is to distribute data across multiple datacenters and introduce redundancy to tolerate possible failures. Furthermore, a mechanism for repairing failures, called the repair process, is essential when a storage node does not function well. A new storage node, called a new-comer, downloads data from other surviving storage nodes and regenerates data to replace the failed node. During a repair process, the number of bits that a new-comer downloads from surviving storage nodes is called the repair bandwidth.

In the literature, different strategies for providing data reliability have been proposed, such as replication and erasure coding [7]. Replication is the simplest and most common form of redundancy; it simply replicates data across multiple storage nodes. However, erasure coding techniques have been shown to achieve higher reliability than replication for the same redundancy [8], [9]. In recent years, network coding techniques, which combine data in intermediate nodes, were proposed to reduce repair bandwidth compared to standard erasure codes [10]. Dimakis et al. further derived a tradeoff between storage and repair bandwidth and showed that network codes can achieve the optimal tradeoff curve [11], [12]. Q. Yu et al. analyzed the tradeoff curve based on Dimakis et al.'s work [13].

However, network coding requires more repair bandwidth than simple replication. Also, a new-comer has to connect to more nodes to download data fragments than with conventional erasure coding. Recent studies (e.g., [14], [15], [16]) considered the storage node eavesdropping (malicious node) problem, in which storage nodes are invaded by an eavesdropper or compromised by an adversary during a repair process. If an eavesdropper observes a node that is added to the system to replace a failed node, it will have access to all the data downloaded during repair, which can potentially compromise the entire information in the system.

1.2 Problem and Solution

In this thesis, instead of focusing on the node eavesdropping problem, we address the link eavesdropping problem for cloud inter-datacenter distributed storage systems. The inter-datacenter scenario means that data are stored in multiple datacenters in different regions to increase reliability, as shown in Figure 1.1. In this way, cloud storage systems can guarantee that data remain accessible and recoverable even if a disaster strikes the local datacenter. This methodology is called remote backup. The local datacenter plays the role of a cache server for the main service, and the remote datacenter is used for remote backup.

There is a security problem during the repair process in this scenario. When a storage node fails in a local datacenter, a new-comer downloads data fragments from local and remote datacenters in different regions and generates new data fragments to replace the failed node. The number of bits that a new-comer downloads from surviving storage nodes in remote datacenters during a repair process is called the remote repair bandwidth. Since the repair data that make up the remote repair bandwidth are transmitted over the Internet, the communication between the local and remote datacenters is susceptible to eavesdropping. An eavesdropper can recover the original data exactly once he/she collects enough network coded data [17]. This thesis focuses on scenarios in which an eavesdropper can gain complete information about the remote repair bandwidth. Under this setting, the remote repair bandwidth is a major factor affecting the system security level. This problem is crucial in network coded distributed storage systems because more repair bandwidth and more repair links are required during the repair process. How can we evaluate and reduce the risk of leaking data to eavesdroppers in this case? Is it possible to reveal no information to eavesdroppers so that the system achieves perfect secrecy? In this thesis, we show that the remote repair bandwidth can be reduced by increasing the storage per node, and we derive the tradeoff curves between remote repair bandwidth and storage. We also give the minimum storage needed to achieve a required security level. We further analyze the relation between the security level requirement and important system parameters such as the number of remote and local storage nodes, storage cost, data reliability, and secrecy capacity.


1.3 Thesis Outline

The rest of this thesis is organized as follows. In Chapter 2, we describe the background of redundancy techniques and network coding, and discuss related work on the node eavesdropping problem. Chapter 3 introduces our system model and problem formulation. In Chapter 4, we give the storage optimization analysis and the relation between the security requirement and some important system parameters. In Chapter 5, we present and discuss the numerical results for the relation between the security requirement and these system parameters. We conclude the thesis and provide some suggestions for future research in Chapter 6.

Figure 1.1: The eavesdropping problem for data repairing in inter-datacenter. (Labels: Local Data Center, Remote Data Center, Remote Repair Bandwidth, Eavesdropper.)

CHAPTER 2 Background

2.1 Replication

Replication is the simplest and most common form of redundancy in reliable storage systems. When a user stores a data object in a distributed storage system based on replication, the system replicates the source data object into r replicas (r is called the replicate ratio) and distributes these replicas to the storage nodes. Every storage node stores an entire copy of the source data object. This method, though simple, has a huge storage cost: it needs r times the storage space of a single data object [9].

2.2 Erasure Coding

Erasure coding is another common way to generate redundancy. Rather than simply replicating the data object, it first divides the data object into k fragments and then encodes them into n encoded fragments, where n > k. These encoded fragments are then distributed to the n storage nodes. A legitimate user can access any k of these encoded fragments and reconstruct the original data object by computation [8]. Erasure coding provides higher reliability and costs less storage space than replication. However, erasure coding requires higher repair bandwidth than replication. Table 2.1 compares replication and erasure coding under the same fault tolerant ability, using a replicate ratio of 4 and a (7,4) erasure code as the example.

Table 2.1: Comparison between replication and erasure coding under the same fault tolerant ability.

| | Replication (replicate 4) | Erasure Coding (7,4) |
|---|---|---|
| Fault Tolerant Ability | r − 1 blocks (3) | n − k blocks (3) |
| Storage Space | k ∗ r blocks (16) | n blocks (7) |
| Repair Bandwidth | 1 block (1) | k blocks (4) |
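The entries of Table 2.1 follow directly from the code parameters. The following Python sketch reproduces them; the function names are illustrative and do not come from the thesis, and all sizes are counted in blocks under the assumption that the data object consists of k blocks.

```python
def replication_costs(k: int, r: int) -> dict:
    """Costs of r-fold replication of a k-block object, in blocks."""
    return {
        "fault_tolerance": r - 1,   # up to r - 1 copies may be lost
        "storage": k * r,           # r full copies of k blocks each
        "repair_bandwidth": 1,      # a lost block is copied from a replica
    }


def erasure_costs(n: int, k: int) -> dict:
    """Costs of an (n, k) erasure code for a k-block object, in blocks."""
    return {
        "fault_tolerance": n - k,   # up to n - k encoded blocks may be lost
        "storage": n,               # n encoded blocks in total
        "repair_bandwidth": k,      # k blocks are fetched to rebuild one block
    }


# The two columns of Table 2.1: replicate ratio 4 versus a (7,4) code.
replication = replication_costs(k=4, r=4)
erasure = erasure_costs(n=7, k=4)
print(replication)
print(erasure)
```

Both schemes tolerate three lost blocks here, which is what makes the storage and repair bandwidth columns comparable.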

Here we give an example to illustrate the repair process using only a (4,2) erasure code (see Fig. 2.1). Consider a storage system that contains four storage nodes, and assume the size of the data object is 4 MB. When the data object is stored, it is first divided into four fragments of equal size and then encoded into eight fragments. Note that a legitimate user can collect any four of these eight fragments to reconstruct the original data object. The eight fragments are then stored across the four storage nodes in a distributed way. During the repair process, a new-comer can connect to any two storage nodes and download four fragments to reconstruct the original data object. Since each storage node stores two fragments, the storage per node is 2 MB. The new-comer downloads four fragments in total, so the repair bandwidth is 4 MB, which is equal to the original data object size [18].

2.3 Network Coding

Network coding is a generalization of the conventional routing (store-and-forward) method [11]. In conventional routing, each intermediate node in the network simply stores and forwards the information received. In contrast, network coding allows the intermediate nodes to generate output data by encoding previously received data. An intermediate node can function as an encoder in the sense that it receives information from all the input links, encodes it, and sends information to all the output links [19]. Thus, network coding allows information to be mixed at intermediate nodes. We refer to coding at a node in a network as network coding. Network coding can be used to improve network robustness [20], [21], [22], network throughput [21], and confidentiality [23].

In recent years, the concept of combining network coding with distributed storage while downloading data fragments has been introduced [11]. Such a coding scheme is called a regenerating code. In this scheme, the repair bandwidth can be reduced significantly [18]. Figure 2.2 gives another example, this time using network coding. The original data object is encoded and stored as in conventional erasure coding. The difference is that the data fragments to be downloaded are put in packets, and the packets in the same storage node are mixed before being transmitted to a new-comer.

Figure 2.1: Repair bandwidth using only erasure coding. (Labels: original data size 4 MB; storage per node 2 MB; repair bandwidth 4 MB.)

In Fig. 2.2, a new-comer connects to three storage nodes. The data fragments are mixed, and three packets are then transmitted to the new-comer. The storage per node is the same as in the example in Fig. 2.1, but the repair bandwidth is reduced to 3 MB. That is, network coding reduces the repair bandwidth by 25%. Previous work further identified a fundamental tradeoff between storage and repair bandwidth [11]. However, as the example shows, network coding still requires more repair bandwidth than simple replication, and a new-comer has to connect to more nodes to download data fragments than with conventional erasure coding.
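The 25% figure can be checked with a few lines of Python. This is only a sketch of the arithmetic in the two examples: the 1 MB size of a fragment or mixed packet is inferred from the 4 MB object being split into four equal fragments, and the function name is illustrative.

```python
FRAGMENT_MB = 4 / 4  # a 4 MB object divided into four equal fragments


def repair_bandwidth_mb(nodes_contacted: int, units_per_node: int,
                        unit_mb: float = FRAGMENT_MB) -> float:
    """Total amount of data a new-comer downloads in one repair, in MB."""
    return nodes_contacted * units_per_node * unit_mb


# Erasure coding only (Fig. 2.1): two nodes, two fragments from each.
erasure_mb = repair_bandwidth_mb(nodes_contacted=2, units_per_node=2)
# Network coding (Fig. 2.2): three nodes, one mixed packet from each.
network_mb = repair_bandwidth_mb(nodes_contacted=3, units_per_node=1)

# Relative saving of network coding over plain erasure coding.
reduction = (erasure_mb - network_mb) / erasure_mb
print(erasure_mb, network_mb, reduction)
```

The same arithmetic also shows the cost noted above: the new-comer contacts three nodes instead of two, so more links carry repair data.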

2.4 Literature Survey

Storage nodes in a distributed storage system may not be secure and may be susceptible to an intruder that can eavesdrop on the nodes and possibly modify their data. The intruder can observe a node that is added to the system to replace a failed node and can access all the data downloaded during repair, which can potentially compromise the entire information in the system. T.K. Dikaliotis et al. and K. Rashmi et al. address the problem of maintaining an encoded distributed storage system when some nodes contain errors or erasures, and give the maximum number of detectable and tolerable errors and erasures [24], [25]. Y. Wu et al. present techniques for constructing codes that achieve the optimal tradeoffs between storage efficiency and repair bandwidth [26]. S. Jaggi et al. point out that if a network coding scheme contains hidden malicious nodes that can eavesdrop on transmissions and inject fake information, a decoding error will result [27], [28]. S. Pawar et al. determine the secrecy capacity (i.e., the maximum amount of data that can be securely stored and made available to a legitimate user without revealing any information to any eavesdropper) of distributed storage systems under repair dynamics [14]. N.B. Shah et al. provide explicit product-matrix code constructions that achieve the information-theoretic secrecy capacity [15]. T. Ernvall et al. study the secrecy capacity of heterogeneous distributed storage systems (i.e., nodes have different storage capacities and different repair bandwidths) in which nodes may be compromised by an eavesdropper [29]. Upper bounds on the maximum amount of information that can be stored safely on a distributed storage system against a passive eavesdropper observing a fixed number of nodes are given in [14], [30].

However, all previous works focus on node eavesdropping within a cloud datacenter in a single region. That is, eavesdroppers can invade cloud datacenters and observe data downloaded by the new-comer node or data stored in the surviving nodes. In this thesis, we consider the inter-datacenter scenario, where data are distributed across datacenters in different regions, and identify a link eavesdropping problem that arises when repair data are transmitted over an untrusted wide area network. This problem is crucial in network coded distributed storage systems because more repair bandwidth and more repair links are required.

CHAPTER 3 System Model and Problem Formulation

3.1 System Model

3.1.1 System Scenario

We now introduce the system scenario and the notation used in this thesis. The considered inter-datacenter scenario consists of a local datacenter and a remote datacenter. We assume that there are $n$ storage nodes in total in the two datacenters, with $N_L$ storage nodes in the local datacenter and $N_R$ storage nodes in the remote datacenter.

The system scenario is as follows. A user uploads a data object of size $\Omega$ to the datacenters. The data are encoded with an $(n, k)$ code and then distributed to the storage nodes in the local and remote datacenters. Each storage node stores encoded data fragments of size $\alpha$. We assume that some storage nodes in the local datacenter have failed and that $d$ storage nodes survive in total. To maintain the same level of reliability, the system creates new-comer nodes to replace the failed nodes. The $i$-th new-comer node connects to $M_L(i)$ surviving nodes in the local datacenter and $M_R(i)$ surviving nodes in the remote datacenter, so $M_L(i) + M_R(i) = d$. A new-comer downloads $\beta_L$ bits from each node in the local datacenter and $\beta_R$ bits from each node in the remote datacenter. Without loss of generality, we let $\beta_L = m\beta_R$, where $m \ge 1$, since the capacity of a local link is larger than that of a remote link. Furthermore, we define the remote repair bandwidth for the $i$-th new-comer as $\gamma_R(i) = M_R(i)\beta_R$.
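The quantities defined above can be collected in a small Python sketch. The class and attribute names are illustrative, and the expression for the total repair bandwidth (local plus remote downloads) is an assumption based on the glossary entry for $\gamma(i)$ rather than a formula stated in this section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RepairLinks:
    """Links used by the i-th new-comer (all names are illustrative)."""
    m_local: int        # M_L(i): surviving local nodes contacted
    m_remote: int       # M_R(i): surviving remote nodes contacted
    beta_remote: float  # beta_R: bits downloaded from each remote node
    m: float            # local/remote link capacity ratio

    def __post_init__(self):
        if self.m < 1:
            raise ValueError("the model assumes m >= 1")

    @property
    def d(self) -> int:
        # M_L(i) + M_R(i) = d
        return self.m_local + self.m_remote

    @property
    def beta_local(self) -> float:
        # beta_L = m * beta_R
        return self.m * self.beta_remote

    @property
    def remote_repair_bandwidth(self) -> float:
        # gamma_R(i) = M_R(i) * beta_R
        return self.m_remote * self.beta_remote

    @property
    def total_repair_bandwidth(self) -> float:
        # Assumption: gamma(i) = M_L(i) * beta_L + M_R(i) * beta_R
        return self.m_local * self.beta_local + self.remote_repair_bandwidth


links = RepairLinks(m_local=3, m_remote=2, beta_remote=0.5, m=2)
print(links.d, links.beta_local, links.remote_repair_bandwidth)
```

Only `remote_repair_bandwidth` crosses the untrusted link between datacenters, which is why the later analysis treats it separately from the total.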

Finally, we define the security parameter $\lambda$ as the probability that the user's data can be reconstructed by an eavesdropper during the repair process, and the user-specified security level requirement $\sigma$ as the rate at which the storage system must prevent an eavesdropper who observes the remote repair bandwidth from reconstructing the original data. Both the security parameter and the security level requirement are introduced in Section 4.3.

3.1.2 Information Flow Graph

Now we are ready to model the link eavesdropping problem. We first introduce the information flow graph for the considered inter-datacenter scenario. In the later section, we will derive the tradeoff curve by solving an optimization problem subject to a sufficient flow constraint. Finally, we give the minimum storage per node for achieving required security level.

We model the inter-datacenter scenario as an information flow graph. Our model is based on this particular graphical representation of a distributed inter-datacenter storage system. The information flow graph describes how the information of the data object is communicated through the storage network, stored in nodes with limited storage, and reaches reconstruction points at the data collectors.

The information flow graph is a directed acyclic graph consisting of three kinds of nodes: a single data source node $S$, storage node components $x^i_{in}$ and $x^i_{out}$, and data collectors $DC_i$. The $i$-th storage node in the system is represented by a storage input node $x^i_{in}$ and a storage output node $x^i_{out}$ in the graph. These two nodes are connected by a directed edge $(x^i_{in}, x^i_{out})$ with capacity equal to the amount of data stored at the $i$-th storage node. The capacity of each storage node is $\alpha$.

When a storage node fails, a repair process is initiated to repair the failed node. Consequently, the flow graph is dynamic and evolves with time. At any given time, the activity of a node in the information flow graph depends on whether the node has failed. In the initial state, only the source node $S$ is active; it chooses an initial set of storage nodes and connects to their input nodes $x^i_{in}$ with outgoing directed edges of infinite capacity. From this point onwards, the original source node becomes inactive and the initially chosen storage nodes become active. When the $i$-th storage node fails in the system, the corresponding components $x^i_{in}$ and $x^i_{out}$ become inactive in the graph. New-comer nodes join the system and connect to active nodes. The components of the $j$-th new-comer node are represented as $x^j_{in}$ and $x^j_{out}$, with the edge $(x^j_{in}, x^j_{out})$ added to the information flow graph. Figure 3.1 shows the information flow graph with a new-comer. The new-comer connects to $d$ active nodes to download the encoded data. If the $j$-th new-comer node connects to the $i$-th storage node, we add a directed edge from $x^i_{out}$ to $x^j_{in}$ with capacity equal to the amount of information communicated from node $i$ to the new-comer. We denote the capacity of this edge as $\beta_L$ if the new-comer connects to a storage node in the local datacenter and as $\beta_R$ if it connects to a storage node in the remote datacenter. A data collector (DC) is represented by a node connected to $k$ active storage output nodes through infinite-capacity links, enabling it to download all their stored data and reconstruct the original data object.
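The construction described above maps naturally onto a dictionary of edge capacities. The following Python sketch is one possible representation, not the thesis's implementation: the class, its method names, the node labels, and the numeric values are all illustrative, and inactive nodes are modeled simply by never attaching new edges to them.

```python
import math

INF = math.inf  # capacity of source and data collector edges


class FlowGraph:
    """Information flow graph stored as cap[u][v] = capacity of edge u -> v."""

    def __init__(self):
        self.cap = {}

    def _edge(self, u, v, c):
        self.cap.setdefault(u, {})[v] = c
        self.cap.setdefault(v, {})

    def add_storage_node(self, i, alpha):
        """Split storage node i into x_in -> x_out with capacity alpha."""
        self._edge(f"x{i}_in", f"x{i}_out", alpha)

    def add_initial_node(self, i, alpha):
        """Initial node: the source S feeds x_in over an infinite edge."""
        self.add_storage_node(i, alpha)
        self._edge("S", f"x{i}_in", INF)

    def add_newcomer(self, j, alpha, helpers):
        """helpers maps a surviving node index to beta_L or beta_R."""
        self.add_storage_node(j, alpha)
        for i, beta in helpers.items():
            self._edge(f"x{i}_out", f"x{j}_in", beta)

    def add_data_collector(self, name, nodes):
        """Connect k active output nodes to a data collector."""
        for i in nodes:
            self._edge(f"x{i}_out", name, INF)


g = FlowGraph()
for i in (1, 2, 3, 4):  # nodes 1-2 local, 3-4 remote (illustrative)
    g.add_initial_node(i, alpha=2)
# Node 1 fails; new-comer 5 downloads beta_L = 2 from local node 2
# and beta_R = 1 from each of the remote nodes 3 and 4, so d = 3.
g.add_newcomer(5, alpha=2, helpers={2: 2, 3: 1, 4: 1})
g.add_data_collector("DC", nodes=[2, 5])  # k = 2
print(g.cap["x5_in"], g.cap["x2_out"])
```

Splitting each storage node into an input and an output vertex is what lets the per-node storage limit $\alpha$ appear as an ordinary edge capacity.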

Figure 3.1: Information flow graph for remote repair. (Legend: $\alpha$: number of stored bits per node; $\beta_L$: amount of downloaded bits from local datacenters; $\beta_R$: amount of downloaded bits from remote datacenters.)

3.2 Problem Formulation

In an inter-datacenter distributed storage system, the repair process is executed when a storage node fails. The new-comer node gathers data fragments from the local and remote datacenters to replace the failed node. Downloading data fragments from remote datacenters may be risky because of eavesdropping, so the remote repair bandwidth is an important factor in privacy leakage: it determines the amount of information that eavesdroppers can obtain. Eavesdroppers can recover the original data object by collecting sufficient information from the remote repair bandwidth. The more information they obtain, the higher the probability that they can decode the original data object.

Therefore, the storage system security level is related to remote repair bandwidth. Our objective is to analyze the minimum storage per node under a required security level. In this thesis, we make four assumptions to discuss our problem:

• When a user uploads his/her data to the storage system, the system coding method $(n, k)$ is decided. After the system finishes storing the encoded data, the values of $N_L$ and $N_R$ are also decided. Note that we always assume that the eavesdropper has complete knowledge of the code and the repair scheme implemented in the system.

• We assume that a node fails in the local datacenter, so the storage system executes the repair process to add a new-comer node to replace the failed node. The number of surviving nodes $d$ is then decided.

• Because the storage system can cache frequently used data or use a proxy server to hold data temporarily, more data tend to be stored in local datacenters. We consider the case in which the number of storage nodes in the local datacenter is larger than or equal to $k$, that is, $N_L \ge k$.

• We consider the worst case, in which a new-comer node downloads data fragments from all the storage nodes in the remote datacenter, which causes the largest remote repair bandwidth $M_R(i)\beta_R$. Therefore, we let each $M_R(i)$ be a constant value $M_R$, which equals $N_R$. Then $M_L(i)$ can be written as $M_L(i) = d - M_R(i) = d - M_R = d - (n - N_L) = N_L - (n - d)$, which is also a constant value $M_L$.
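The last assumption fixes the number of local and remote links once $n$, $d$, and $N_L$ are known. A minimal Python sketch of that bookkeeping is given below; the function name and the example values are illustrative and not taken from the thesis.

```python
def worst_case_links(n: int, d: int, n_local: int) -> tuple[int, int]:
    """Return (M_L, M_R) when a new-comer contacts every remote node."""
    n_remote = n - n_local   # N_R = n - N_L
    m_remote = n_remote      # worst case: M_R = N_R
    m_local = d - m_remote   # M_L = d - M_R = N_L - (n - d)
    if m_local < 0:
        raise ValueError("d is smaller than the number of remote nodes")
    return m_local, m_remote


# Illustrative values: 10 nodes in total, 6 of them local, 9 surviving.
n, d, n_local = 10, 9, 6
m_local, m_remote = worst_case_links(n, d, n_local)
print(m_local, m_remote)
```

With these values one local node has failed, so the new-comer uses the five remaining local nodes and all four remote nodes.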


CHAPTER 4 Analysis

In this chapter, we use an optimization technique to analyze the minimum storage per node under the user-specified security level requirement in an inter-datacenter distributed storage system. We first derive the storage optimization constraint. Second, we solve the optimization problem and find the tradeoff between storage per node and remote repair bandwidth. Finally, we define the security parameter and the security level requirement, and find the relation between storage per node and the security level requirement.

4.1 Storage Optimization Constraint

4.1.1 Minimum Cut in Information Flow Graph

We now introduce the minimum cut of the information flow graph. A cut in the graph between the source $S$ and a fixed data collector node $DC$ is a subset $C$ of edges such that every directed path from $S$ to $DC$ contains at least one edge in $C$. The minimum cut is the cut between $S$ and $DC$ whose total edge capacity is smallest.

4.1.2 Flow Constraint

Here we derive the flow constraint for the considered optimization problem and then give the solution steps of the optimization problem. We define the flow constraint as follows.

Definition 1 (flow constraint) A data collector that reconstructs the original data object successfully must satisfy the constraint

$$\mathrm{mincut}(S, DC) \ge \Omega, \tag{4.1}$$

where $\Omega$ is the original data object size. That is, no data collector $DC$ can reconstruct the original data object if the minimum cut in the information flow graph between $S$ and $DC$ is smaller than the original data object size $\Omega$. The information of the original data object must be transmitted from the source to the particular data collector, and every link in the information flow graph can be used at most once. If the point-to-point capacity between $S$ and $DC$ is less than the data object size, a standard cut-set bound shows that the entropy of the data object conditioned on everything observable to the data collector is nonzero. Therefore, it is impossible for the data collector to reconstruct the original data object.
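Constraint (4.1) can be checked numerically on a small graph, because the minimum cut equals the maximum flow by the max-flow min-cut theorem. The Python sketch below uses a textbook Edmonds-Karp routine for this purpose; the algorithm choice, the node labels, and the numeric values are assumptions made for illustration, and a large finite constant stands in for the infinite-capacity edges.

```python
from collections import deque


def max_flow(cap, s, t):
    """Edmonds-Karp maximum flow, which equals mincut(s, t)."""
    # Residual capacities, with reverse edges initialised to 0.
    res = {u: dict(vs) for u, vs in cap.items()}
    for u, vs in cap.items():
        for v in vs:
            res.setdefault(v, {}).setdefault(u, 0)
    flow = 0
    while True:
        # Breadth-first search for a shortest augmenting path.
        parent = {s: None}
        queue = deque([s])
        while queue and t not in parent:
            u = queue.popleft()
            for v, c in res[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            return flow
        # Recover the path and push its bottleneck capacity.
        path, v = [], t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        push = min(res[u][v] for u, v in path)
        for u, v in path:
            res[u][v] -= push
            res[v][u] += push
        flow += push


BIG = 10**9  # stands in for the infinite-capacity edges
alpha, beta_L, beta_R = 2, 2, 1  # illustrative values with m = 2
# Local node 2 and remote nodes 3, 4 survive; new-comer 5 replaces a
# failed local node; the data collector reads nodes 2 and 5 (k = 2).
cap = {
    "S": {"x2_in": BIG, "x3_in": BIG, "x4_in": BIG},
    "x2_in": {"x2_out": alpha},
    "x3_in": {"x3_out": alpha},
    "x4_in": {"x4_out": alpha},
    "x2_out": {"x5_in": beta_L, "DC": BIG},
    "x3_out": {"x5_in": beta_R},
    "x4_out": {"x5_in": beta_R},
    "x5_in": {"x5_out": alpha},
    "x5_out": {"DC": BIG},
}
omega = 4
mincut = max_flow(cap, "S", "DC")
print(mincut, mincut >= omega)
```

For these values the minimum cut is 4, formed by the two storage edges of capacity $\alpha$ in front of the data collector, so a data object of size $\Omega = 4$ just satisfies (4.1).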

4.1.3 Lower-bound of Minimum Cut

We introduce Lemma 1 (lower bound of the minimum cut) to bound from below the value of the minimum cut in the information flow graph for the considered scenario.

Lemma 1 Consider any information flow graph formed by having initial nodes (including local and remote storage nodes) that connect directly to the source and obtain bits, while additional nodes join the graph by connecting to existing nodes and obtaining bits. Any data collector that connects to a $k$-subset of the output nodes in the graph must satisfy

$$\mathrm{mincut}(S, DC) \ge \sum_{i=0}^{k-1} \min\{\alpha, (M_L(i) - i)\beta_L + M_R(i)\beta_R\}. \tag{4.2}$$

We give the proof of Lemma 1 as follows. First, we show that there exists an information flow graph (see Fig. 4.1) in which the bound (4.2) is met with equality. We assume there are initially $n$ nodes labeled from 1 to $n$ in this graph, and then $k$ new-comers labeled $n+1, \ldots, n+k$ are added. The new-comer node $n+i+1$ connects to nodes $n+i+1-d, \ldots, n+i$, and a data collector $DC$ connects to the last $k$ nodes, i.e., nodes $n+1, \ldots, n+k$. Consider a cut $(E, \bar{E})$ defined as follows. For each $i \in \{0, \ldots, k-1\}$, if $\alpha \le (M_L(i) - i)\beta_L + M_R(i)\beta_R$, then we include $x^{n+i+1}_{out}$ in $\bar{E}$. Otherwise, we include $x^{n+i+1}_{out}$ and $x^{n+i+1}_{in}$ in $\bar{E}$. This cut $(E, \bar{E})$ achieves (4.2) with equality.

Second, we show that (4.2) must be satisfied for any graph $G$ formed by adding nodes of in-degree $d$ as described above. Consider a data collector $DC$ that connects to a $k$-subset of output nodes, say $\{x^i_{out} : i \in I\}$. We want to show that the capacity of any $S$-$DC$ cut in $G$ is at least

$$\sum_{i=0}^{k-1} \min\{\alpha, (M_L(i) - i)\beta_L + M_R(i)\beta_R\}. \tag{4.3}$$

Since all the incoming edges of $DC$ have infinite capacity, we only need to examine the cuts $(E, \bar{E})$ with $S \in E$ satisfying

$$x^i_{out} \in \bar{E}, \quad \forall i \in I. \tag{4.4}$$

Let C denote the edges in the cut, i.e., the set of edges going from E to E. We apply the topological sorting concept in following. There exists a topological sorting in any directed acyclic graph, where a topological sorting (or acyclic ordering) is an ordering of its vertices such that the existence of an edge from vi to vj implies i < j. Let x1out be the

Figure 4.1: Information flow graph for proof of Lemma 1.

1. Case 1: If $x^{1}_{\mathrm{in}} \in E$, then the edge $(x^{1}_{\mathrm{in}}, x^{1}_{\mathrm{out}})$ must be in $C$.

2. Case 2: If $x^{1}_{\mathrm{in}} \in \bar{E}$, since $x^{1}_{\mathrm{in}}$ has an in-degree of $d$ and it is the topologically first node in $\bar{E}$, all the incoming edges of $x^{1}_{\mathrm{in}}$ must be in $C$.

Therefore, the edges related to $x^{1}_{\mathrm{out}}$ contribute a value of at least $\min\{\alpha,\ M_L(0)\beta_L + M_R(0)\beta_R\}$ to the cut capacity. Now we consider $x^{2}_{\mathrm{out}}$, the topologically second output node in $\bar{E}$. Similar to the above, we consider two cases:

1. Case 1: If $x^{2}_{\mathrm{in}} \in E$, then the edge $(x^{2}_{\mathrm{in}}, x^{2}_{\mathrm{out}})$ must be in $C$.

2. Case 2: If $x^{2}_{\mathrm{in}} \in \bar{E}$, since at most one of the incoming edges of $x^{2}_{\mathrm{in}}$ can be from $x^{1}_{\mathrm{out}}$, at least $d - 1$ incoming edges of $x^{2}_{\mathrm{in}}$ must be in $C$.

Therefore, the edges related to $x^{2}_{\mathrm{out}}$ contribute a value of at least $\min\{\alpha,\ (M_L(1) - 1)\beta_L + M_R(1)\beta_R\}$ to the cut capacity. Following the same reasoning, we conclude that for the $i$-th node ($i = 0, \ldots, k-1$) in the sorted set $\bar{E}$, either one edge of capacity $\alpha$ or $M_L(i) - i$ edges of capacity $\beta_L$ together with $M_R(i)$ edges of capacity $\beta_R$ must be in $C$. Expression (4.3) sums exactly these contributions, which gives the lower bound on the value of the minimum cut. An illustration is given in Figure 4.2.
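The bound of Lemma 1 is straightforward to evaluate numerically. The following is a minimal sketch, not part of the original analysis: the function name `mincut_lower_bound`, the callable interface for $M_L(i)$ and $M_R(i)$, and the numeric values are illustrative assumptions.

```python
def mincut_lower_bound(alpha, beta_l, beta_r, m_l, m_r, k):
    """Evaluate the right-hand side of (4.2).

    m_l and m_r are callables (a hypothetical interface) that return
    M_L(i) and M_R(i), the numbers of local and remote helper nodes.
    """
    total = 0.0
    for i in range(k):
        # Capacity cut when the input node of the i-th new-comer is separated
        # from its helpers: (M_L(i) - i) local edges plus M_R(i) remote edges.
        repair_term = (m_l(i) - i) * beta_l + m_r(i) * beta_r
        # Otherwise the single storage edge of capacity alpha is cut.
        total += min(alpha, repair_term)
    return total


# Illustrative values only: one local helper, three remote helpers, k = 2.
bound = mincut_lower_bound(alpha=4.0, beta_l=2.0, beta_r=1.0,
                           m_l=lambda i: 1, m_r=lambda i: 3, k=2)
print(bound)
```

With these assumed values the two summands are min{4, 5} and min{4, 3}, so the first new-comer is limited by its storage and the second by its repair links.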

4.1.4 Storage Optimization Constraint

From the flow constraint and Lemma 1, we can obtain Lemma 2:

Lemma 2

$$\sum_{i=0}^{k-1} \min\{\alpha,\ (M_L(i) - i)\beta_L + M_R(i)\beta_R\} \ge \Omega. \tag{4.5}$$

Figure 4.2: Illustration for lower-bound of minimum cut.

We give the proof of Lemma 2 as follows. By Lemma 1,

$$\sum_{i=0}^{k-1} \min\{\alpha,\ (M_L(i) - i)\beta_L + M_R(i)\beta_R\}$$

is the smallest value that the minimum cut can take, and this value is attained by some information flow graph. The flow constraint requires every minimum cut to be at least the original data object size $\Omega$, so this value must be larger than or equal to $\Omega$.

In this thesis, we use Lemma 2 as our storage optimization constraint. We show the solution in the next section.

4.2 Tradeoff Between Storage per Node and Remote Repair Bandwidth

4.2.1 Repair Bandwidth

Based on the information flow graph, we analyze the relation between storage per node and remote repair bandwidth by solving an optimization problem. As described in Section 3.2, we make some assumptions before solving it. We assume that a node in the local datacenter has failed and that the storage system executes the repair process to add a new-comer node that replaces the failed node. One important observation is that the repair bandwidth in a network coding based storage system can be reduced when the new-comer communicates with more storage nodes: the size of each communicated packet shrinks fast enough that the repair bandwidth decreases as $d$ increases, and it is therefore minimal for $d = n - 1$, i.e., when the new-comer connects to all surviving nodes. Most network coded storage systems favor this setting. We also consider the worst case, which maximizes the remote repair bandwidth $M_R(i)\beta_R$, so the new-comer connects to as many surviving storage nodes in the remote datacenter as possible.

We let $M_R(i)$ be a constant $M_R = N_R$ for all $i$, and then calculate $M_L(i) = d - M_R(i) = d - M_R = d - (n - N_L) = N_L - (n - d)$, which is also a constant, denoted $M_L$. Furthermore, we let $\beta_L = m\beta_R$ with $m \ge 1$, assuming without loss of generality that the capacity of a local link is at least that of a remote link. Then the total repair bandwidth is $\gamma(i) = M_L(i)\beta_L + M_R(i)\beta_R = \beta_R(dm - mN_R + N_R) = \gamma$, and the remote repair bandwidth is $\gamma_R(i) = M_R(i)\beta_R = M_R\beta_R = N_R\beta_R = \beta_R(n - N_L) = \gamma_R$.
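The bookkeeping in this subsection can be checked with a few lines of code. This is only a sketch under the assumptions stated above ($M_R = N_R$, $\beta_L = m\beta_R$); the function name `repair_bandwidths` and the example values are hypothetical.

```python
def repair_bandwidths(n, d, n_local, m, beta_r):
    """Return (M_L, M_R, gamma, gamma_R) for the worst-case repair setting."""
    n_remote = n - n_local        # N_R
    m_r = n_remote                # M_R = N_R: every remote node helps
    m_l = d - m_r                 # M_L = d - M_R = N_L - (n - d)
    beta_l = m * beta_r           # local links carry m times the remote amount
    gamma = m_l * beta_l + m_r * beta_r   # total repair bandwidth
    gamma_r = m_r * beta_r                # remote repair bandwidth
    return m_l, m_r, gamma, gamma_r


# Example: n = 10, d = 9, six local and four remote nodes, m = 2, beta_R = 1.
m_l, m_r, gamma, gamma_r = repair_bandwidths(n=10, d=9, n_local=6, m=2, beta_r=1.0)
print(m_l, m_r, gamma, gamma_r)
```

The value of `gamma` agrees with the closed form $\beta_R(dm - mN_R + N_R)$ given in the text.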

4.2.2 Storage Optimization Solution Steps

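Here, we find the whole region of feasible points $(\alpha, \gamma_R)$ and then select the one that minimizes the storage $\alpha$. From Section 4.1.4, we have Lemma 2 as our storage optimization constraint:

$$\sum_{i=0}^{k-1} \min\{\alpha,\ (M_L(i) - i)\beta_L + M_R(i)\beta_R\} \ge \Omega.$$

Our storage optimization constraint (4.5) can be rewritten explicitly as follows:

$$\begin{aligned}
& \sum_{i=0}^{k-1} \min\{\alpha,\ (M_L(i) - i)\beta_L + M_R(i)\beta_R\} \ge \Omega \\
\Rightarrow\ & \sum_{i=0}^{k-1} \min\{\alpha,\ m\beta_R(d - M_R(i) - i) + M_R(i)\beta_R\} \ge \Omega \\
\Rightarrow\ & \sum_{i=0}^{k-1} \min\{\alpha,\ \beta_R(md - mM_R(i) - mi + M_R(i))\} \ge \Omega \\
\Rightarrow\ & \sum_{i=0}^{k-1} \min\left\{\alpha,\ \left(\frac{md - mi}{M_R} - (m - 1)\right)\gamma_R\right\} \ge \Omega,
\end{aligned}$$

where

$$\gamma_R = M_R(i)\beta_R = M_R\beta_R = N_R\beta_R.$$

We simplify the notation to make the detailed steps easier to follow. We let

$$b_i = \left(\frac{md - mk + mi + m}{M_R} - (m - 1)\right)\gamma_R, \quad \text{for } i = 0, 1, 2, \ldots, k - 1, \tag{4.6}$$

so that $b_i$ is the term with index $k - 1 - i$ in the sum above and $b_0 \le b_1 \le \cdots \le b_{k-1}$.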

Then the problem is to minimize $\alpha$ subject to the constraint

$$\sum_{i=0}^{k-1} \min\{\alpha, b_i\} \ge \Omega. \tag{4.7}$$

The left-hand side of (4.7) is a piecewise-linear function of $\alpha$:

$$C(\alpha) = \begin{cases}
k\alpha, & \alpha \in [0, b_0] \\
b_0 + (k - 1)\alpha, & \alpha \in (b_0, b_1] \\
\quad\vdots \\
b_0 + b_1 + \cdots + b_{k-2} + \alpha, & \alpha \in (b_{k-2}, b_{k-1}] \\
b_0 + b_1 + \cdots + b_{k-1}, & \alpha \in (b_{k-1}, \infty).
\end{cases} \tag{4.8}$$

$C(\alpha)$ is strictly increasing from 0 to its maximum value $b_0 + b_1 + \cdots + b_{k-1}$ as $\alpha$ increases from 0 to $b_{k-1}$. To find the minimum $\alpha$ such that $C(\alpha) \ge \Omega$, we let $\alpha^* = C^{-1}(\Omega)$ if $\Omega \le b_0 + b_1 + \cdots + b_{k-1}$:

$$\alpha^* = \begin{cases}
\dfrac{\Omega}{k}, & \Omega \in [0, kb_0] \\
\dfrac{\Omega - b_0}{k - 1}, & \Omega \in (kb_0,\ b_0 + (k - 1)b_1] \\
\quad\vdots \\
\Omega - \sum_{j=0}^{k-2} b_j, & \Omega \in \left(\sum_{j=0}^{k-2} b_j + b_{k-2},\ \sum_{j=0}^{k-1} b_j\right].
\end{cases} \tag{4.9}$$

For $i = 1, \ldots, k - 1$, the $i$-th condition in the above expression is

$$\alpha^* = \frac{\Omega - \sum_{j=0}^{i-1} b_j}{k - i}, \quad \text{for } \Omega \in \left(\sum_{j=0}^{i-1} b_j + (k - i)b_{i-1},\ \sum_{j=0}^{i} b_j + (k - i - 1)b_i\right]. \tag{4.10}$$

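Note from the definition of $\{b_i\}$ in (4.6) that

$$\begin{aligned}
\sum_{j=0}^{i-1} b_j &= \sum_{j=0}^{i-1} \left(\frac{md - mk + mj + m}{M_R} - (m - 1)\right)\gamma_R \\
&= i\gamma_R\left(\frac{md - mk + m}{M_R} - (m - 1)\right) + \frac{m\,i(i - 1)}{2M_R}\gamma_R \\
&= i\gamma_R\,\frac{2md - 2mk - 2(m - 1)M_R + mi + m}{2M_R} \\
&= \gamma_R\, g(i)
\end{aligned} \tag{4.11}$$

and

$$\begin{aligned}
\sum_{j=0}^{i} b_j + (k - i - 1)b_i &= \gamma_R (i + 1)\,\frac{2md - 2mk - 2(m - 1)M_R + m(i + 1) + m}{2M_R} \\
&\quad + (k - i - 1)\gamma_R\,\frac{2md - 2mk + 2mi + 2m - 2(m - 1)M_R}{2M_R} \\
&= \gamma_R\,\frac{i(2mk - mi - m) + k(2md - 2mk + 2m - 2M_R(m - 1))}{2M_R} \\
&= \gamma_R\,\frac{\Omega}{f(i)}.
\end{aligned} \tag{4.12}$$

Then we have the expressions of $f(i)$ and $g(i)$ as follows:

$$f(i) = \frac{2\Omega M_R}{i(2mk - mi - m) + k(2md - 2mk + 2m - 2M_R(m - 1))}, \tag{4.13}$$

and

$$g(i) = i\,\frac{2md - 2mk - 2(m - 1)M_R + mi + m}{2M_R}. \tag{4.14}$$

We substitute (4.11) and (4.12) into (4.10). Hence, we obtain another expression of (4.10) that is easier to write:

$$\alpha^* = \frac{\Omega - g(i)\gamma_R}{k - i}, \quad \text{for } \Omega \in \left(\frac{\Omega\gamma_R}{f(i - 1)},\ \frac{\Omega\gamma_R}{f(i)}\right]. \tag{4.15}$$

This gives the relation between storage per node and remote repair bandwidth, which we introduce in the next subsection.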
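The inversion of $C(\alpha)$ in (4.8) to (4.10) can also be carried out numerically, which is a convenient check on the algebra. The sketch below assumes the definition of $b_i$ in (4.6); the function names `b_values`, `c_of_alpha`, and `min_storage` and the example parameters are illustrative choices, not taken from the thesis.

```python
def b_values(k, d, m, m_r, gamma_r):
    """b_i of (4.6) for i = 0, ..., k-1; the list is non-decreasing in i."""
    return [((m * d - m * k + m * i + m) / m_r - (m - 1)) * gamma_r
            for i in range(k)]


def c_of_alpha(alpha, b):
    """Left-hand side of (4.7), i.e. the piecewise-linear C(alpha) of (4.8)."""
    return sum(min(alpha, b_i) for b_i in b)


def min_storage(omega, b):
    """Smallest alpha with C(alpha) >= omega, or None if omega > sum(b)."""
    k = len(b)
    if omega > sum(b):
        return None                        # constraint (4.7) cannot be met
    if omega <= k * b[0]:
        return omega / k                   # first branch of (4.9)
    prefix = 0.0                           # running sum b_0 + ... + b_{i-1}
    for i in range(1, k):
        prefix += b[i - 1]
        if omega <= prefix + (k - i) * b[i]:        # omega <= C(b_i)
            return (omega - prefix) / (k - i)       # branch i of (4.10)
    return None


# Example: (10, 5) code, d = 9, m = 2, M_R = 5, gamma_R = 0.15, Omega = 1.
b = b_values(k=5, d=9, m=2, m_r=5, gamma_r=0.15)
alpha_star = min_storage(omega=1.0, b=b)
print(alpha_star, c_of_alpha(alpha_star, b))
```

For these assumed values the solution falls in the branch $i = 2$, and evaluating $C$ at the returned $\alpha^*$ recovers $\Omega$.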

4.2.3 Tradeoff Between Storage per Node and Remote Repair Bandwidth

In our optimization problem, we fix $d$, $m$, $M_R$, and $\gamma_R$ and minimize the storage per node $\alpha$. Note that the parameters $n$, $k$, $d$, $M_R$ are integers, and $\Omega$, $\gamma_R$, $m$ are real numbers.

$$\alpha^* \triangleq \min \alpha, \quad \text{subject to: } \sum_{i=0}^{k-1} \min\left\{\alpha,\ \left(\frac{md - mi}{M_R} - (m - 1)\right)\gamma_R\right\} \ge \Omega. \tag{4.16}$$

Our objective function is

$$\min \alpha, \tag{4.17}$$

and the constraint is

$$\sum_{i=0}^{k-1} \min\left\{\alpha,\ \left(\frac{md - mi}{M_R} - (m - 1)\right)\gamma_R\right\} \ge \Omega. \tag{4.18}$$

After solving this mixed integer programming problem (4.16) using result (4.15) and expressing the intervals in terms of the remote repair bandwidth, we get the minimum storage $\alpha^*$ as a function of the remote repair bandwidth:

$$\alpha^* = \begin{cases}
\dfrac{\Omega}{k}, & \gamma_R \in [f(0), \infty) \\
\dfrac{\Omega - g(i)\gamma_R}{k - i}, & \gamma_R \in [f(i), f(i - 1)),\ i = 1, \ldots, k - 1,
\end{cases} \tag{4.19}$$

where

$$f(i) = \frac{2\Omega M_R}{i(2mk - mi - m) + k(2md - 2mk + 2m - 2M_R(m - 1))}, \tag{4.20}$$

$$g(i) = i\,\frac{2md - 2mk - 2(m - 1)M_R + mi + m}{2M_R}. \tag{4.21}$$

Because the function $f(i)$ decreases as $i$ increases, the minimum remote repair bandwidth is obtained at $i = k - 1$. Therefore the minimum remote repair bandwidth is expressed as

$$\gamma_{R,\min} = f(k - 1) = \frac{2\Omega M_R}{k(2md - mk + m - 2mM_R + 2M_R)}. \tag{4.22}$$

We find two special points that represent the minimum storage and the minimum remote repair bandwidth, respectively. They are the two ends of the optimal tradeoff curve (see Fig. 5.1). It can be verified that the minimum storage point is achieved by the pair

$$(\alpha_{MSR}, \gamma_{R,MSR}) = \left(\frac{\Omega}{k},\ \frac{\Omega M_R}{k(md - mk + m - (m - 1)M_R)}\right), \tag{4.23}$$

and that the minimum remote repair bandwidth point, obtained by substituting $\gamma_R = f(k - 1)$ and $i = k - 1$ into (4.19), is achieved by

$$(\alpha_{MBR}, \gamma_{R,MBR}) = \left(\frac{\Omega(2md - 2mM_R + 2M_R)}{k(2md - mk + m + 2M_R - 2mM_R)},\ \frac{2\Omega M_R}{k(2md - mk + m - 2mM_R + 2M_R)}\right). \tag{4.24}$$

Finally, based on this solution, we can find the optimal tradeoff curve between storage per node and remote repair bandwidth. The result is shown in Fig. 5.1.
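The closed form (4.19) with (4.20) to (4.22) translates directly into code. The following is a hedged sketch: the function names and the parameter dictionary are assumptions made for illustration, and the example uses the (10, 5) code with $d = 9$, $m = 2$, and $M_R = 5$.

```python
def f(i, omega, k, d, m, m_r):
    """Breakpoint f(i) of (4.20)."""
    denom = (i * (2 * m * k - m * i - m)
             + k * (2 * m * d - 2 * m * k + 2 * m - 2 * m_r * (m - 1)))
    return 2 * omega * m_r / denom


def g(i, k, d, m, m_r):
    """Coefficient g(i) of (4.21)."""
    return i * (2 * m * d - 2 * m * k - 2 * (m - 1) * m_r + m * i + m) / (2 * m_r)


def alpha_star(gamma_r, omega, k, d, m, m_r):
    """Minimum storage per node from (4.19); None if gamma_r < f(k-1)."""
    if gamma_r >= f(0, omega, k, d, m, m_r):
        return omega / k
    for i in range(1, k):
        lower = f(i, omega, k, d, m, m_r)
        upper = f(i - 1, omega, k, d, m, m_r)
        if lower <= gamma_r < upper:
            return (omega - g(i, k, d, m, m_r) * gamma_r) / (k - i)
    return None   # below the minimum remote repair bandwidth: infeasible


params = dict(omega=1.0, k=5, d=9, m=2, m_r=5)
gamma_msr = f(0, **params)                    # right end of the curve
gamma_mbr = f(params["k"] - 1, **params)      # gamma_{R,min} of (4.22)
alpha_msr = alpha_star(gamma_msr, **params)
alpha_mbr = alpha_star(gamma_mbr, **params)
print((alpha_msr, gamma_msr), (alpha_mbr, gamma_mbr))
```

Sweeping `gamma_r` between `gamma_mbr` and `gamma_msr` and plotting `alpha_star` would trace one curve of the tradeoff for the assumed parameters.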

4.3 Minimum Storage per Node Under Security Level Requirement

4.3.1 Security Parameter

In this thesis, we define $\lambda$ as the security parameter: the probability that the user's data can be reconstructed by an eavesdropper during the repair process. Its value is between 0 and 1.

For example, consider a bit sequence "10110100" transmitted in the network. An eavesdropper has probability $1/2^8$ of correctly guessing the whole bit sequence. If the four leftmost bits "1011" are eavesdropped, the eavesdropper has probability $1/2^4$ of correctly guessing the remaining bits in the bit sequence, which is much higher than without eavesdropping.

The remote repair bandwidth $M_R(i)\beta_R$ can be eavesdropped, so the eavesdropper can obtain an amount $M_R(i)\beta_R$ of information. The amount of information that the eavesdropper does not know is $\Omega - M_R(i)\beta_R$. Therefore, the probability that an eavesdropper correctly guesses the remaining bits in the bit sequence is

$$\lambda = \frac{1}{2^{\Omega - M_R(i)\beta_R}}.$$

It is also the probability that he/she learns the whole bit sequence.
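The bit-sequence example can be reproduced in a couple of lines. This is a small sketch of the definition of $\lambda$ only; the function name `guess_probability` is a hypothetical label.

```python
def guess_probability(omega_bits, eavesdropped_bits):
    """lambda = 2^-(Omega - eavesdropped amount): chance of guessing the rest."""
    return 2.0 ** -(omega_bits - eavesdropped_bits)


# The 8-bit sequence from the text, without and with 4 eavesdropped bits.
p_blind = guess_probability(8, 0)
p_half_known = guess_probability(8, 4)
print(p_blind, p_half_known)
```

Eavesdropping half of the bits raises the guessing probability by a factor of $2^4$ in this example.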

4.3.2 Security Level Requirement

Next, from the user's point of view, he/she specifies a security level requirement, denoted $\sigma$ in this thesis. It is the rate at which the storage system prevents an eavesdropper from reconstructing the original data when he/she can observe the remote repair bandwidth, and its value is between 0 and 1.

The value 0 means that an eavesdropper can gather the whole data object from the remote repair bandwidth. The value 1 means that the system achieves perfect secrecy, i.e., the probability that an eavesdropper guesses the entire original data object from the remote repair bandwidth is the same as that of a random guess. A higher value of $\sigma$ means a higher security level requirement; in other words, the user asks for a more secure storage service. The security level requirement must be satisfied, and hence the security level provided by the storage system must be higher than or equal to $\sigma$.

To normalize $\sigma$, we divide $1 - \lambda$ by $1 - 2^{-\Omega}$, the value of $1 - \lambda$ when nothing is eavesdropped. So we define $\sigma$ as

$$\sigma = \frac{1 - \lambda}{1 - 2^{-\Omega}}. \tag{4.25}$$

In order to achieve the security level requirement, the probability that an eavesdropper does not correctly guess the original data, after normalization, must be larger than or equal to the user-specified security level requirement, i.e.,

$$\sigma \le \frac{1 - \dfrac{1}{2^{\Omega - M_R(i)\beta_R}}}{1 - 2^{-\Omega}}. \tag{4.26}$$

Then we substitute the definition (4.25) of $\sigma$ into (4.26) and get the upper bound on the remote repair bandwidth under the security level requirement:

$$M_R(i)\beta_R = \gamma_R(i) \le \Omega + \log_2 \lambda, \quad \text{for every } i. \tag{4.27}$$

Recall that $M_R(i)\beta_R$ is the remote repair bandwidth for the $i$-th new-comer, so (4.27) relates the remote repair bandwidth to the security level requirement. Given $\Omega$ and $\sigma$, we can therefore obtain the upper bound on the remote repair bandwidth. We derive the relation between storage per node and security level requirement in the next subsection.
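The two steps above, inverting (4.25) for $\lambda$ and applying the bound (4.27), can be sketched as follows. The function names are illustrative assumptions, and $\Omega = 1$ is used only as an example value.

```python
import math


def lambda_from_sigma(sigma, omega):
    """Invert (4.25): lambda = 1 - sigma * (1 - 2^-Omega)."""
    return 1.0 - sigma * (1.0 - 2.0 ** -omega)


def max_remote_repair_bandwidth(sigma, omega):
    """Upper bound (4.27) on gamma_R: Omega + log2(lambda)."""
    return omega + math.log2(lambda_from_sigma(sigma, omega))


# Example with Omega = 1: no requirement, a strict requirement, perfect secrecy.
bound_none = max_remote_repair_bandwidth(0.0, 1.0)
bound_strict = max_remote_repair_bandwidth(0.9, 1.0)
bound_perfect = max_remote_repair_bandwidth(1.0, 1.0)
print(bound_none, bound_strict, bound_perfect)
```

The allowed remote repair bandwidth shrinks from $\Omega$ at $\sigma = 0$ to 0 at $\sigma = 1$.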

4.3.3 Relation Between Storage per Node and Security Level Requirement

Based on the definitions in the previous subsections, we now derive the relation between storage per node and security level requirement, and then find the minimum storage per node that satisfies the user's security level requirement.

Given the data object size $\Omega$ and the user-specified security level requirement $\sigma$, we can calculate $\lambda$ from (4.25). Next, from (4.27), an eavesdropper can learn at most an amount $\Omega + \log_2 \lambda$ of information by observing the remote repair bandwidth. We define $\tilde{\gamma}_R$ as the maximum remote repair bandwidth and $\tilde{\alpha}^*$ as the minimum storage per node under the security level requirement. Thus we have $\tilde{\gamma}_R = \Omega + \log_2 \lambda$, we can locate the point $(\tilde{\gamma}_R, \tilde{\alpha}^*)$ on the tradeoff curve, and finally we find $\tilde{\alpha}^*$ (the minimum storage under the security level requirement). This is our main result.

Based on (4.19), we have the relation between storage per node and security level requirement:

$$\tilde{\alpha}^* = \begin{cases}
\dfrac{\Omega}{k}, & (\Omega + \log_2 \lambda) \in [f(0), \infty) \\
\dfrac{\Omega - g(i)(\Omega + \log_2 \lambda)}{k - i}, & (\Omega + \log_2 \lambda) \in [f(i), f(i - 1)),
\end{cases} \tag{4.28}$$

where

$$f(i) = \frac{2\Omega M_R}{i(2mk - mi - m) + k(2md - 2mk + 2m - 2M_R(m - 1))}, \tag{4.29}$$

$$g(i) = i\,\frac{2md - 2mk - 2(m - 1)M_R + mi + m}{2M_R}, \tag{4.30}$$

given the data object size $\Omega$ and the security level requirement $\sigma$.
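The main result (4.28) chains three steps: $\sigma \to \lambda$ via (4.25), $\lambda \to \tilde{\gamma}_R$ via (4.27), and $\tilde{\gamma}_R \to \tilde{\alpha}^*$ via (4.29) and (4.30). A minimal sketch under those formulas is given below; the function names and the example parameters ($\Omega = 1$, $k = 5$, $d = 9$, $m = 2$, $M_R = 5$) are assumptions for illustration.

```python
import math


def f(i, omega, k, d, m, m_r):
    """Breakpoint f(i) of (4.29)."""
    denom = (i * (2 * m * k - m * i - m)
             + k * (2 * m * d - 2 * m * k + 2 * m - 2 * m_r * (m - 1)))
    return 2 * omega * m_r / denom


def g(i, k, d, m, m_r):
    """Coefficient g(i) of (4.30)."""
    return i * (2 * m * d - 2 * m * k - 2 * (m - 1) * m_r + m * i + m) / (2 * m_r)


def min_storage_under_security(sigma, omega, k, d, m, m_r):
    """Evaluate (4.28); None if the requirement cannot be met."""
    lam = 1.0 - sigma * (1.0 - 2.0 ** -omega)    # invert (4.25)
    gamma_max = omega + math.log2(lam)           # bound (4.27)
    if gamma_max >= f(0, omega, k, d, m, m_r):
        return omega / k
    for i in range(1, k):
        if f(i, omega, k, d, m, m_r) <= gamma_max < f(i - 1, omega, k, d, m, m_r):
            return (omega - g(i, k, d, m, m_r) * gamma_max) / (k - i)
    return None   # gamma_max is below the minimum remote repair bandwidth


cfg = dict(omega=1.0, k=5, d=9, m=2, m_r=5)
alpha_relaxed = min_storage_under_security(0.5, **cfg)
alpha_strict = min_storage_under_security(0.9, **cfg)
alpha_too_strict = min_storage_under_security(0.99, **cfg)
print(alpha_relaxed, alpha_strict, alpha_too_strict)
```

A stricter requirement lowers the allowed remote repair bandwidth and therefore raises the storage per node, until the requirement can no longer be met.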

4.4 Upper-bound of Amount of Stored Data Under Perfect Secrecy

A user may want to store data under perfect secrecy. We therefore analyze the upper bound on the amount of stored data under perfect secrecy, that is, the maximum amount of data that can be securely stored in the storage system without leaking any information to eavesdroppers. We want to know whether it is possible not to leak any information to eavesdroppers. We use concepts from information theory to analyze the upper bound as below.

Consider a distributed storage system using an $(n, k)$ code with $N_L \ge k$. Let S be a node with $H(S) = R$. For a new-comer node $x^i$, let $D_i$ and $C_i$ be the random variables representing its downloaded data and stored content, respectively. Assume that the storage nodes $x^1, x^2, \ldots, x^k$ have failed consecutively and were replaced during the repair process by the nodes $x^{n+1}, x^{n+2}, \ldots, x^{n+k}$, respectively. Now suppose that an eavesdropper accesses the nodes in $R = \{x^{n+1}, x^{n+2}, \ldots, x^{n+N_R}\}$ while they were being repaired, and consider a data collector connected to the nodes in $B = \{x^{n+1}, x^{n+2}, \ldots, x^{n+k}\}$. The reconstruction property implies $H(S \mid C_B) = 0$, and the perfect secrecy condition implies $H(S \mid D_R) = H(S)$. We can therefore write

min{(ML(i) − i)βL+ MR(i)βR, α} .

(4.31)

Therefore,

$$\sum_{i=N_R+1}^{k} \min\{(M_L(i) - i)\beta_L + M_R(i)\beta_R,\ \alpha\} \tag{4.32}$$

is our upper bound on the amount of stored data under perfect secrecy.

When $\sigma = 1$, we have $\lambda = 2^{-\Omega}$, which means that no remote repair bandwidth is observed by eavesdroppers. However, remote repair bandwidth always exists in our scenario, since we consider the worst case in which the new-comer downloads data from all the storage nodes in the remote datacenter. Therefore, the storage system cannot achieve perfect secrecy (see the gap (red double arrow) in Fig. 4.3).
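The argument above can be traced through the earlier definitions as a short worked step. This sketch only combines (4.25), (4.27), and (4.22), and it assumes the worst-case setting $M_R = N_R \ge 1$ with parameters for which $f(k-1)$ is positive, as is the case for the values used in Chapter 5.

```latex
\begin{align*}
\sigma = 1
  &\;\Rightarrow\; \lambda = 1 - \sigma\,(1 - 2^{-\Omega}) = 2^{-\Omega}
     && \text{by (4.25)} \\
  &\;\Rightarrow\; \gamma_R \le \Omega + \log_2 \lambda = \Omega - \Omega = 0
     && \text{by (4.27)} \\
\text{but}\quad \gamma_R &\ge \gamma_{R,\min} = f(k-1) > 0
     && \text{by (4.22)}
\end{align*}
```

The last two lines contradict each other, so no point on the tradeoff curve satisfies $\sigma = 1$.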

[Figure 4.3: plot with axes Storage per Node and Security Level Requirement, local/remote link capacity ratio m = 2, curves for R/L = 1/9, 2/8, 3/7, 4/6, 5/5.]

CHAPTER 5 Numerical Results and Discussions

In this chapter, we show the optimal tradeoff curves between storage per node and three system parameters (remote repair bandwidth, security level requirement, and data reliability), and we discuss the numerical results. We first define the scenario parameters common to all results. We analyze the optimal tradeoff curve in the inter-datacenter scenario for different numbers of storage nodes in the local and remote datacenters. We use a (10, 5) code and assume there are ten storage nodes in total in the storage system, i.e., n = 10, k = 5. The data object size Ω in our scenario is 1. To discuss the tradeoff curves, we use five different pairs of numbers of local and remote storage nodes: (NL, NR) = (5, 5), (6, 4), (7, 3), (8, 2), and (9, 1). We write R/L for the number of storage nodes in the remote/local datacenter to make the figures easier to read.

5.1 Tradeoff Curve Between Storage per Node and Remote Repair Bandwidth

In this case, we discuss the tradeoff curve between storage per node and remote repair bandwidth. We give the initial parameters of the storage system in Table 5.1. The capacity of a local link is twice that of a remote link, i.e., βL = 2βR, so the local/remote link capacity ratio m is 2.

The tradeoff curve is shown in Fig. 5.1. The different pairs of numbers of local and remote storage nodes correspond to different marker styles and colors. We have the following discussions:

• Most importantly, storage per node and remote repair bandwidth are in a tradeoff relation: storage per node decreases as remote repair bandwidth increases, and the curve is strictly decreasing. The tradeoff curve changes with the numbers of remote and local storage nodes. With more remote storage nodes, the remote repair bandwidth is higher, so the corresponding curve lies in a higher remote repair bandwidth interval.

• The two special points are shown on every curve, at its two ends. The minimum storage points of all the curves are located at the storage value 0.2, where the remote repair bandwidth is at its maximum.

• If we do not differentiate local and remote datacenters, as in the scenario of [11], the storage per node is a constant in our result. That is, it does not change with the remote repair bandwidth.

• We compare different curves under the same remote repair bandwidth. If data is stored in more storage nodes in the remote datacenter, the storage per node is larger. On the other hand, under the same storage per node, storing data in more storage nodes in the remote datacenter causes more remote repair bandwidth and therefore a higher risk of leaking information.

Table 5.1: Simulation Parameters for Tradeoff Curve Between Storage per Node and Remote Repair Bandwidth With Local/Remote Link Capacity Ratio m = 2.

Parameter                                                  Value
Number of total nodes (n)                                  10
Coding parameter (k)                                       5
Original data size (Ω)                                     1
Number of the surviving nodes (d)                          9
Amount of downloaded bits from local datacenters (βL)      βL = m ∗ βR
Amount of downloaded bits from remote datacenters (βR)     1
Local/remote link capacity ratio (m)                       2

Figure 5.1: Optimal tradeoff curve between storage per node and remote repair bandwidth.
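The bandwidth interval covered by each curve can be computed from the two end points of the tradeoff, using (4.22) for the minimum remote repair bandwidth and the second component of (4.23) for the minimum storage point. The sketch below does this for the five (NL, NR) pairs with the common parameters n = 10, k = 5, d = 9, Ω = 1, m = 2; the function and variable names are illustrative assumptions.

```python
OMEGA, N, K, D, M = 1.0, 10, 5, 9, 2


def gamma_msr(m_r):
    """Remote repair bandwidth at the minimum storage point, from (4.23)."""
    return OMEGA * m_r / (K * (M * D - M * K + M - (M - 1) * m_r))


def gamma_mbr(m_r):
    """Minimum remote repair bandwidth, from (4.22)."""
    return 2 * OMEGA * m_r / (K * (2 * M * D - M * K + M - 2 * M * m_r + 2 * m_r))


# One row per R/L pair: (N_R, N_L, left end, right end of the curve).
rows = [(n_r, N - n_r, gamma_mbr(n_r), gamma_msr(n_r)) for n_r in range(1, 6)]
for n_r, n_l, left, right in rows:
    print(f"R/L={n_r}/{n_l}: gamma_R in [{left:.4f}, {right:.4f}]")
```

Both ends of the interval grow with the number of remote nodes, which matches the first observation in the list above.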

5.2 Tradeoff Curve Between Storage per Node and Security Level Requirement

In this case, we discuss the tradeoff curve between storage per node and security level requirement. We give the initial parameters of the storage system in Table 5.2 and Table 5.3. In addition, we compare two local/remote link capacity ratios, m = 2 and m = 2.5, which correspond to different curves.

Table 5.2: Simulation Parameters for Tradeoff Curve Between Storage per Node and Security Level Requirement With Local/Remote Link Capacity Ratio m = 2.

Parameter                                           Value
Local/remote link capacity ratio, case I (m)        2

Table 5.3: Simulation Parameters for Tradeoff Curve Between Storage per Node and Security Level Requirement With Local/Remote Link Capacity Ratio m = 2.5.

Parameter                                           Value
Local/remote link capacity ratio, case II (m)       2.5

The tradeoff curves are shown in Fig. 5.2 and Fig. 5.3, where the values of m are 2 and 2.5, respectively. We have the following discussions:

• Most importantly, storage per node and security level requirement are in a tradeoff relation: storage per node increases as the security level requirement increases, and the curve is strictly increasing. This implies that spending more storage space allows a higher security level requirement to be met. Under the same link capacity ratio, the tradeoff curve changes with the numbers of remote and local storage nodes. With fewer remote storage nodes, the corresponding curve lies in a higher security level requirement interval, because fewer remote storage nodes cause lower remote repair bandwidth and can therefore achieve a higher security level requirement.

• In Figure 5.2, all curves have vertical asymptotes, and the same holds in Figure 5.3. The vertical asymptotes represent the maximum security level that the storage system can achieve, and no curve can exceed them. Because some information is always transmitted via remote repair bandwidth, which can be observed by eavesdroppers, the security level cannot reach 1.

• Comparing different curves under the same link capacity ratio (e.g., Fig. 5.2), achieving the same security level requirement costs more storage space when there are more remote storage nodes. More remote storage nodes cause more remote repair bandwidth, so more storage space is needed to achieve the same security level requirement than with fewer remote storage nodes.

• We compare the curves with different link capacity ratios (see Fig. 5.4). Under the same security level requirement and the same numbers of storage nodes in the local/remote datacenters, a larger link capacity ratio, which means that the local link capacity is much larger than the remote link capacity, gives a lower storage cost: a larger capacity ratio causes lower remote repair bandwidth, so the same security level can be achieved with less storage space. On the other hand, under the same storage cost and the same numbers of storage nodes in the local/remote datacenters, a larger link capacity ratio achieves a higher security level and hence a lower risk of leaking information.

Figure 5.2: Tradeoff between storage per node and security level requirement with local/remote link capacity ratio m = 2.
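One way to locate the vertical asymptotes discussed above is to evaluate the security level at the minimum remote repair bandwidth. This is an interpretation rather than a formula stated in the text: it assumes that the largest achievable σ is obtained by inserting $\gamma_{R,\min}$ from (4.22) into (4.26) with equality. The function names are hypothetical.

```python
OMEGA, N, K, D = 1.0, 10, 5, 9


def gamma_r_min(m_r, m):
    """Minimum remote repair bandwidth, from (4.22)."""
    return 2 * OMEGA * m_r / (K * (2 * m * D - m * K + m - 2 * m * m_r + 2 * m_r))


def sigma_max(m_r, m):
    """Security level (4.26) evaluated with equality at gamma_R = gamma_{R,min}."""
    lam = 2.0 ** -(OMEGA - gamma_r_min(m_r, m))
    return (1.0 - lam) / (1.0 - 2.0 ** -OMEGA)


for n_r in range(1, 6):
    print(f"R/L={n_r}/{N - n_r}: "
          f"m=2 -> {sigma_max(n_r, 2):.4f}, m=2.5 -> {sigma_max(n_r, 2.5):.4f}")
```

Under this reading, the limit stays below 1 for every pair, increases as the number of remote nodes decreases, and is higher for m = 2.5 than for m = 2, in line with the observations in the list above.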

5.3 Tradeoff Between Storage per Node and Data Reliability

In this case, we find and discuss another tradeoff, between storage per node and data reliability. We give the initial parameters of the storage system in Table 5.4. We find this tradeoff relation in the tradeoff curve of storage per node and security level requirement. The local/remote link capacity ratio m is 2.

Figure 5.3: Tradeoff between storage per node and security level requirement with local/remote link capacity ratio m = 2.5.

[Figure 5.4: plot of storage per node against security level requirement for local/remote link capacity ratios m = 2 and m = 2.5, curves for R/L = 1/9, 2/8, 3/7, 4/6, 5/5.]

Figure 5.5 shows the tradeoff relation for m = 2 and Ω = 1. We also have the following discussions:

• If data are stored in a less centralized way (e.g., R/L = 5/5), a larger storage per node is needed to satisfy the security requirement. We use an example (see Fig. 5.5) to illustrate that there is a tradeoff between storage cost and reliability for different numbers of remote and local storage nodes. Given a security level requirement (0.91 in the example), we can easily find two different storage schemes that satisfy it. R/L = 4/6 gives the minimum storage and R/L = 5/5 gives the maximum reliability, since the data stored using the latter scheme can be recovered if all nodes in the local datacenter fail (for example, in a fire), whereas with the former it cannot.

• The tradeoff between storage cost and data reliability under the same security level requirement is also an interesting issue for network coding based distributed storage systems in the inter-datacenter scenario. Data reliability can be analyzed from many different points of view, so we leave it for future discussion.
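The example in the first item can be reproduced numerically. The sketch below solves the constraint (4.18) directly by scanning the sorted terms instead of using the closed form, with the remote repair bandwidth set to its largest allowed value from (4.27). The function name `min_storage` and its defaults (Ω = 1, k = 5, d = 9, m = 2) are assumptions for illustration.

```python
import math


def min_storage(sigma, n_remote, omega=1.0, k=5, d=9, m=2.0):
    """Minimum storage per node meeting security level sigma, or None."""
    lam = 1.0 - sigma * (1.0 - 2.0 ** -omega)    # invert (4.25)
    gamma_r = omega + math.log2(lam)             # largest gamma_R allowed by (4.27)
    # Terms of the constraint (4.18), sorted in increasing order.
    b = sorted(((m * d - m * i) / n_remote - (m - 1)) * gamma_r for i in range(k))
    if sum(b) < omega:
        return None                              # requirement cannot be met
    prefix = 0.0
    for j, b_j in enumerate(b):
        # Try alpha in (b_{j-1}, b_j], where C(alpha) = prefix + (k - j) * alpha.
        alpha = (omega - prefix) / (k - j)
        if alpha <= b_j:
            return alpha
        prefix += b_j
    return None


alpha_4_6 = min_storage(0.91, n_remote=4)   # R/L = 4/6
alpha_5_5 = min_storage(0.91, n_remote=5)   # R/L = 5/5
print(alpha_4_6, alpha_5_5)
```

For the assumed parameters, both schemes meet the requirement 0.91, and R/L = 4/6 needs less storage per node than R/L = 5/5.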

Table 5.4: Simulation Parameters for Tradeoff Curve Between Storage per Node and Data Reliability With Local/Remote Link Capacity Ratio m = 2.

Parameter                                           Value
Local/remote link capacity ratio, case I (m)        2

Figure 5.5: Tradeoff between storage per node and data reliability with local/remote link capacity ratio m = 2.

CHAPTER 6 Conclusions

6.1 Tradeoff Curve Between Storage per Node and Remote Repair Bandwidth

We have investigated the link eavesdropping problem for repairing network coded data from remote distributed storage. We first presented the information flow graph analysis and showed the fundamental tradeoff curve between remote repair bandwidth and storage per node. Then we derived the minimum storage per node for achieving the required security level. Finally, we found that there exists another tradeoff, between storage cost and reliability, for different numbers of remote and local storage nodes in the considered scenario. This work is a first step towards understanding the security of distributed storage with inter-datacenter communication. In the future, we will focus on the optimal allocation problem for such systems with the consideration of storage cost, security, and reliability.

6.2 Future Research

For future research, we provide the following suggestion to extend our work on distributed storage with inter-datacenter communication:

The joint consideration of storage cost, security, and reliability in cloud datacenters is important. How to find the relation between them, make the best allocation choice for users, and optimize revenue for storage system providers will be an important issue in the future.

Bibliography

[1] Q. He, Z. Li, and X. Zhang, “Study on cloud storage system based on distributed storage systems,” IEEE International Conference on Computational and Information Sciences, pp. 1332–1335, Dec. 2010.

[2] S. Rhea, C. Wells, P. Eaton, D. Geels, B. Zhao, H. Weatherspoon, and J. Kubiatowicz, “Maintenance-free global data storage,” IEEE Internet Computing, vol. 5, no. 5, pp. 40–49, Sep. 2001.

[3] R. Bhagwan, K. Tati, Y.-C. Cheng, S. Savage, and G. M. Voelker, “Total recall: System support for automated availability management,” Proceedings of the 1st conference on Symposium on Networked Systems Design and Implementation (NSDI), vol. 1, pp. 25–25, 2004.

[4] F. Dabek, J. Li, E. Sit, J. Robertson, M. F. Kaashoek, and R. Morris, “Designing a dht for low latency and high throughput,” Proceedings of the 1st conference on Symposium on Networked Systems Design and Implementation (NSDI), pp. 85–98, 2004.

[5] A. G. Dimakis, V. Prabhakaran, and K. Ramchandran, “Decentralized erasure codes for distributed networked storage,” IEEE/ACM Transactions on Networking (TON), vol. 14, no. SI, pp. 2809–2816, 2006.

[6] Y. Singh, F. Kandah, and W. Zhang, “A secured cost-effective multi-cloud storage in cloud computing,” IEEE Conference on Computer Communications Workshops, pp. 619–624, 2011.

[7] R. Bhagwan, D. Moore, S. Savage, and G. M. Voelker, “Replication strategies for highly available peer-to-peer storage,” Future directions in distributed computing, pp. 153–158, 2003.

[8] R. Rodrigues and B. Liskov, “High availability in dhts: Erasure coding vs. replication,” Peer-to-Peer Systems IV 4th International Workshop IPTPS, pp. 226–239, 2005.

[9] H. Weatherspoon and J. Kubiatowicz, “Erasure coding vs. replication: A quantitative comparison,” Peer-to-Peer Systems 1st International Workshop (IPTPS), pp. 328–337, 2002.

[10] Y. Hu, C.-M. Yu, Y. K. Li, P. P. Lee, and J. C. Lui, “On the practicality and extensibility of a network-coding-based distributed file system,” IEEE International Symposium on Network Coding (NetCod), pp. 1–6, 2011.

[11] A. G. Dimakis, P. B. Godfrey, Y. Wu, M. J. Wainwright, and K. Ramchandran, “Network coding for distributed storage systems,” IEEE Transactions on Information Theory, vol. 56, no. 9, pp. 4539–4551, 2010.

[12] A. G. Dimakis, K. Ramchandran, Y. Wu, and C. Suh, “A survey on network codes for distributed storage,” Proceedings of the IEEE, vol. 99, no. 3, pp. 476–489, 2011.

[13] Q. Yu, K. W. Shum, and C. W. Sung, “Minimization of storage cost in distributed storage systems with repair consideration,” Global Telecommunications Conference (GLOBECOM), pp. 1–5, 2011.

[14] S. Pawar, S. El Rouayheb, and K. Ramchandran, “On secure distributed data storage under repair dynamics,” IEEE International Symposium on Information Theory Proceedings (ISIT), pp. 2543–2547, 2010.

[15] N. B. Shah, K. Rashmi, and P. V. Kumar, “Information-theoretically secure regenerating codes for distributed storage,” IEEE Global Telecommunications Conference, pp. 1–5, 2011.

[16] S. Pawar, S. El Rouayheb, and K. Ramchandran, “Securing dynamic distributed storage systems from malicious nodes,” IEEE International Symposium on Information Theory Proceedings, pp. 1452–1456, 2011.

[17] P. F. Oliveira, L. Lima, T. T. Vinhoza, J. Barros, and M. Médard, “Trusted storage over untrusted networks,” IEEE Global Telecommunications Conference, pp. 1–5, 2010.

[18] N. Shah, K. Rashmi, P. Kumar, and K. Ramchandran, “Regenerating codes for distributed storage networks,” Arithmetic of Finite Fields 3rd International Workshop (WAIFI), pp. 215–223, 2010.

[19] R. Ahlswede, N. Cai, S.-Y. Li, and R. W. Yeung, “Network information flow,” IEEE Transactions on Information Theory, vol. 46, no. 4, pp. 1204–1216, 2000.

[20] R. Koetter and M. Médard, “An algebraic approach to network coding,” IEEE/ACM Transactions on Networking, vol. 11, no. 5, pp. 782–795, 2003.

[21] W. Qiao, J. Li, and J. Ren, “An efficient error-detection and error-correction (edec) scheme for network coding,” IEEE Global Telecommunications Conference, pp. 1–5, 2011.
